levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Improve mzMLb read performance with array cache #38

Closed mobiusklein closed 3 years ago

mobiusklein commented 3 years ago

Sure enough, the feature was merged and then I started doing more serious performance testing. It appears that h5py, for simplicity's sake, does not seriously cache decompressed chunks in memory, so every slice read from a dataset decompresses a whole chunk and then takes the requested slice from it, every time. This makes sequential reading of an mzMLb file orders of magnitude slower than reading the equivalent compressed mzML.

This PR adds an intermediate caching layer to the MzMLb object that keeps a block of each array in memory, on the assumption that the next read will fall in the same region. With this change, reading is about twice as fast as reading the equivalent mzML file.
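A minimal sketch of the idea, using a hypothetical `ChunkedArrayCache` wrapper and a counting stand-in for the h5py dataset (names are illustrative, not the actual PR code):

```python
import numpy as np


class CountingDataset:
    """Stand-in for an h5py dataset: every slice counts as one 'file' read."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def __len__(self):
        return len(self.data)

    def __getitem__(self, sl):
        self.reads += 1
        return self.data[sl]


class ChunkedArrayCache:
    """Keep one contiguous block of the dataset in memory, assuming the
    next read falls in the same region (sequential access)."""

    def __init__(self, dataset, block_size=2 ** 16):
        self.dataset = dataset
        self.block_size = block_size
        self.start = self.end = 0
        self.block = None

    def __getitem__(self, sl):
        start = sl.start or 0
        stop = len(self.dataset) if sl.stop is None else sl.stop
        if self.block is None or start < self.start or stop > self.end:
            # Cache miss: pull a whole block from the backing dataset.
            self.start = start
            self.end = min(len(self.dataset), max(stop, start + self.block_size))
            self.block = self.dataset[self.start:self.end]
        return self.block[start - self.start:stop - self.start]
```

Sequential slice reads then hit the backing dataset once per block instead of once per spectrum.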

levitsky commented 3 years ago

In a very quick test with a file converted from mzML to mzMLb, the updated parser is almost 2x faster than without caching, yet the original mzML file is parsed about 5x faster. I wonder what the difference is.

mobiusklein commented 3 years ago

There's a difference in compression levels between test.mzML and test.mzMLb. All the buffers in test.mzMLb are gzip compressed. There's likely still some inefficiency. I will need to carry out a more thorough benchmark with large data files, larger than what would go in a unit test anyway.

levitsky commented 3 years ago

To clarify, I didn't test with test.mzML and test.mzMLb, but with some arbitrary mzML files of ~100 MB and ~1 GB I had lying around. I thought the gap might be due to compression differences, but compression (at least zlib) doesn't seem to affect mzML parsing performance all that much. Anyway, you're right that more benchmarking is needed, if it even makes sense to compare speeds against mzML parsing here. Meanwhile, I see no reason for this PR to keep waiting; I'll be happy to merge it as is.

mobiusklein commented 3 years ago

Sounds good to me then. I'll get back to you on benchmarks soon.

mobiusklein commented 3 years ago

Since I saw this wasn't merged yet, I thought I'd add another optimization.

I found that, even after reducing the overhead of reading chunks of the numerical data arrays, I could further cut the number of direct HDF5 reads by buffering the XML byte stream, which suffered from the same repeated __getitem__ cost as the array slicing did. This way, a sequential read-through hits the HDF5 file only a few times in total instead of several times per iteration.
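The same idea applied to the XML stream can be sketched as a buffered file-like wrapper over the byte dataset (a hypothetical `BufferedDatasetReader`, not the actual PR code):

```python
import io


class BufferedDatasetReader(io.RawIOBase):
    """File-like view over a byte dataset that fetches large blocks from
    the backing store and serves small read() calls from memory."""

    def __init__(self, dataset, block_size=2 ** 20):
        self.dataset = dataset
        self.block_size = block_size
        self.pos = 0
        self.buf = b""
        self.buf_start = 0

    def readable(self):
        return True

    def read(self, n=-1):
        if n < 0:
            n = len(self.dataset) - self.pos
        out = bytearray()
        while n > 0 and self.pos < len(self.dataset):
            offset = self.pos - self.buf_start
            if not 0 <= offset < len(self.buf):
                # Buffer miss: fetch the next block from the dataset.
                self.buf_start = self.pos
                end = min(len(self.dataset), self.pos + self.block_size)
                self.buf = bytes(self.dataset[self.buf_start:end])
                offset = 0
            piece = self.buf[offset:offset + n]
            out += piece
            self.pos += len(piece)
            n -= len(piece)
        return bytes(out)
```

An XML parser can then consume the stream with ordinary small read() calls while the HDF5 dataset is touched only once per block.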

I think the h5py developers have been planning better caching for a long time, but keep going back and forth between implementing it themselves and rebasing on PyTables.

mobiusklein commented 3 years ago

Benchmarked on a full .raw file I had lying around:

mzMLb with blosc compression (1 GB)

In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzMLb")
In [3]: %%prun
   ...: for scan in reader2:
   ...:     pass
   ...:
         34527800 function calls (33607805 primitive calls) in 30.914 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
436268/41270    7.102    0.000   24.059    0.001 xml.py:403(_get_info)
     6182    4.516    0.001    4.865    0.001 dataset.py:688(__getitem__)
    41271    4.324    0.000   30.831    0.001 xml.py:528(_iterfind_impl)
  1190912    4.258    0.000    7.041    0.000 xml.py:352(_handle_param)
  6508886    2.449    0.000    3.267    0.000 xml.py:51(_local_name)
436268/41270    1.179    0.000   24.241    0.001 mzml.py:268(_get_info_smart)
  3219488    1.043    0.000    1.043    0.000 {built-in method __new__ of type object at 0x00007FF9AB5BC810}
  6516852    0.820    0.000    0.820    0.000 {method 'rpartition' of 'str' objects}
  1190912    0.472    0.000    0.472    0.000 structures.py:345(__new__)
  1190912    0.375    0.000    0.969    0.000 structures.py:288(__new__)
    82540    0.344    0.000    6.222    0.000 mzmlb.py:226(_handle_binary)
  1190912    0.311    0.000    0.311    0.000 xml.py:379(_insert_param)
        1    0.287    0.287   31.185   31.185 <string>:1(<module>)
    82540    0.228    0.000    1.162    0.000 mzml.py:235(_handle_binary)
    82540    0.218    0.000    0.218    0.000 {built-in method numpy.array}
    82540    0.196    0.000    0.310    0.000 mzml.py:116(_detect_array_name)
  2305750    0.194    0.000    0.194    0.000 {built-in method builtins.isinstance}
    82540    0.187    0.000    0.195    0.000 mzml.py:210(_determine_compression)
  1559546    0.167    0.000    0.167    0.000 {method 'get' of 'dict' objects}
   540438    0.160    0.000    0.296    0.000 structures.py:313(__new__)
    82540    0.145    0.000    0.153    0.000 mzml.py:190(_determine_array_dtype)
   296746    0.142    0.000    0.372    0.000 xml.py:151(str_to_num)
  1446388    0.137    0.000    0.137    0.000 {method 'append' of 'list' objects}
   296746    0.125    0.000    0.230    0.000 structures.py:268(__new__)
  1391790    0.119    0.000    0.119    0.000 {method 'items' of 'dict' objects}
     6182    0.118    0.000    0.119    0.000 dataset.py:466(_fast_reader)
   296746    0.118    0.000    0.490    0.000 xml.py:157(convert_from)

mzML uncompressed (1.5 GB)

In [4]: reader = mzml.MzML("./AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzML")
In [5]: %%prun
   ...: for scan in reader:
   ...:     pass
   ...:
         25450028 function calls (24494951 primitive calls) in 37.147 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
436268/41270    8.323    0.000   28.340    0.001 xml.py:403(_get_info)
    41270    6.870    0.000    7.496    0.000 xml.py:622(_find_by_id_no_reset)
    82540    5.872    0.000    5.872    0.000 {built-in method zlib.decompress}
   984561    3.647    0.000    5.830    0.000 xml.py:352(_handle_param)
436268/41270    2.487    0.000   28.558    0.001 mzml.py:268(_get_info_smart)
    82540    2.266    0.000    2.266    0.000 {built-in method binascii.a2b_base64}
  2765038    1.020    0.000    1.020    0.000 {built-in method __new__ of type object at 0x00007FF9AB5BC810}
  2841658    0.640    0.000    0.640    0.000 xml.py:51(_local_name)
   984561    0.391    0.000    0.391    0.000 structures.py:345(__new__)
   984561    0.313    0.000    0.885    0.000 structures.py:288(__new__)
    82540    0.290    0.000    9.004    0.000 utils.py:170(decode_data_array)
   984561    0.269    0.000    0.269    0.000 xml.py:379(_insert_param)
    82540    0.245    0.000    9.923    0.000 mzml.py:235(_handle_binary)
   984561    0.225    0.000    0.409    0.000 <string>:1(__new__)
    41270    0.223    0.000    0.443    0.000 ntpath.py:450(normpath)
  2423094    0.221    0.000    0.221    0.000 {built-in method builtins.isinstance}
    82540    0.186    0.000    0.194    0.000 mzml.py:210(_determine_compression)
    82540    0.173    0.000    0.266    0.000 mzml.py:116(_detect_array_name)
  1515154    0.173    0.000    0.173    0.000 {method 'get' of 'dict' objects}
    41270    0.171    0.000   36.872    0.001 file_helpers.py:78(wrapped)
   499168    0.160    0.000    0.316    0.000 structures.py:313(__new__)
    41270    0.156    0.000   36.367    0.001 xml.py:1107(get_by_id)
   123810    0.151    0.000    0.151    0.000 {method 'seek' of '_io.BufferedReader' objects}
   296746    0.149    0.000    0.386    0.000 xml.py:151(str_to_num)
    82540    0.147    0.000    0.154    0.000 mzml.py:190(_determine_array_dtype)
mobiusklein commented 3 years ago

And to try out a stronger compressor: mzMLb with blosc:zstd, 700 MB on disk. That's smaller than the mzML file compressed with gzip (970 MB).

In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.zstd.mzMLb")
In [3]: %%prun
   ...: for scan in reader2:
   ...:     pass
   ...:
         34527932 function calls (33607934 primitive calls) in 33.959 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     6185    7.568    0.001    7.940    0.001 dataset.py:688(__getitem__)
436268/41270    7.047    0.000   27.003    0.001 xml.py:403(_get_info)
    41271    4.382    0.000   33.879    0.001 xml.py:528(_iterfind_impl)
  1190912    4.264    0.000    7.007    0.000 xml.py:352(_handle_param)
  6508886    2.438    0.000    3.261    0.000 xml.py:51(_local_name)
436268/41270    1.185    0.000   27.183    0.001 mzml.py:268(_get_info_smart)
  3219488    1.040    0.000    1.040    0.000 {built-in method __new__ of type object at 0x00007FF9AB3BC810}
  6516855    0.826    0.000    0.826    0.000 {method 'rpartition' of 'str' objects}
  1190912    0.440    0.000    0.440    0.000 structures.py:345(__new__)
  1190912    0.358    0.000    0.961    0.000 structures.py:288(__new__)
    82540    0.344    0.000    9.253    0.000 mzmlb.py:226(_handle_binary)
  1190912    0.305    0.000    0.305    0.000 xml.py:379(_insert_param)
        1    0.285    0.285   34.232   34.232 <string>:1(<module>)
    82540    0.223    0.000    1.152    0.000 mzml.py:235(_handle_binary)
    82540    0.223    0.000    0.223    0.000 {built-in method numpy.array}
    82540    0.201    0.000    0.313    0.000 mzml.py:116(_detect_array_name)
  2305771    0.194    0.000    0.194    0.000 {built-in method builtins.isinstance}
    82540    0.182    0.000    0.190    0.000 mzml.py:210(_determine_compression)
  1559546    0.165    0.000    0.165    0.000 {method 'get' of 'dict' objects}
   540438    0.163    0.000    0.298    0.000 structures.py:313(__new__)
   296746    0.147    0.000    0.360    0.000 xml.py:151(str_to_num)
    82540    0.138    0.000    0.146    0.000 mzml.py:190(_determine_array_dtype)
  1446388    0.136    0.000    0.136    0.000 {method 'append' of 'list' objects}
   296746    0.132    0.000    0.491    0.000 xml.py:157(convert_from)
     6185    0.129    0.000    0.130    0.000 dataset.py:466(_fast_reader)
  1391790    0.120    0.000    0.120    0.000 {method 'items' of 'dict' objects}
   296746    0.119    0.000    0.213    0.000 structures.py:268(__new__)
   996346    0.118    0.000    0.118    0.000 {method 'pop' of 'dict' objects}
    82540    0.090    0.000    7.642    0.000 mzmlb.py:325(get)
   495240    0.077    0.000    0.077    0.000 {method 'endswith' of 'str' objects}
levitsky commented 3 years ago

Nice! Now I see parsing times faster than for mzML, as well. Thank you!