Closed: mobiusklein closed this 3 years ago
In a very quick test with a file converted from mzML to mzMLb, the updated parser is almost 2x faster than without caching, yet the original mzML file is parsed about 5x faster. I wonder what the difference is.
There's a difference in compression levels between `test.mzML` and `test.mzMLb`. All the buffers in `test.mzMLb` are gzip compressed. There's likely still some inefficiency. I will need to carry out a more thorough benchmark with large data files, larger than what would go in a unit test anyway.
To clarify, I didn't test with `test.mzML` and `test.mzMLb`, but rather with some arbitrary mzML files of ~100 MB and ~1 GB I had lying around. I thought the gap might be due to compression differences, but I don't think compression affects mzML parsing performance all that much, at least not zlib. In any case, you're right that more benchmarking is needed, if it even makes sense to compare against mzML parsing speed here. Meanwhile, I don't see why this PR should keep waiting; I'll be happy to merge it as is.
Sounds good to me then. I'll get back to you on benchmarks soon.
Since I saw this wasn't merged yet, I thought I'd add another optimization.
I found that even after reducing the overhead of reading chunks of the numerical data arrays, I could further cut the number of direct reads from the HDF5 file by adding buffering to the XML byte stream, which suffered from the same repeated `__getitem__` cost as the array slicing did. This way, we only hit the HDF5 file a few times during a sequential read-through instead of multiple times per iteration.
I think `h5py` has been planning to implement better caching for a long time, but is thrashing between implementing it themselves vs. rebasing themselves off PyTables.
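The buffering idea can be sketched roughly like this (a minimal illustration in plain Python; the class and attribute names are hypothetical, not the actual pyteomics or h5py API). A slab buffer sits in front of a backing store whose `__getitem__` is expensive, so sequential small reads hit the store once per slab instead of once per call:

```python
class CountingStore:
    """Stands in for an h5py dataset: every slice counts as one chunk read."""

    def __init__(self, data: bytes):
        self._data = data
        self.reads = 0

    def __getitem__(self, sl):
        self.reads += 1
        return self._data[sl]


class BufferedReader:
    """Serve small sequential reads out of a larger in-memory slab."""

    def __init__(self, store, slab_size=1 << 16):
        self.store = store
        self.slab_size = slab_size
        self.offset = 0       # current read position
        self.slab_start = 0   # position of the cached slab
        self.slab = b""

    def read(self, n):
        end = self.offset + n
        # Refill the slab only when the request falls outside of it
        if not (self.slab_start <= self.offset
                and end <= self.slab_start + len(self.slab)):
            self.slab_start = self.offset
            self.slab = self.store[self.offset:self.offset + max(n, self.slab_size)]
        rel = self.offset - self.slab_start
        out = self.slab[rel:rel + n]
        self.offset += len(out)
        return out


store = CountingStore(b"x" * 1_000_000)
reader = BufferedReader(store)
while reader.read(100):
    pass
# A handful of slab reads instead of 10,000 per-call reads
print(store.reads)
```

With 100-byte reads and a 64 KiB slab, the backing store is touched a few dozen times rather than ten thousand, which is the same effect the buffered XML byte stream has on the HDF5 file.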
Benchmarked on a full .raw file I had lying around:
mzMLb with `blosc` compression (1 GB)
```
In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzMLb")

In [3]: %%prun
   ...: for scan in reader2:
   ...:     pass
   ...:

         34527800 function calls (33607805 primitive calls) in 30.914 seconds

   Ordered by: internal time

   ncalls        tottime  percall  cumtime  percall  filename:lineno(function)
   436268/41270    7.102    0.000   24.059    0.001  xml.py:403(_get_info)
   6182            4.516    0.001    4.865    0.001  dataset.py:688(__getitem__)
   41271           4.324    0.000   30.831    0.001  xml.py:528(_iterfind_impl)
   1190912         4.258    0.000    7.041    0.000  xml.py:352(_handle_param)
   6508886         2.449    0.000    3.267    0.000  xml.py:51(_local_name)
   436268/41270    1.179    0.000   24.241    0.001  mzml.py:268(_get_info_smart)
   3219488         1.043    0.000    1.043    0.000  {built-in method __new__ of type object at 0x00007FF9AB5BC810}
   6516852         0.820    0.000    0.820    0.000  {method 'rpartition' of 'str' objects}
   1190912         0.472    0.000    0.472    0.000  structures.py:345(__new__)
   1190912         0.375    0.000    0.969    0.000  structures.py:288(__new__)
   82540           0.344    0.000    6.222    0.000  mzmlb.py:226(_handle_binary)
   1190912         0.311    0.000    0.311    0.000  xml.py:379(_insert_param)
   1               0.287    0.287   31.185   31.185  <string>:1(<module>)
   82540           0.228    0.000    1.162    0.000  mzml.py:235(_handle_binary)
   82540           0.218    0.000    0.218    0.000  {built-in method numpy.array}
   82540           0.196    0.000    0.310    0.000  mzml.py:116(_detect_array_name)
   2305750         0.194    0.000    0.194    0.000  {built-in method builtins.isinstance}
   82540           0.187    0.000    0.195    0.000  mzml.py:210(_determine_compression)
   1559546         0.167    0.000    0.167    0.000  {method 'get' of 'dict' objects}
   540438          0.160    0.000    0.296    0.000  structures.py:313(__new__)
   82540           0.145    0.000    0.153    0.000  mzml.py:190(_determine_array_dtype)
   296746          0.142    0.000    0.372    0.000  xml.py:151(str_to_num)
   1446388         0.137    0.000    0.137    0.000  {method 'append' of 'list' objects}
   296746          0.125    0.000    0.230    0.000  structures.py:268(__new__)
   1391790         0.119    0.000    0.119    0.000  {method 'items' of 'dict' objects}
   6182            0.118    0.000    0.119    0.000  dataset.py:466(_fast_reader)
   296746          0.118    0.000    0.490    0.000  xml.py:157(convert_from)
```
mzML uncompressed (1.5 GB)
```
In [4]: reader = mzml.MzML("./AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.mzML")

In [5]: %%prun
   ...: for scan in reader:
   ...:     pass
   ...:

         25450028 function calls (24494951 primitive calls) in 37.147 seconds

   Ordered by: internal time

   ncalls        tottime  percall  cumtime  percall  filename:lineno(function)
   436268/41270    8.323    0.000   28.340    0.001  xml.py:403(_get_info)
   41270           6.870    0.000    7.496    0.000  xml.py:622(_find_by_id_no_reset)
   82540           5.872    0.000    5.872    0.000  {built-in method zlib.decompress}
   984561          3.647    0.000    5.830    0.000  xml.py:352(_handle_param)
   436268/41270    2.487    0.000   28.558    0.001  mzml.py:268(_get_info_smart)
   82540           2.266    0.000    2.266    0.000  {built-in method binascii.a2b_base64}
   2765038         1.020    0.000    1.020    0.000  {built-in method __new__ of type object at 0x00007FF9AB5BC810}
   2841658         0.640    0.000    0.640    0.000  xml.py:51(_local_name)
   984561          0.391    0.000    0.391    0.000  structures.py:345(__new__)
   984561          0.313    0.000    0.885    0.000  structures.py:288(__new__)
   82540           0.290    0.000    9.004    0.000  utils.py:170(decode_data_array)
   984561          0.269    0.000    0.269    0.000  xml.py:379(_insert_param)
   82540           0.245    0.000    9.923    0.000  mzml.py:235(_handle_binary)
   984561          0.225    0.000    0.409    0.000  <string>:1(__new__)
   41270           0.223    0.000    0.443    0.000  ntpath.py:450(normpath)
   2423094         0.221    0.000    0.221    0.000  {built-in method builtins.isinstance}
   82540           0.186    0.000    0.194    0.000  mzml.py:210(_determine_compression)
   82540           0.173    0.000    0.266    0.000  mzml.py:116(_detect_array_name)
   1515154         0.173    0.000    0.173    0.000  {method 'get' of 'dict' objects}
   41270           0.171    0.000   36.872    0.001  file_helpers.py:78(wrapped)
   499168          0.160    0.000    0.316    0.000  structures.py:313(__new__)
   41270           0.156    0.000   36.367    0.001  xml.py:1107(get_by_id)
   123810          0.151    0.000    0.151    0.000  {method 'seek' of '_io.BufferedReader' objects}
   296746          0.149    0.000    0.386    0.000  xml.py:151(str_to_num)
   82540           0.147    0.000    0.154    0.000  mzml.py:190(_determine_array_dtype)
```
And to try out a better compressor, mzMLb with `blosc:zstd`: 700 MB on disk. That's less than the mzML file compressed with `gzip` (970 MB).
```
In [2]: reader2 = mzmlb.MzMLb("AGP_tryptic_300ng_2microscans_glycoproteomics_nCE_27-30.zstd.mzMLb")

In [3]: %%prun
   ...: for scan in reader2:
   ...:     pass
   ...:

         34527932 function calls (33607934 primitive calls) in 33.959 seconds

   Ordered by: internal time

   ncalls        tottime  percall  cumtime  percall  filename:lineno(function)
   6185            7.568    0.001    7.940    0.001  dataset.py:688(__getitem__)
   436268/41270    7.047    0.000   27.003    0.001  xml.py:403(_get_info)
   41271           4.382    0.000   33.879    0.001  xml.py:528(_iterfind_impl)
   1190912         4.264    0.000    7.007    0.000  xml.py:352(_handle_param)
   6508886         2.438    0.000    3.261    0.000  xml.py:51(_local_name)
   436268/41270    1.185    0.000   27.183    0.001  mzml.py:268(_get_info_smart)
   3219488         1.040    0.000    1.040    0.000  {built-in method __new__ of type object at 0x00007FF9AB3BC810}
   6516855         0.826    0.000    0.826    0.000  {method 'rpartition' of 'str' objects}
   1190912         0.440    0.000    0.440    0.000  structures.py:345(__new__)
   1190912         0.358    0.000    0.961    0.000  structures.py:288(__new__)
   82540           0.344    0.000    9.253    0.000  mzmlb.py:226(_handle_binary)
   1190912         0.305    0.000    0.305    0.000  xml.py:379(_insert_param)
   1               0.285    0.285   34.232   34.232  <string>:1(<module>)
   82540           0.223    0.000    1.152    0.000  mzml.py:235(_handle_binary)
   82540           0.223    0.000    0.223    0.000  {built-in method numpy.array}
   82540           0.201    0.000    0.313    0.000  mzml.py:116(_detect_array_name)
   2305771         0.194    0.000    0.194    0.000  {built-in method builtins.isinstance}
   82540           0.182    0.000    0.190    0.000  mzml.py:210(_determine_compression)
   1559546         0.165    0.000    0.165    0.000  {method 'get' of 'dict' objects}
   540438          0.163    0.000    0.298    0.000  structures.py:313(__new__)
   296746          0.147    0.000    0.360    0.000  xml.py:151(str_to_num)
   82540           0.138    0.000    0.146    0.000  mzml.py:190(_determine_array_dtype)
   1446388         0.136    0.000    0.136    0.000  {method 'append' of 'list' objects}
   296746          0.132    0.000    0.491    0.000  xml.py:157(convert_from)
   6185            0.129    0.000    0.130    0.000  dataset.py:466(_fast_reader)
   1391790         0.120    0.000    0.120    0.000  {method 'items' of 'dict' objects}
   296746          0.119    0.000    0.213    0.000  structures.py:268(__new__)
   996346          0.118    0.000    0.118    0.000  {method 'pop' of 'dict' objects}
   82540           0.090    0.000    7.642    0.000  mzmlb.py:325(get)
   495240          0.077    0.000    0.077    0.000  {method 'endswith' of 'str' objects}
```
Nice! Now I see parsing times faster than for mzML, as well. Thank you!
Sure enough, the feature was merged and then I started doing more serious performance testing. It appears that `h5py` does not attempt serious in-memory caching of decompressed chunks, for simplicity, so each read of a slice of a dataset decompresses a whole chunk and then takes the requested slice, every time, making sequential reading of the mzMLb file orders of magnitude slower than reading the equivalent compressed mzML.

This feature adds an extra intermediate caching layer to the `MzMLb` object, which caches a chunk of each array in memory under the assumption that the next read will be in the same region. This boosts performance to twice as fast as reading the equivalent mzML file.
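The caching layer described above can be illustrated with a self-contained toy model (a sketch only; `ChunkedDataset` and `CachedReader` are hypothetical names standing in for the h5py dataset and the `MzMLb` cache, with zlib substituting for whatever filter the file actually uses). Without the cache, every slice read decompresses a whole chunk; with it, a sequential scan decompresses each chunk exactly once:

```python
import zlib


class ChunkedDataset:
    """Toy model of a compressed chunked dataset: every chunk access
    pays a full decompression, like h5py without a useful cache."""

    def __init__(self, data: bytes, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = [zlib.compress(data[i:i + chunk_size])
                       for i in range(0, len(data), chunk_size)]
        self.decompressions = 0

    def read_chunk(self, idx):
        self.decompressions += 1
        return zlib.decompress(self.chunks[idx])


class CachedReader:
    """Cache the most recently decompressed chunk, assuming the next
    read falls in the same region (sequential access pattern)."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.idx = None   # index of the cached chunk
        self.chunk = b""

    def read(self, start, n):
        out = bytearray()
        cs = self.dataset.chunk_size
        pos = start
        while n > 0:
            idx, rel = divmod(pos, cs)
            if idx != self.idx:
                if idx >= len(self.dataset.chunks):
                    break
                self.idx = idx
                self.chunk = self.dataset.read_chunk(idx)
            piece = self.chunk[rel:rel + n]
            if not piece:
                break
            out += piece
            pos += len(piece)
            n -= len(piece)
        return bytes(out)


ds = ChunkedDataset(b"spectrum" * 10000)  # 80000 bytes -> 20 chunks
rd = CachedReader(ds)
pos = 0
while True:
    b = rd.read(pos, 100)
    if not b:
        break
    pos += len(b)
print(ds.decompressions)  # 20: one decompression per chunk, not per read
```

Without the cache, the same scan would pay 800 decompressions (one per 100-byte read), which is the orders-of-magnitude gap observed before this change.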