levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

mzML vs mzMLb vs mzXML #40

Open sorenwacker opened 3 years ago

sorenwacker commented 3 years ago

Hi,

I made a quick benchmark with some metabolomics files and measured the read speed of various formats. For some reason I get the worst performance with the mzMLb format. I wonder if you have ever run a benchmark comparing the read speed in pyteomics with other programs.

[image: read-speed-ms-file-formats]

The orange bar is the bare read into memory, without any conversion into a structured dataframe. mzML and mzXML were read with pyteomics, while I used pymzml for _mzML_format.

mobiusklein commented 3 years ago

When I was developing the mzMLb reader, we did do some benchmarking to show it was actually advantageous over mzML beyond the size reduction. It turned out to be sensitive to access patterns and the HDF5 compressor used. Are you able to share the files you benchmarked? I'd like to be able to look into where the slow-down you're observing is.

Were you sequentially reading scans from the file? The mzMLb reader does a small amount of optimization for sequential reading, more as a consequence of HDF5's chunked storage, but this saves additional work like converting these chunks to Python objects.

If you used the psims converter, did you also install hdf5plugin? One of the major advantages of mzMLb is that it opens the way for much faster compressors like blosc and blosc:zstd that are not part of the h5py/hdf5 library itself the way zlib is, but can be installed as plugins. zlib has excellent compression ratios on a broad variety of data types, but it is quite slow. These newer compressors have been shown to be much faster without sacrificing much in terms of compression, particularly on numerical data. There's a pretty good chance that blosc will be required for the future.

mzXML has much less metadata to decode, so it isn't surprising that it might be faster.

sorenwacker commented 3 years ago

The files were downloaded from https://www.ebi.ac.uk/metabolights/MTBLS1569/descriptors. And the converted files can be downloaded from https://soerendip.com/dl/MTBLS1569/

I used the 12 files starting with T for the test.

sorenwacker commented 3 years ago

I installed some hdf5 library to make it work. I had opened a GitHub issue about it, but I can't find it, and I am not sure whether it was related to pyteomics or psims.

mobiusklein commented 3 years ago

First off, you're right, initial reading of the mzMLb file was way too slow on a real file. It turns out bytearray(h5py.Dataset) is way slower than bytearray(numpy.array(h5py.Dataset)), for reasons I may try to figure out later. Thank you for reporting that.

Also, thank you for hosting the files you converted and tested. I downloaded each version of Tx_1h_R1 for mzML, mzXML, and mzMLb.

I took a look at the mzMLb file first. Either you didn't have hdf5plugin installed, or psims didn't automatically pick the right compression library for you. It looks like this used gzip compression, which would explain part of the slowdown:

In [1]: import h5py
In [2]: handle = h5py.File("Tx_1h_R1.mzMLb")
In [3]: arr = handle['spectrum_MS_1000514_float64']
In [4]: arr.compression
Out[4]: 'gzip'

I re-converted it from the mzML file with blosc compression instead, denoting it Tx_1h_R1.blosc.mzMLb. Access is around 4x faster. I also added an mzML file with zlib-compressed binary data arrays, and a gzipped version of that.

I compared reading each format from start to end and compared randomly retrieving spectra between a subset of configurations:

[two images]
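A minimal sketch of how such a sequential versus random-access comparison could be timed. This is not the script behind the plots above. The function names are hypothetical, and the reader is assumed to be any object that supports iteration, `len()`, and integer indexing; a plain list stands in for a spectrum reader here.

```python
import random
import time


def time_sequential(reader):
    """Iterate over every item in order; return (item count, elapsed seconds)."""
    start = time.perf_counter()
    n_items = 0
    for _ in reader:
        n_items += 1
    return n_items, time.perf_counter() - start


def time_random(reader, n_reads, seed=0):
    """Fetch n_reads items at random positions; return elapsed seconds."""
    rng = random.Random(seed)  # fixed seed so repeated runs hit the same indices
    # Draw the indices before starting the clock so only the reads are timed.
    indices = [rng.randrange(len(reader)) for _ in range(n_reads)]
    start = time.perf_counter()
    for index in indices:
        reader[index]
    return time.perf_counter() - start


# Stand-in for a real reader (assumption: real readers behave like a sequence).
fake_reader = list(range(1000))
n_items, seq_seconds = time_sequential(fake_reader)
rand_seconds = time_random(fake_reader, n_reads=100)
print(f"sequential: {n_items} items in {seq_seconds:.6f} s")
print(f"random: 100 reads in {rand_seconds:.6f} s")
```

Running each measurement several times and keeping the minimum or median would reduce noise from disk caching.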

The sizes of these files in megabytes show that (after I close that branch out, apparently) you get the best trade-off between random access time and file size using mzMLb. After that optimization, you also get sequential access performance comparable to mzML.

[image]

My measurements on mzXML suggest its speed comes from instantiating fewer objects, especially fancy scalars, per spectrum:

mzXML:

{str: 20,
 bool: 1,
 pyteomics.auxiliary.structures.unitint: 3,
 pyteomics.auxiliary.structures.unitfloat: 6,
 numpy.ndarray: 2,
 dict: 1}

mzML:

{pyteomics.auxiliary.structures.unitint: 5,
 str: 11,
 pyteomics.auxiliary.structures.unitfloat: 10,
 pyteomics.auxiliary.structures.cvstr: 18,
 dict: 5,
 list: 2,
 pyteomics.auxiliary.structures.unitstr: 5,
 int: 1,
 numpy.ndarray: 2}

That's the price paid for extra metadata.
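Per-spectrum tallies like the ones above could be produced with a small recursive counter. This is a sketch under assumptions, not the code used for those numbers: `count_types` is a hypothetical helper, it counts dictionary keys as well as values, and it treats anything other than a dict, list, or tuple (for example a numpy array) as a single object.

```python
from collections import Counter


def count_types(obj, counts=None):
    """Recursively tally the exact type of every object reachable from obj."""
    if counts is None:
        counts = Counter()
    counts[type(obj)] += 1
    if isinstance(obj, dict):
        # Assumption: keys count as instantiated objects too.
        for key, value in obj.items():
            count_types(key, counts)
            count_types(value, counts)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            count_types(item, counts)
    return counts


# Made-up stand-in for a parsed spectrum dictionary.
spectrum = {
    "num": 1,
    "msLevel": 1,
    "peaks": [1.0, 2.0],
    "precursor": {"mz": 500.2},
}
for type_, n in count_types(spectrum).items():
    print(type_.__name__, n)
```

Because the tally uses `type(obj)` rather than `isinstance`, subclasses such as the unit-carrying scalars show up under their own names instead of being folded into `int`, `float`, or `str`.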

Edited - I had an out-of-date copy of master.

sorenwacker commented 3 years ago

Then I wonder how you created that mzMLb file with the different compression. And should I report that to psims as well, or are you already in contact with the developers?

mobiusklein commented 3 years ago

@soerendip I'm the author of psims, so I can look into it from there. What I can tell is that your files claim to have gzip compression. The mzMLb writer in psims can use a variety of compressors (there are actually more that are supported that I haven't implemented yet) to compress the HDF5 file. The problem is that most of them aren't part of the h5py library; they require a separate package, hdf5plugin.

When h5py is installed but hdf5plugin isn't, psims will default to gzip compression; otherwise it uses blosc. The MzMLToMzMLb class can take an h5_compression parameter to control this.
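The fallback rule described above can be sketched with the standard library alone. This illustrates the behaviour, not the actual psims source: `pick_h5_compression` and its `plugin_available` argument are hypothetical names, and the returned strings are just the compressor names mentioned in this thread.

```python
import importlib.util


def pick_h5_compression(requested=None, plugin_available=None):
    """Choose a compressor name: explicit request, else blosc, else gzip."""
    if requested is not None:
        # An explicit choice (as with the h5_compression parameter) wins.
        return requested
    if plugin_available is None:
        # Check whether hdf5plugin is importable without importing it.
        plugin_available = importlib.util.find_spec("hdf5plugin") is not None
    return "blosc" if plugin_available else "gzip"


print("compressor that would be used here:", pick_h5_compression())
```

Printing the result before converting a batch of files is a quick way to notice that hdf5plugin is missing from the current environment.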

sorenwacker commented 3 years ago

I will create the files with those changes and see how much it improves. Thank you!

sorenwacker commented 2 years ago

Hi, sorry to bother you again. Are these details already documented somewhere? I am trying to figure out how to do that.

mobiusklein commented 2 years ago

To install hdf5plugin, you should just be able to use pip install hdf5plugin. I added some more documentation on the available compressors by their names. I also added more detail to the transformer in psims for you to use to convert an existing mzML file.

At read time, you have to have access to the same library to decompress, so if you write a file with blosc compression, you need to have hdf5plugin installed to read it too. blosc alone is faster than gzip, but blosc:zstd might give even better compression while being a bit slower than blosc alone (though still faster than gzip). You may need to figure out which works best for you.