Open sorenwacker opened 3 years ago
When I was developing the mzMLb reader, we did do some benchmarking to show it was actually advantageous over mzML beyond the size reduction. It turned out to be sensitive to access patterns and the HDF5 compressor used. Are you able to share the files you benchmarked? I'd like to be able to look into where the slow-down you're observing is.
Were you sequentially reading scans from the file? The mzMLb reader does a small amount of optimization for sequential reading, more as a consequence of HDF5's chunked storage, but this saves additional work like converting these chunks to Python objects.
If you used the psims converter, did you also install hdf5plugin? One of the major advantages of mzMLb is that it opens the way for much faster compressors like blosc and blosc:zstd that are not part of the h5py/hdf5 library itself the way zlib is, but can be installed as plugins. zlib has excellent compression ratios on a broad variety of data types, but it is quite slow. These newer compressors have been shown to be much faster without sacrificing much in terms of compression, particularly on numerical data. There's a pretty good chance that blosc will be required for the future.
mzXML has much less metadata to decode, so it isn't surprising that it might go faster.
The files were downloaded from https://www.ebi.ac.uk/metabolights/MTBLS1569/descriptors. The converted files can be downloaded from https://soerendip.com/dl/MTBLS1569/
I used the 12 files starting with T for the test.
I installed some HDF5 library to make it work. I had opened a GitHub issue about it, but I can't find it, and I am not sure whether it was related to pyteomics or psims.
First off, you're right, initial reading of the mzMLb file was way too slow on a real file. It turns out bytearray(h5py.Dataset) is way slower than bytearray(numpy.array(h5py.Dataset)), for reasons I may try to figure out later. Thank you for reporting that.
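A small sketch of the difference described above, assuming h5py and numpy are installed. The dataset name and payload are made up for the demo, and the explanation in the comments (that bytearray() ends up iterating the dataset one element at a time) is an assumption on my part, not something confirmed in this thread.

```python
import time

import h5py
import numpy as np


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Hypothetical stand-in for the byte buffer stored inside an mzMLb file.
payload = np.frombuffer(b"<spectrum/>" * 500, dtype=np.uint8)

# driver="core" with backing_store=False keeps the demo file in memory.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    ds = f.create_dataset("xml_bytes", data=payload)

    # Assumed slow path: bytearray() walks the dataset element by element.
    slow, t_slow = timed(lambda: bytearray(ds))
    # Fast path: one bulk read into a numpy array, then one buffer copy.
    fast, t_fast = timed(lambda: bytearray(np.array(ds)))

print(f"bytearray(dataset):          {t_slow:.4f} s")
print(f"bytearray(numpy.array(...)): {t_fast:.4f} s")
```

Both calls should produce the same bytes; only the timings differ, and they will vary by machine and h5py version.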
Also, thank you for hosting the files you converted and tested. I downloaded each version of Tx_1h_R1 for mzML, mzXML, and mzMLb.
I took a look at the mzMLb file first. Either you didn't have hdf5plugin installed, or psims didn't automatically pick the right compression library for you. It looks like this used gzip compression, which would explain part of the slowdown:
In [1]: import h5py
In [2]: handle = h5py.File("Tx_1h_R1.mzMLb")
In [3]: arr = handle['spectrum_MS_1000514_float64']
In [4]: arr.compression
Out[4]: 'gzip'
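To run the same check over every dataset in a file rather than one at a time, something like the following might help. It is a sketch assuming h5py and numpy are installed; dataset_compression is a hypothetical helper name, the demo file is created in memory, and I have not checked what the compression attribute reports for plugin-provided filters such as blosc.

```python
import h5py
import numpy as np


def dataset_compression(h5file):
    """Map every dataset path in an open h5py.File to its compression setting."""
    report = {}

    def visit(name, obj):
        # visititems() also visits groups; only datasets carry a compressor.
        if isinstance(obj, h5py.Dataset):
            report[name] = obj.compression  # e.g. 'gzip', or None if uncompressed

    h5file.visititems(visit)
    return report


# In-memory demo file with one gzip-compressed and one uncompressed dataset.
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    data = np.arange(1000, dtype=np.float64)
    f.create_dataset("compressed", data=data, compression="gzip")
    f.create_dataset("raw", data=data)
    report = dataset_compression(f)

print(report)
```

For a real file, opening it with h5py.File(path, "r") and passing the handle to the helper should give one entry per array.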
I re-converted it from the mzML file with blosc compression instead, denoting it Tx_1h_R1.blosc.mzMLb. Access is around 4x faster. I also added a zlib-binary data array compressed mzML, and added a gzipped version of that.
I compared reading each format from start to end and compared randomly retrieving spectra between a subset of configurations:
The sizes of each of these files, in megabytes, show that (after I close that branch out, apparently) you get the best random access time to file size tradeoff using mzMLb. After that optimization, you also get sequential access performance comparable to mzML.
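For anyone who wants to repeat this kind of comparison, here is a minimal timing harness using only the standard library. It is not the script used for the numbers in this thread: the function names are hypothetical, and the "reader" is a plain list standing in for a real file reader.

```python
import random
import time


def time_sequential(reader):
    """Return (spectrum_count, seconds) for one pass over the reader in order."""
    start = time.perf_counter()
    count = sum(1 for _ in reader)
    return count, time.perf_counter() - start


def time_random(reader, n_spectra, n_reads=100, seed=0):
    """Return seconds needed to fetch n_reads spectra at random positions."""
    rng = random.Random(seed)  # fixed seed so every format sees the same indices
    indices = [rng.randrange(n_spectra) for _ in range(n_reads)]
    start = time.perf_counter()
    for i in indices:
        reader[i]
    return time.perf_counter() - start


# Stand-in "reader": a list of fake spectra, so the sketch runs anywhere.
fake_reader = [{"index": i} for i in range(1000)]
count, seq_seconds = time_sequential(fake_reader)
rand_seconds = time_random(fake_reader, n_spectra=count)
print(count, seq_seconds, rand_seconds)
```

In principle an indexed pyteomics reader could be passed in place of fake_reader, assuming it supports both iteration and integer indexing. A real reader may need to be reset or reopened between the sequential and random passes.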
My measurements on mzXML suggest its speed comes largely from instantiating fewer objects, especially fancy-scalars, per spectrum:
mzXML:
{str: 20,
bool: 1,
pyteomics.auxiliary.structures.unitint: 3,
pyteomics.auxiliary.structures.unitfloat: 6,
numpy.ndarray: 2,
dict: 1}
mzML:
{pyteomics.auxiliary.structures.unitint: 5,
str: 11,
pyteomics.auxiliary.structures.unitfloat: 10,
pyteomics.auxiliary.structures.cvstr: 18,
dict: 5,
list: 2,
pyteomics.auxiliary.structures.unitstr: 5,
int: 1,
numpy.ndarray: 2}
That's the price paid for extra metadata.
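One way to get per-spectrum type counts like the ones above is to walk the parsed record recursively. This is a sketch of the idea, not the code used for those numbers: count_types is a hypothetical helper, the sample record is made up, and whether dictionary keys were counted in the original measurement is unknown (they are counted here).

```python
from collections import Counter


def count_types(obj, counts=None):
    """Recursively tally the types of a value and everything nested inside it."""
    if counts is None:
        counts = Counter()
    counts[type(obj)] += 1
    if isinstance(obj, dict):
        for key, value in obj.items():
            count_types(key, counts)
            count_types(value, counts)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            count_types(item, counts)
    return counts


# Made-up spectrum-like record, not real pyteomics output.
spectrum = {"id": "scan=1", "ms level": 1, "precursors": [{"mz": 445.12}]}
counts = count_types(spectrum)
print(dict(counts))
```

Because type() returns the exact class, subclasses such as unit-carrying floats or strings would show up under their own names rather than as float or str.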
Edited - I had an out-of-date copy of master.
Then I wonder how you created that mzMLb file with the different compression?
And should I report that to psims as well, or are you already in contact with the developers?
@soerendip I'm the author of psims, so I can look into it from there. What I can tell is that your files claim to have gzip compression. The mzMLb writer in psims can use a variety of compressors (there are actually more that are supported that I haven't implemented yet) to compress the HDF5 file. The problem is that most of them aren't part of the h5py library, they require a separate package, hdf5plugin.
When h5py is installed but hdf5plugin isn't, psims will default to gzip compression, otherwise it uses blosc, though the MzMLToMzMLb class can take an h5_compression parameter to control it.
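The fallback rule just described can be sketched with the standard library alone. This is not the actual psims code: default_compression and its have argument are hypothetical names used only to illustrate the decision.

```python
import importlib.util


def default_compression(have=None):
    """Pick a compressor name following the rule described above.

    `have` is a hypothetical hook: a function that says whether a module
    is importable. By default it asks importlib.
    """
    if have is None:
        have = lambda name: importlib.util.find_spec(name) is not None
    if not have("h5py"):
        raise RuntimeError("h5py is required to write mzMLb files")
    # blosc lives in hdf5plugin; without it, fall back to the built-in gzip.
    return "blosc" if have("hdf5plugin") else "gzip"


# Simulate the two situations from the thread without touching the environment.
only_h5py = default_compression(lambda name: name == "h5py")
both = default_compression(lambda name: True)
print(only_h5py, both)
```

Calling default_compression() with no argument reports what the current environment would choose under this rule.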
I will create the files with those changes and see how much it improves. Thank you!
Hi, sorry to bother you again. Are these details already documented somewhere? I am trying to figure out how to do that.
To install hdf5plugin, you should just be able to use pip install hdf5plugin. I added some more documentation on the available compressors by their names. I also added more detail to the transformer in psims for you to use to convert an existing mzML file.
At read time, you have to have access to the same library to decompress, so if you write a file with blosc compression, you need to have hdf5plugin installed to read it too. blosc alone is faster than gzip, but blosc:zstd might give even better compression while being a bit slower than blosc alone (though still faster than gzip). You may need to figure out which works best for you.
Hi,
I made a quick benchmark with some metabolomics files and measured the read speed of various formats. For some reason I get the worst performance with the mzMLb format. I wonder if you ever made a benchmark to compare the read speed in pyteomics with other programs. The orange bar is the bare read into memory, without any conversion into a structured dataframe. mzML and mzXML were read with pyteomics, while I used pymzml for mzML format.