levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Draft mzmlb implementation #35

Closed: mobiusklein closed this 3 years ago

mobiusklein commented 3 years ago

This is a draft pull request for reading mzMLb files. The implementation is based on the reference implementation at https://github.com/biospi/pwiz.

The built artifact from https://github.com/biospi/pwiz let me convert an mzML file to mzMLb and then read it using the new code in this branch. I don't think this is ready yet, because information is missing about how to detect which lossy compression scheme was used (--mzLinear and --intenLinear, or --mzDelta and --intenDelta).

Source: Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021). mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research, 20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192

Helpful features that mzMLb provides include:

  1. Fast random access to compressed data while achieving better compression ratios than compressing XML.
  2. Better pre-built index support, so there is no need to scan the entire file start-to-finish to build reliable indices.
  3. Simpler compression-helper encodings.

To use this, though, we need to be able to read HDF5 files, potentially including extra HDF5 plugins to access more compressors. This adds h5py and hdf5plugin as dependencies. They're pretty heavy, but well packaged. If we would rather keep them optional than make them core dependencies, I can make this a namespace member package instead.
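As a rough illustration of the random-access point above, this is the kind of thing h5py makes possible on an mzMLb file. A sketch only: the dataset names ("mzML", "mzML_spectrumIndex", "spectrum_MS_1000514_float64") follow the mzMLb convention as I understand it from the paper, so treat them as assumptions rather than a fixed API.

```python
import h5py

with h5py.File("small.mzMLb", "r") as handle:
    # Byte offsets of each spectrum within the embedded XML buffer;
    # stored as their own dataset, so no full-file scan is needed.
    index = handle["mzML_spectrumIndex"][:]

    # Slice out just the XML for the sixth spectrum. HDF5 decompresses
    # only the chunks covering this range, not the whole document.
    start, end = index[5], index[6]
    xml_chunk = handle["mzML"][start:end].tobytes()

    # Binary arrays live in long per-dtype datasets; a spectrum's m/z
    # values are a lazily-read slice of one of them.
    mz_slab = handle["spectrum_MS_1000514_float64"][1000:2000]
```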

mobiusklein commented 3 years ago

Another thing I like about this scheme: because HDF5 isn't a single contiguous byte stream, adding custom indices is much easier and less invasive. Since you can write extra arrays to the file without completely rewriting it, we can add some mutability to this data reader if we want.
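As a concrete sketch of that mutability, HDF5 lets you append a new dataset to an existing file in place; the dataset name here is made up for illustration.

```python
import h5py
import numpy as np

# Append a custom index to an existing mzMLb file without rewriting it.
# "custom_precursor_index" is a hypothetical name, purely illustrative.
with h5py.File("small.mzMLb", "a") as handle:
    handle.create_dataset(
        "custom_precursor_index",
        data=np.array([0, 1024, 4096], dtype=np.int64),
        compression="gzip",
    )
```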

I've gone ahead and written a pretty simple mzMLb writer in https://github.com/mobiusklein/psims to explore some of this eventually.

mobiusklein commented 3 years ago

After testing locally, I'm pretty satisfied with where this implementation is. I've made educated guesses at the controlled vocabulary parameters the ProteoWizard implementation will use once merged, but I've retained the biospi names too, just in case.

The truncate-and-predict compression aids are implemented, but they require sequential computation, so we can't vectorize them with NumPy.
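For a sense of why, here is roughly what reversing the linear-prediction helper looks like. This is a simplified sketch, not the code in this branch; the fixed-point scale and storing the first two values verbatim are assumptions borrowed from how Numpress-style linear prediction works.

```python
import numpy as np

def decode_linear_prediction(residuals, scale):
    """Reverse a truncate-and-predict linear encoding (sketch only)."""
    out = np.empty(len(residuals), dtype=np.int64)
    for i, r in enumerate(residuals):
        if i < 2:
            out[i] = r  # seed values assumed stored as-is
        else:
            # Each output depends on the two previously *reconstructed*
            # values, which is what forces sequential evaluation.
            out[i] = 2 * out[i - 1] - out[i - 2] + r
    return out.astype(np.float64) / scale
```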

I added a new "extras" installation section to support this, and then made an "all" category for convenience.

mobiusklein commented 3 years ago

array conversion (implemented by ArrayConversionMixin) doesn't work, because the overridden _handle_binary doesn't call _convert_array;

The arrays are transparently decompressed by HDF5 when they're read, so we don't need to decompress them explicitly. They're also stored byte-for-byte in the HDF5 file, so no base64 decoding is needed either. There are two scenarios where something more is required, and both are partially implemented in ArrayConversionMixin. The first is supporting decode_arrays=False; the second is reversing the compression-helper numerical transformations (e.g. linear prediction, delta prediction, Numpress). Linear and delta prediction are handled in-line because they were introduced with mzMLb, and it's not clear that they will be used outside of it. Numpress can be used with mzMLb, but it's rather shoved to the side in the paper because linear prediction is very similar to Numpress while being much simpler.

I might be able to implement something more abstract to support decode_arrays and numerical transforms, then derive ArrayConversionMixin from that and have it handle the zlib-like compression.
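Something along these lines, where every name is hypothetical and the real mixin does considerably more; the subclass plays the role ArrayConversionMixin would.

```python
import zlib
import numpy as np

class ArrayDecodingBase:
    """Hypothetical abstract layer: owns decode_arrays handling and
    numerical-transform reversal, leaving byte-level decompression
    to subclasses."""

    def __init__(self, decode_arrays=True):
        self.decode_arrays = decode_arrays

    def _decompress(self, raw):
        return raw  # mzMLb case: HDF5 has already decompressed the bytes

    def _finalize(self, raw, dtype, transform=None):
        if not self.decode_arrays:
            return raw  # hand back the undecoded bytes
        array = np.frombuffer(self._decompress(raw), dtype=dtype)
        # Reverse any compression-helper transform (delta/linear/Numpress).
        return transform(array) if transform is not None else array

class ZlibArrayConversion(ArrayDecodingBase):
    """mzML-style subclass: payloads arrive zlib-compressed."""

    def _decompress(self, raw):
        return zlib.decompress(raw)
```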

we are missing the FileReader interface that does the work of supporting both file path and file object as main argument to the parser constructor

FileReader also handles the context manager support, which is re-implemented directly on MzMLb, as well as parts of the XML interface for iterfind support.

h5py.File already handles path vs. file-like object internally (though there are performance consequences to using a file-like object, of course, so I don't want to just default to always giving it one). The h5py.File object, however, doesn't present the same interface a file-like object would, so much of that has to be mocked or redirected. In an effort to look exactly like the MzML reader, those attributes often end up being explicitly redirected to the mzml_parser attribute, but that only works superficially, and won't give you things like fileno, only read, seek, tell, and closed.
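The forwarding amounts to something like this sketch; the attribute names are illustrative and the real wiring goes through the parser's source object.

```python
class MzMLbFileFacade:
    """Hypothetical sketch of the redirection described above: expose a
    narrow file-like surface by forwarding to the wrapped mzML parser."""

    def __init__(self, mzml_parser):
        self.mzml_parser = mzml_parser

    # Only read/seek/tell/closed can be honestly forwarded; there is no
    # underlying OS-level handle, so fileno() cannot be provided.
    def read(self, n=-1):
        return self.mzml_parser.read(n)

    def seek(self, offset, whence=0):
        return self.mzml_parser.seek(offset, whence)

    def tell(self):
        return self.mzml_parser.tell()

    @property
    def closed(self):
        return self.mzml_parser.closed
```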

The FileReader class is tightly coupled to _file_obj, but I could inherit from TimeOrderedIndexedReaderMixin, most of whose methods I'm already overriding anyway, to preserve that kind of type checking. If we need to preserve FileReader as a base type, we could turn it into an abstract base class and use the ABC.register method to satisfy issubclass(MzMLb, FileReader)?

As for the XML-related duplication, again I've forwarded the relevant methods.

Much of this issue could be avoided if, instead of insisting on straightforward composition, I turned this object hierarchy inside-out: let MzMLb be ExternalDataMzML, attach all the lower-level HDF5 manipulation to a new class, and let ExternalDataMzML own that object. The catch is that the HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than in the normal MzML.__init__, which would lead to a much more convoluted implementation of reset and the indexing cascade.

mobiusklein commented 3 years ago

Ignore the complaints about ArrayConversionMixin. I see what you were referring to: it wasn't about decompression, but dtype casting, which I had forgotten about.

levitsky commented 3 years ago

Much of this issue could be avoided if, instead of insisting on straightforward composition, I turned this object hierarchy inside-out: let MzMLb be ExternalDataMzML, attach all the lower-level HDF5 manipulation to a new class, and let ExternalDataMzML own that object. The catch is that the HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than in the normal MzML.__init__, which would lead to a much more convoluted implementation of reset and the indexing cascade.

This would seem more natural to me, conceptually, but if it feels "inside out" to you, I won't insist on it. In that case using the ABC machinery should probably work, or I can try and see if I can put something together myself. I haven't thought this through as far as you have yet.

mobiusklein commented 3 years ago

Made FileReader an abstract base class (with no abstract methods) and then registered MzMLb with it. A workaround that may only add a few nanoseconds to all future reader instantiations.
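The pattern is just virtual subclassing (a minimal, self-contained sketch; the real classes carry far more machinery):

```python
from abc import ABC

class FileReader(ABC):
    """An ABC with no abstract methods, kept purely as a registrable
    base type for isinstance/issubclass checks."""

class MzMLb:
    """Stand-in for the real reader; note that it does not inherit."""

# Register MzMLb as a virtual subclass: the check below passes without
# inheritance, and the cost is paid at issubclass time, not per read.
FileReader.register(MzMLb)

assert issubclass(MzMLb, FileReader)
```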

I haven't expanded on mutation of the HDF5 file, but the option is there. I'm reasonably happy with the state of things if you are.