Closed mobiusklein closed 3 years ago
Another thing I like about this scheme is that because HDF5 isn't a single contiguous byte stream, adding custom indices is much easier and less invasive. Being able to write extra arrays to the file without completely rewriting it means we can add some mutability to this data reader if we want.
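To illustrate the point about in-place mutability, here is a minimal sketch using `h5py` directly. The dataset names (`spectrum_MS_1000514`, `custom_index`) are illustrative stand-ins, not part of the mzMLb layout; the point is only that opening in append mode lets you add an array without touching the existing ones.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Create a file standing in for an existing mzMLb container.
with h5py.File(path, "w") as f:
    f.create_dataset("spectrum_MS_1000514", data=np.arange(10, dtype=np.float64))

# Re-open in append mode and write a new index array; nothing else
# in the file needs to be rewritten or relocated.
with h5py.File(path, "a") as f:
    f.create_dataset("custom_index", data=np.array([0, 3, 7], dtype=np.int64))

with h5py.File(path, "r") as f:
    index = f["custom_index"][:]
    original = f["spectrum_MS_1000514"][:]
```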
I've gone ahead and written a pretty simple mzMLb writer in https://github.com/mobiusklein/psims to explore some of this eventually.
After testing locally, I'm pretty satisfied with where this implementation is. I've made educated guesses at the controlled vocabulary parameters the ProteoWizard implementation will use once merged, but I've retained the biospi names too, just in case.
The truncate-and-predict compression aids are implemented, but they require sequential computation, so we can't vectorize them with NumPy.
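As a sketch of why the predictors resist vectorization (this is illustrative, not the actual pyteomics code): delta decoding is a running sum, which NumPy's `cumsum` happens to handle, but linear-prediction decoding reconstructs each value from the two previously *decoded* values, so it has to run as a sequential Python loop.

```python
import numpy as np

def delta_decode(residuals):
    # Delta decoding is a running sum, which np.cumsum can vectorize.
    return np.cumsum(residuals)

def linear_decode(residuals):
    # Linear prediction stores the first two values raw, then residuals
    # against the prediction x[i] = 2*x[i-1] - x[i-2]. Each step depends
    # on the two previously decoded values, forcing sequential decoding.
    out = np.array(residuals, dtype=np.float64, copy=True)
    for i in range(2, len(out)):
        out[i] = 2 * out[i - 1] - out[i - 2] + residuals[i]
    return out
```

For a strictly linear ramp the linear predictor is exact, so all residuals after the first two values are zero.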
I added a new "extras" installation section to support this, and then made an "all" category for convenience.
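The extras layout could look something like the following sketch (the group names and dependency lists here are illustrative, not copied from the actual `setup.py`):

```python
# Hypothetical extras_require layout for setup(); names are illustrative.
extras_require = {
    "mzmlb": ["h5py", "hdf5plugin"],
    # ...other optional feature groups would go here...
}

# An "all" extra that unions every optional group for convenience.
extras_require["all"] = sorted(
    {dep for deps in extras_require.values() for dep in deps}
)
```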
> array conversion (implemented by `ArrayConversionMixin`) doesn't work, because the overridden `_handle_binary` doesn't call `_convert_array`;
The arrays are transparently decompressed by HDF5 when they're read, so we don't need to decompress them explicitly. They're also stored byte-for-byte in the HDF5 file, so no base64 decoding is needed. There are two scenarios where something more is required, and that is partially implemented in `ArrayConversionMixin`. The first is supporting `decode_arrays=False`, and the second is reversing the numerical transformations that assist compression (e.g. linear prediction, delta prediction, Numpress). Linear and delta prediction are handled in-line because they were added with mzMLb, and it's not clear they will be used outside it. Numpress can be used with mzMLb, but the paper mostly sets it aside because linear prediction is very similar to Numpress while being much simpler.

I might be able to implement something more abstract to support `decode_arrays` and numerical transforms, then derive `ArrayConversionMixin` from that and have it handle the zlib-like compression.
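One possible shape for that abstraction, sketched below. None of these class or method names come from pyteomics; they are stand-ins showing the split: a base class owning `decode_arrays` and transform reversal, and a subclass (playing the role of `ArrayConversionMixin`) layering zlib-style decompression on top.

```python
import zlib

import numpy as np

class NumericalTransformBase:
    # Hypothetical base: owns decode_arrays and transform reversal.
    decode_arrays = True

    def _reverse_transform(self, array, transform):
        # Reverse a numerical compression aid; only delta is sketched here.
        if transform == "delta":
            return np.cumsum(array)
        return array

class ZlibConversionMixin(NumericalTransformBase):
    # Hypothetical subclass handling zlib-like compression.
    def handle_array(self, raw, transform=None):
        if not self.decode_arrays:
            return raw  # hand back the compressed bytes untouched
        data = np.frombuffer(zlib.decompress(raw), dtype=np.float64)
        return self._reverse_transform(data, transform)
```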
> we are missing the `FileReader` interface that does the work of supporting both file path and file object as the main argument to the parser constructor
>
> `FileReader` also handles the context manager support, which is re-implemented directly on `MzMLb`, as well as parts of the XML interface for `iterfind` support.
`h5py.File` already handles path vs. file-like object internally (though there are performance consequences to using a file-like object, of course, so I don't want to default to always giving it one). The `h5py.File` object, however, doesn't expose the same interface a file-like object would, so much of that has to be mocked or redirected. In an effort to look exactly like the `MzML` reader, those calls often end up being explicitly redirected to the `mzml_parser` attribute, but that only works superficially, and won't give you things like `fileno`, only `read`, `seek`, `tell`, and `closed`.
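The kind of redirection described above amounts to a thin facade like the sketch below. It is illustrative only (not the pyteomics code): it exposes exactly `read`, `seek`, `tell`, and `closed`, forwarding to an underlying binary stream standing in for what `mzml_parser` wraps, and deliberately omits `fileno()`.

```python
import io

class FileFacade:
    # Hypothetical facade forwarding the minimal file-like surface.
    def __init__(self, stream):
        self._stream = stream

    def read(self, n=-1):
        return self._stream.read(n)

    def seek(self, offset, whence=0):
        return self._stream.seek(offset, whence)

    def tell(self):
        return self._stream.tell()

    @property
    def closed(self):
        return self._stream.closed

facade = FileFacade(io.BytesIO(b"<mzML></mzML>"))
```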
The `FileReader` class is tightly coupled to `_file_obj`, but I could inherit from `TimeOrderedIndexedReaderMixin`, most of whose methods I'm already overriding anyway, to preserve that kind of type checking. If we need to preserve `FileReader` as a base type, we could turn it into an abstract base class and use the `ABC.register` method to satisfy `issubclass(MzMLb, FileReader)`?
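For reference, the `register` escape hatch works like this minimal demonstration: the registered class satisfies `issubclass`/`isinstance` checks without actually inheriting anything. The class bodies here are empty stand-ins mirroring the names in the discussion, not the real implementations.

```python
from abc import ABC

class FileReader(ABC):
    pass  # no abstract methods are needed for registration to work

class MzMLb:
    pass  # implements the interface independently, no inheritance

# Virtual subclassing: issubclass/isinstance now succeed.
FileReader.register(MzMLb)
```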
As for the XML-related duplication, again, I've forwarded the relevant methods.
Much of this issue could be avoided if, instead of insisting on straightforward composition, I turned this object hierarchy inside-out: let `MzMLb` be `ExternalDataMzML`, attach all the lower-level HDF5 manipulation to a new class, and let `ExternalDataMzML` own that object. The catch is that the HDF5 manipulation object is involved much more heavily in the initialization of the mzMLb parser than in the normal `MzML.__init__`. This would lead to a much more convoluted implementation of `reset` and the indexing cascade.
Ignore the complaints about `ArrayConversionMixin`. I see what you were referring to: it wasn't about decompression, but about dtype casting, which I had forgotten.
> Much of this issue could be avoided if instead of insisting on using straight-forward composition, I turn this object hierarchy inside-out. Let `MzMLb` be `ExternalDataMzML`, and then attach all the lower-level HDF5 manipulation to a new class and let `ExternalDataMzML` own that object, except that HDF5 manipulation object is involved much more heavily in the initialization of the MzML parser than the normal `MzML.__init__`. This would lead to a much more convoluted implementation of `reset` and the indexing cascade.
This would seem more natural to me, conceptually, but if it feels "inside out" to you, I won't insist on it. In that case, using the ABC machinery should probably work, or I can try to put something together myself. I haven't thought this through as far as you have yet.
Made `FileReader` an abstract base class (with no abstract methods) and then registered `MzMLb` with it. A workaround that may only add a few nanoseconds to all future reader instantiations.
I haven't expanded on mutation of the HDF5 file, but the option is there. I'm reasonably happy with the state of things if you are.
This is a draft pull request for reading mzMLb files. The implementation is based upon the reference implementation at https://github.com/biospi/pwiz. The built artifact from https://github.com/biospi/pwiz let me convert an mzML file to mzMLb and then read it using the new code in this branch. I don't think this is ready yet, because there's information missing about how to detect the lossy compression scheme (`--mzLinear` and `--intenLinear`, or `--mzDelta` and `--intenDelta`).

Source: https://doi.org/10.1021/acs.jproteome.0c00192
Helpful features that mzMLb provides include:
In order to use this, though, we need to be able to read HDF5 files, potentially including extra HDF5 plugins to access more compressors. This adds h5py and hdf5plugin as dependencies. They're pretty heavy, but well packaged. If we want these to be optional dependencies rather than part of the core, I can make this a namespace member package instead.