rformassspectrometry / Spectra

Low level infrastructure to handle MS spectra
https://rformassspectrometry.github.io/Spectra/
Export Spectra as HDF5 format (.h5)? #229

Open Don86 opened 2 years ago

Don86 commented 2 years ago

Hi,

I'd like to ask whether there is currently a way to write out a Spectra S4 object, probably initially read from .mzML or .mzXML, as .h5? There doesn't seem to be this capability from what I've seen in the manual. HDF5 seems like a better storage option since it has a smaller file size, is well supported outside of the mass spec world, and is easily interoperable with Python as well.

Regards, Don

lgatto commented 2 years ago

There's the MsBackendHdf5Peaks backend that stores the m/z and intensities on-disk in custom hdf5 data files. The spectra variables are still stored and manipulated in memory (in a DataFrame).

When you say HDF5 seems like a better storage option, I assume you mean compared to mzML. Even though you aren't wrong, mzML (a specific XML-based implementation for MS data that is widely adopted) and HDF5 (a general data storage system) are hardly directly comparable.

lgatto commented 2 years ago

By the way, I'm transferring this issue from the RforMassSpectrometry.org repo to the Spectra package, which is where the backend class and interface are defined.

Adafede commented 1 year ago

Happy to find this issue still open! Would be great indeed 😊

jorainer commented 1 year ago

Note that there are different backends already available that support export in a variety of formats. You could import an mzML file and export it as an MGF file using the MsBackendMgf backend - but that might not be efficient. As an alternative, you could store the MS data from an mzML file in a SQL database (SQLite or MySQL) using the MsBackendSql backend - but again, that's not a standard format - it's the format we define. But you could also read/import that data from the SQLite or MySQL database from Python et al.
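
As a rough illustration of the last point, here is a minimal sketch of reading peak data from a SQLite file with Python's standard `sqlite3` module. The table and column names (`toy_spectra`, `toy_peaks`, `spectrum_id`, `mz`, `intensity`) are hypothetical and are not the layout MsBackendSql writes; the sketch builds a toy in-memory database only so that it runs on its own. For a real database, `list_tables()` is the part that carries over: inspect the actual tables first, then adapt the query.

```python
import sqlite3


def list_tables(con):
    """Return the names of all tables in the SQLite database, sorted by name."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def read_peaks(con, spectrum_id):
    """Return (mz, intensity) pairs of one spectrum, sorted by m/z.

    NOTE: 'toy_peaks' and its columns are hypothetical names used for this
    sketch only; adapt them to the layout found with list_tables().
    """
    return con.execute(
        "SELECT mz, intensity FROM toy_peaks WHERE spectrum_id = ? ORDER BY mz",
        (spectrum_id,),
    ).fetchall()


def build_toy_database():
    """Create a small in-memory database standing in for a real SQLite file."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE toy_spectra (spectrum_id INTEGER PRIMARY KEY, ms_level INTEGER)"
    )
    con.execute(
        "CREATE TABLE toy_peaks (spectrum_id INTEGER, mz REAL, intensity REAL)"
    )
    con.executemany("INSERT INTO toy_spectra VALUES (?, ?)", [(1, 1), (2, 2)])
    con.executemany(
        "INSERT INTO toy_peaks VALUES (?, ?, ?)",
        [(1, 100.5, 20.0), (1, 50.25, 10.0), (2, 75.0, 5.0)],
    )
    con.commit()
    return con


con = build_toy_database()
print(list_tables(con))    # which tables does the file contain?
print(read_peaks(con, 1))  # peaks of the spectrum with id 1
```

With a file on disk, `sqlite3.connect("path/to/file.sqlite")` would replace `build_toy_database()`; the path is a placeholder.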

Adafede commented 1 year ago

I saw them, and they are great for so many cases!

My (probably relatively rare) use case is matching (a few) spectra against a (HUGE) spectral library, which stays fixed for a very long time. My feeling is that loading with an MGF backend takes ages, while loading with a DB backend is indeed faster, but still far from HDF5.

We faced this issue of 99% of the time being taken by loading the spectra (not the matching) in our https://github.com/mandelbrot-project/spectral_lib_matcher#using-binary-libraries, which is why we implemented binary libraries.
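
The general idea of a binary library can be sketched as "parse the text format once, then reload a binary cache". The example below is only an illustration of that pattern using Python's `pickle`; it is not the code of the linked project, and the one-line-per-spectrum text format, the function names, and the file names are all made up for the sketch.

```python
import os
import pickle
import tempfile


def parse_text_library(path):
    """Parse a toy text library: one spectrum per line, 'name<TAB>mz:intensity ...'.

    This format is hypothetical; a real library (e.g. MGF) needs a real parser.
    """
    library = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            name, peaks = line.split("\t")
            library[name] = [
                tuple(float(value) for value in peak.split(":"))
                for peak in peaks.split()
            ]
    return library


def load_library(text_path, cache_path):
    """Load the library from the binary cache if present, else parse and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)  # only unpickle files you created yourself
    library = parse_text_library(text_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(library, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return library


# Demo with a two-spectrum toy library in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    text_path = os.path.join(workdir, "library.txt")
    cache_path = os.path.join(workdir, "library.pkl")
    with open(text_path, "w", encoding="utf-8") as handle:
        handle.write("spec_a\t50.25:10 100.5:20\n")
        handle.write("spec_b\t75:5\n")

    first = load_library(text_path, cache_path)   # parses the text, writes the cache
    cache_written = os.path.exists(cache_path)
    second = load_library(text_path, cache_path)  # reads the binary cache only
```

The sketch never invalidates the cache, which only suits a library that stays fixed; if the text file can change, the cache would have to be deleted or compared against the source's modification time.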

jorainer commented 1 year ago

@Adafede, if you have a huge reference spectral library, you might consider storing it in a CompDb database (from the CompoundDb package). That package also provides a Spectra backend that retrieves the data directly from the database. That should be faster than using an MGF backend.