lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License
201 stars 38 forks source link

Add support for mzMLb #113

Open mobiusklein opened 5 months ago

mobiusklein commented 5 months ago

This PR adds a new optional dependency on mzdata to read the mzMLb file format.

mzMLb support was recently added to ProteoWizard, and it provides fast random access to compressed data, and comparable sequential access speed to mzML.

Building with the mzMLb feature statically links the HDF5 library with the binary, so there is no impact on the end-user beyond the increased binary size, but it does require that the HDF5 library is available at build time. The tests were done with libhdf5 10.3 which can be gotten via the system package manager on Linux systems and using vcpkg on Windows.

lazear commented 5 months ago

I appreciate the PR! Unfortunately relying on an external library to be installed on the system is a non-starter. Not requiring any external dependencies (e.g. having a fully portable binary) is one of the primary goals of Sage.

lazear commented 5 months ago

I somehow missed the part about static linking - in that case, it could be possible.

How big are the binaries? Is there actually any benefit to adding mzMLb support? Is anyone actually using the format? Do other tools support it?

If enough people actually are using mzMLb, I will merge it in... but I am loathe to support yet another XML based format and complicate the build process unless it will actually get used by more than 1 person.

mobiusklein commented 5 months ago

libhdf5 itself is around 10MB, but after enabling feature flag, the final artefact when compiled increased in size by only 0.5MB for sage when built in release mode.

At the moment, we don't know how many people are using mzMLb since it's a bit of a chicken vs. egg problem. If tools don't support it, then nobody would bother using it, and if nobody uses it tools won't bother supporting it. There's a forthcoming Comet release with mzMLb support in MSToolkit, ProteoWizard supports it so presumably Skyline supports it. If Sage supports it, then there will be more motivation for users who aren't just deleting their mzML files after running a search against them to convert to mzMLb instead. Any new format has to overcome the inertia of existing formats to get people to use it.

If you still don't want to add the optional dependency to your codebase, that's fine too. Getting the build time dependencies working on CI can be aggravating, especially on Windows because vcpkg seemed to have been designed to be all plumbing and no porcelain. And then mzdata is still not well exercised beyond files generated by msconvert and my own toolchains.

There's a poll going around discussing a community-driven design of a new format at https://qualtricsxm5l7r46fqr.qualtrics.com/jfe/form/SV_esUoW2IDzZalb6u and any feedback about what features it needs to have would be appreciated. If you're still working on mz_parquet, that experience could be useful for overcoming some of the concerns people might have using Parquet as a container. I know I didn't like it because I A) didn't understand how record shredding and compression would interact, B) didn't like the idea of columns being mostly null and C) hadn't learned that the Arrow C++ library made appending and scanning easier in many languages.

lazear commented 5 months ago

Even 10 MB isn't bad - just want to make sure it's not ballooning to hundreds of MB. Let me test out the fork with mzMLb support and see how it works.

I think I am most likely the person that compiles sage the most - and I regularly swap between windows, macOS and linux just to make my life more difficult. I will personally probably never use mzMLb and don't particularly feel like having to install vcpkg et al just to compile sage. But this shouldn't be a gating factor for the project if other people will use it - perhaps we could gate it behind a compile-time feature flag that is only turned on during GH Actions or if users want to manually. I think this is what you've done, right?

mobiusklein commented 5 months ago

That's correct. The mzmlb feature flag on sage-cli turns on the mzdata flag in sage-cloudpath which pulls in mzdata and its mzmlb feature enabled which in turn pulls in hdf5-rs which requires libhdf5 to build. If you don't turn on those features, you don't need mzdata et al or libhdf5.

mobiusklein commented 5 months ago

There's a detail I missed about the linking in hdf5-rs. I'll be updating this PR soon.

mobiusklein commented 5 months ago

I've updated the PR to build the hdf5 crate with the required feature to make it actually get static linkage. It will compile the HDF5 library from source during build time with CMake, so the HDF5 shared library doesn't need to be present. The resulting binary is ~3.5MB heavier than the original without the feature enabled.