Open mobiusklein opened 5 months ago
I appreciate the PR! Unfortunately relying on an external library to be installed on the system is a non-starter. Not requiring any external dependencies (e.g. having a fully portable binary) is one of the primary goals of Sage.
I somehow missed the part about static linking - in that case, it could be possible.
How big are the binaries? Is there actually any benefit to adding mzMLb support? Is anyone actually using the format? Do other tools support it?
If enough people actually are using mzMLb, I will merge it in... but I am loathe to support yet another XML based format and complicate the build process unless it will actually get used by more than 1 person.
libhdf5
itself is around 10MB, but after enabling feature flag, the final artefact when compiled increased in size by only 0.5MB for sage
when built in release mode.
At the moment, we don't know how many people are using mzMLb since it's a bit of a chicken vs. egg problem. If tools don't support it, then nobody would bother using it, and if nobody uses it tools won't bother supporting it. There's a forthcoming Comet release with mzMLb support in MSToolkit, ProteoWizard supports it so presumably Skyline supports it. If Sage supports it, then there will be more motivation for users who aren't just deleting their mzML files after running a search against them to convert to mzMLb instead. Any new format has to overcome the inertia of existing formats to get people to use it.
If you still don't want to add the optional dependency to your codebase, that's fine too. Getting the build time dependencies working on CI can be aggravating, especially on Windows because vcpkg
seemed to have been designed to be all plumbing and no porcelain. And then mzdata
is still not well exercised beyond files generated by msconvert
and my own toolchains.
There's a poll going around discussing a community-driven design of a new format at https://qualtricsxm5l7r46fqr.qualtrics.com/jfe/form/SV_esUoW2IDzZalb6u and any feedback about what features it needs to have would be appreciated. If you're still working on mz_parquet
, that experience could be useful for overcoming some of the concerns people might have using Parquet as a container. I know I didn't like it because I A) didn't understand how record shredding and compression would interact, B) didn't like the idea of columns being mostly null and C) hadn't learned that the Arrow C++ library made appending and scanning easier in many languages.
Even 10 MB isn't bad - just want to make sure it's not ballooning to hundreds of MB. Let me test out the fork with mzMLb support and see how it works.
I think I am most likely the person that compiles sage the most - and I regularly swap between windows, macOS and linux just to make my life more difficult. I will personally probably never use mzMLb and don't particularly feel like having to install vcpkg et al just to compile sage. But this shouldn't be a gating factor for the project if other people will use it - perhaps we could gate it behind a compile-time feature flag that is only turned on during GH Actions or if users want to manually. I think this is what you've done, right?
That's correct. The mzmlb
feature flag on sage-cli
turns on the mzdata
flag in sage-cloudpath
which pulls in mzdata
and its mzmlb
feature enabled which in turn pulls in hdf5-rs
which requires libhdf5
to build. If you don't turn on those features, you don't need mzdata
et al or libhdf5
.
There's a detail I missed about the linking in hdf5-rs
. I'll be updating this PR soon.
I've updated the PR to build the hdf5
crate with the required feature to make it actually get static linkage. It will compile the HDF5 library from source during build time with CMake, so the HDF5 shared library doesn't need to be present. The resulting binary is ~3.5MB heavier than the original without the feature enabled.
This PR adds a new optional dependency on mzdata to read the mzMLb file format.
mzMLb support was recently added to ProteoWizard, and it provides fast random access to compressed data, and comparable sequential access speed to mzML.
Building with the mzMLb feature statically links the HDF5 library with the binary, so there is no impact on the end-user beyond the increased binary size, but it does require that the HDF5 library is available at build time. The tests were done with
libhdf5
10.3 which can be gotten via the system package manager on Linux systems and usingvcpkg
on Windows.