rformassspectrometry / QFeatures

Quantitative features for mass spectrometry data
https://RforMassSpectrometry.github.io/QFeatures/

feat: support an HDF5 backend for on-disk memory #171

Open cvanderaa opened 2 years ago

cvanderaa commented 2 years ago

This is a follow-up on #157.

For the moment, a QFeatures object is stored entirely in memory. However, since the assays in a QFeatures object are meant to be processed sequentially, keeping all of them in RAM may be inefficient for large datasets.

We should provide a function to switch from in-memory to on-disk storage. We would therefore need the DelayedArray class to avoid fetching data from disk and writing it back at every operation.

cvanderaa commented 2 years ago

Good news! MultiAssayExperiment does most of the heavy lifting thanks to saveHDF5MultiAssayExperiment() and loadHDF5MultiAssayExperiment() :pray:
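
For reference, a minimal sketch of the round trip mentioned above, assuming the `feat1` example dataset shipped with QFeatures and a temporary directory chosen only for illustration. The directory name and the explicit `prefix` are assumptions, not part of any agreed design.

```r
library(MultiAssayExperiment)
library(QFeatures)

# Small example QFeatures object shipped with the package
data(feat1)

# Hypothetical output directory, for illustration only
h5dir <- file.path(tempdir(), "feat1_h5")

# Write the object to disk: the assay data go to an HDF5 file,
# the remainder of the object to an RDS file in the same directory
saveHDF5MultiAssayExperiment(feat1, dir = h5dir, prefix = "feat1",
                             replace = TRUE)

# Reload the object; the assays should now be HDF5-backed
# rather than ordinary in-memory matrices
feat1_h5 <- loadHDF5MultiAssayExperiment(dir = h5dir, prefix = "feat1")
class(assay(feat1_h5[[1]]))
```

Whether the reloaded object keeps every QFeatures-specific property is exactly what a dedicated wrapper would need to check.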

My strategy is now to refactor our code a little so that every function that adds or replaces data can handle HDF5Array objects and goes through addAssay() and replaceAssay(). That way, the management of new HDF5 files for these assays can be concentrated in those two functions.

I need advice:

  1. When I say "can handle HDF5Array objects", I mean coercing HDF5Array to matrix. In practice this means bringing the assay(s) to process from disk into memory and creating the new processed assay as a matrix (which can later be stored back on disk). This could cause "memory bursts" at every processing step for large datasets. Shall we keep this in mind for later, or should I try to tackle it right away? Note that tackling it requires sending a PR to MsCoreUtils, may take some time to implement, and will increase the code complexity.
  2. When a new assay is added or used for replacement, we must decide whether to store it in memory (as a matrix) or on disk (as an HDF5Array). We could let the user decide what they prefer, but I'm afraid this will become messy. I would prefer to have either all assays in memory or all assays on disk. What do you think?
  3. I am tempted to add a new slot to QFeatures objects (tentatively @hdf5info). It would provide the HDF5 information required to store new assays, and would also make HDF5-backed QFeatures objects more portable. Furthermore, if you agree with the previous point, it would give us an unambiguous way to determine whether the QFeatures object is stored on disk or in memory.
  4. saveHDF5MultiAssayExperiment() works perfectly for a QFeatures object, meaning that we could simply update the documentation to mention the MultiAssayExperiment functionality. However, this may cause confusion, and a saveHDF5QFeatures() may be more intuitive. Furthermore, if you agree on adding a new slot, this new function could handle it as expected.
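
To make points 1 and 2 concrete, here is a rough sketch on a toy matrix using HDF5Array directly. It contrasts the coercion described in point 1 with a delayed operation. The temporary files, the dataset name `"assay"` and the helper `isOnDisk()` are hypothetical and only there for illustration; in particular `isOnDisk()` is a crude check, since a DelayedArray can also wrap in-memory data.

```r
library(HDF5Array)   # also attaches DelayedArray

# Toy quantitative matrix written to a temporary HDF5 file
m <- matrix(c(1, 2, 4, 8, 16, 32), nrow = 3,
            dimnames = list(paste0("f", 1:3), c("s1", "s2")))
h5 <- writeHDF5Array(m, filepath = tempfile(fileext = ".h5"),
                     name = "assay", with.dimnames = TRUE)

# Option A (the coercion in point 1): pull the whole assay into memory,
# process it as an ordinary matrix, then write the result back to disk.
# The as.matrix() call is where a "memory burst" would occur.
processed <- log2(as.matrix(h5))
h5_processed <- writeHDF5Array(processed,
                               filepath = tempfile(fileext = ".h5"),
                               name = "assay", with.dimnames = TRUE)

# Option B: apply the operation to the HDF5-backed object itself.
# DelayedArray records the operation and only reads data on realisation.
h5_delayed <- log2(h5)

# Hypothetical helper (not part of QFeatures): a crude way to tell
# whether an assay is held as a delayed object or as a plain matrix
isOnDisk <- function(x) is(x, "DelayedArray")
isOnDisk(h5)
isOnDisk(processed)
```

Option B only works for operations that have DelayedArray methods, which is why custom processing functions would need changes before they could avoid the coercion in option A.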