Open cvanderaa opened 2 years ago
Good news! `MultiAssayExperiment` does most of the heavy lifting thanks to `saveHDF5MultiAssayExperiment()` and `loadHDF5MultiAssayExperiment()` :pray:
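For reference, a minimal sketch of the round trip, assuming the small `feat1` example object shipped with QFeatures. The output directory and the `"qf_"` prefix are arbitrary choices made for illustration, and the sketch is not checked against every release.

```r
library(QFeatures)             # also attaches MultiAssayExperiment
library(MultiAssayExperiment)
library(HDF5Array)

data(feat1)                    # example QFeatures object with one assay

## Hypothetical output location, chosen for this sketch only
h5dir <- file.path(tempdir(), "qf_h5")

## Write the assays to HDF5 and the rest of the object to an .rds file
saveHDF5MultiAssayExperiment(feat1, dir = h5dir, prefix = "qf_",
                             replace = TRUE)

## Read it back; the assays are now backed by the HDF5 file
feat1_h5 <- loadHDF5MultiAssayExperiment(dir = h5dir, prefix = "qf_")
class(assay(feat1_h5[[1]]))
```

The same `prefix` is passed to both calls so that the loader does not have to guess the file names.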
My strategy is now to refactor our code a little so that every function that adds or replaces data can handle `HDF5Array` objects and goes through `addAssay()` and `replaceAssay()`. That way I can concentrate the management of new HDF5 files for these assays there.
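A rough sketch of what such a funnel could look like, assuming the `feat1` example data. `processAndAdd()` is a hypothetical helper written for this illustration, not an existing QFeatures function.

```r
library(QFeatures)

data(feat1)

## Hypothetical helper (not part of QFeatures): apply `fun` to assay `i`
## and register the result as a new assay, going through addAssay() only.
processAndAdd <- function(object, i, name, fun) {
    se <- object[[i]]
    ## as.matrix() realises an HDF5Array in memory and leaves an
    ## ordinary matrix unchanged
    m <- as.matrix(assay(se))
    assay(se) <- fun(m)
    addAssay(object, se, name = name)
}

feat2 <- processAndAdd(feat1, "psms", "psms_log",
                       function(m) log2(m + 1))
names(feat2)
```

With every processing function written this way, any HDF5 bookkeeping would only need to live in `addAssay()` and `replaceAssay()`.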
I need advice:
1. By "handle `HDF5Array` objects", I mean coercing `HDF5Array` to `matrix`. In practice this means bringing the assay(s) to process from disk into memory and creating the new processed assay as a `matrix` (which can later be stored back on disk). This could cause "memory bursts" at every processing step for large datasets. Shall we keep this in mind for later, or should I try to tackle it right away? Note that tackling it requires sending a PR to MsCoreUtils, may take some time to implement and will increase code complexity.
2. Assays could be stored in memory (as `matrix`) or on disk (as `HDF5Array`). We could let the user decide what they prefer, but I'm afraid this will become messy. I would prefer to have either all assays in memory or all assays on disk. What do you think?
3. Should we add a new slot (`@hdf5info`)? It would provide the HDF5 information required to store new assays, and would also facilitate portability of HDF5-backed `QFeatures` objects. Furthermore, if you agree on the previous point, it would give an unambiguous way to determine whether the `QFeatures` object is stored on disk or in memory.
4. `saveHDF5MultiAssayExperiment()` works perfectly for a `QFeatures` object, meaning that we could simply update the documentation and mention the `MultiAssayExperiment` functionality. However, this may cause confusion, and a `saveHDF5QFeatures()` function may be more intuitive. Furthermore, if you agree on adding a new slot, this new function could handle it as expected.
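Without a dedicated slot, the storage mode can only be inferred from the assay classes. A minimal sketch, assuming the `feat1` example data: `storageMode()` is a hypothetical helper, and treating every `DelayedArray` as "on disk" is a simplification, since a `DelayedArray` can also wrap an in-memory object.

```r
library(QFeatures)
library(HDF5Array)

data(feat1)

## Hypothetical helper (not part of QFeatures): classify an object as
## "in-memory", "on-disk" or "mixed" from the class of each assay.
storageMode <- function(object) {
    on_disk <- vapply(
        names(object),
        function(i) is(assay(object[[i]]), "DelayedArray"),
        logical(1)
    )
    if (all(on_disk)) {
        "on-disk"
    } else if (!any(on_disk)) {
        "in-memory"
    } else {
        "mixed"
    }
}

storageMode(feat1)

## Add an HDF5-backed copy of the first assay to get a mixed object
se <- feat1[["psms"]]
h5 <- writeHDF5Array(assay(se), with.dimnames = TRUE)  # temporary HDF5 file
assay(se, withDimnames = FALSE) <- h5
feat_mixed <- addAssay(feat1, se, name = "psms_h5")
storageMode(feat_mixed)
```

The "mixed" case is the messy situation described in the second point; a slot such as `@hdf5info` would make this check unnecessary.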
This is a follow up on #157.
For the moment, a `QFeatures` object is stored fully in memory. However, as assays in a `QFeatures` object are supposed to be processed sequentially, keeping all assays in RAM may not be efficient for large datasets.

We should provide a function to switch from in-memory to on-disk storage. We would therefore require the `DelayedArray` class to avoid fetching and sending data to disk at every operation.
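To illustrate what `DelayedArray` buys here, a small sketch on a toy matrix (made-up data, written to a temporary HDF5 file): operations on the on-disk object are recorded rather than executed, and data is only read when the values are actually requested.

```r
library(DelayedArray)
library(HDF5Array)

## Toy intensity matrix, made up for this sketch
set.seed(1)
m <- matrix(rexp(20), nrow = 5,
            dimnames = list(paste0("f", 1:5), paste0("s", 1:4)))

## Write to a temporary HDF5 file; `h5` only points to the data on disk
h5 <- writeHDF5Array(m, with.dimnames = TRUE)

## Delayed operation: recorded on the object, nothing is read or written
h5_log <- log2(h5 + 1)
class(h5_log)

## The data is read, and the operation applied, only on realisation
res <- as.matrix(h5_log)
all.equal(unname(res), unname(log2(m + 1)))
```

Several processing steps can be stacked this way before anything is realised, which avoids a disk round trip at every operation.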