theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
145 stars 27 forks source link

Write DelayedArray to file on the R side. #35

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

Closes #32.

library(zellkonverter)
library(scRNAseq)
sce <- ZeiselBrainData()
counts(sce) <- DelayedArray(counts(sce))

temp <- tempfile(fileext = ".h5ad")
writeH5AD(sce, temp)

Some points:

Known to-dos:

LTLA commented 3 years ago

Latest commit solves the to-do above, but needs a lot of testing.

Briefly, we set up extensible HDF5 datasets and then iterate across row-wise chunks of the sparse DA. In each chunk, we extract the non-zero values, convert them to compressed sparse row format, and append them to the existing HDF5 dataset via h5set_extent (to make the existing dataset larger) and h5writeDataset to write the new values in the expanded space. This approach respects the memory constraints of block processing, by ensuring that we do not load all non-zero values into memory; while retaining a single-pass approach for rapid processing, by avoiding the need to know the total number of non-zeroes.

The code here makes a number of assumptions:

lazappi commented 3 years ago

Added a couple of tests and fixed how the loop for rewriting matrices handled paths.

LTLA commented 3 years ago

Hit a roadblock in the form of grimbough/rhdf5#79; the AnnData reader doesn't like the fixed-width byte strings that rhdf5 emits.

For the time being, I suggest we just add a clause to the initial check for DAs where we do not skip a DA if it is_sparse() is TRUE. This implies that it will be realized into memory in SCE2AnnData - hopefully this is tolerated on the user machine. We can switch back to the current block processing behavior once the rhdf5 issue is resolved.

LTLA commented 3 years ago

@lazappi were you going to work on this?

lazappi commented 3 years ago

Sorry I've lost track of it a bit. Is it just the is_sparse() check that we need to add?

LTLA commented 3 years ago

Yes, I think line 60 in my PR should only skip the assay if it's a DA and !is_sparse(). Sparse DAs should be coerced to in-memory dgCMatrix objects for the time being.

lazappi commented 3 years ago

Ok, I've added that check. The code for writing sparse DelayedArrays can't be reached by anything now which made the test coverage drop but that's fine and better to have it there for when the rhdf5 issue is fixed. Anything else to add?

LTLA commented 3 years ago

Think that's it, just slap on some # nocov start and ends around it for the time being, I guess.