theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/

writeH5AD fails for very large datasets (> 1.5 million cells) #73

Open GabrielHoffman opened 1 year ago

GabrielHoffman commented 1 year ago

Hi Luke, Thanks again for the package, I use it every day!

I have a huge H5AD file of 40k genes and 3.7M cells. I load it into R with readH5AD(..., use_hdf5=TRUE). After QC and filtering I want to write 1.5M cells to another H5AD file. When I use writeH5AD(sce[,include], outfile) I get a segfault after ~20 minutes. Memory shouldn't be an issue since I requested 576 GB of RAM on my compute node. I managed to work around this by 1) writing the SingleCellExperiment as 4 chunks to separate H5AD files, then 2) using AnnData in Python to concatenate the 4 files into a single H5AD.
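
For readers who want to try the same workaround, here is a minimal Python sketch of the chunk-and-concatenate idea. It is not the exact code used above: the helper names chunk_bounds and concat_h5ad_chunks and any file paths are hypothetical. It assumes the anndata package is installed and that every chunk was written from the same object, so all chunks share the same genes.

```python
from pathlib import Path


def chunk_bounds(n_cells: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split range(n_cells) into n_chunks contiguous, near-equal slices.

    Returns 0-based, half-open (start, stop) pairs. Hypothetical helper,
    used only to decide which cells go into which chunk file.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(n_cells, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional cell.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def concat_h5ad_chunks(chunk_paths, out_path):
    """Concatenate per-chunk H5AD files along the cell (obs) axis.

    Hypothetical helper. Loads every chunk into memory, so it assumes
    the combined dataset fits in RAM.
    """
    import anndata as ad  # imported lazily; requires the anndata package

    adatas = [ad.read_h5ad(Path(p)) for p in chunk_paths]
    # axis=0 stacks cells; join="inner" keeps genes present in every chunk;
    # merge="same" keeps gene-level annotations that are identical across chunks.
    combined = ad.concat(adatas, axis=0, join="inner", merge="same")
    combined.write_h5ad(out_path)
    return combined
```

The bounds are 0-based and half-open, so they need to be shifted to 1-based inclusive indices before subsetting the SingleCellExperiment in R. Each chunk would be written with writeH5AD and the resulting paths passed, in order, to concat_h5ad_chunks.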

I am using R 4.2.0 and zellkonverter v1.6.5.

Have you encountered this issue with large datasets? I wanted to check with you first since creating a reproducible example I can share will take a substantial amount of work.

Best, Gabriel

lazappi commented 1 year ago

Hi @GabrielHoffman

That is indeed a large dataset! I think the largest I have ever tried is a few hundred thousand cells. I'm actually fairly impressed you managed to work with it in both R and Python and that it's just the conversion that seems to be the issue.

Have you tried running it with verbose = TRUE? That would be helpful for figuring out which part is failing.