Open mt1022 opened 10 months ago
I think that should work with the current master branch. Can you install the current version from Github with remotes::install_github("ilia-kats/MuData")
and try again?
@gtca Any chance we can update the package in Bioc?
You are right. I installed the latest developmental version and the down-sampled (and simplified) data can be loaded without any errors or warnings. It seems that the full dataset can be loaded too. However, although currently there is no error messages, this process is extremely slow. it has been running for about 20 min and has not finished at the time of writing :(. The full dataset has about 100k cells x 2k genes and can be loaded by scanpy in less than 1 min.
It looks like it's spending all that time inside rhdf5's _H5Dread
C function. I don't think there is anything we can do about it and I also don't know why it's taking that much longer in R compared to Python, since both are using the same HDF5 C library underneath. Perhaps @grimbough has an idea?
It looks like it's spending most of the time reading some fairly large compound datasets e.g. /uns/leiden_res14.0__rank_genes_groups/names
I'm not too familiar with the file structure, so I'm not sure if they need to be compound datatypes, but the look like they could just be arrays of string which I suspect would be quicker to read.
I'm actually surprised that the python implementation is so fast. Using the h5dump
command line tool to read the largest of those datasets takes over 3 minutes:
-> % { time h5dump -d /uns/leiden_res50.0__rank_genes_groups/names s_fca_biohub_head_10x.h5ad 1> /dev/null ; }
h5dump -d /uns/leiden_res50.0__rank_genes_groups/names > /dev/null 194,93s user 0,82s system 99% cpu 3:16,47 total
Do the python and R implementations actually do the same things?
The file structure can be examined in python:
import scanpy as sc
fca_head = sc.read_h5ad('s_fca_biohub_head_10x.h5ad')
fca_head.uns['leiden_res14.0__rank_genes_groups']['names']
/uns/leiden_res14.0__rank_genes_groups/names
is an array of tuples of strings. I have no idea why they use such complex data structure :(. It was probably generated by other software they used in the analysis.
Do the python and R implementations actually do the same things?
They should for most things, but I have to admit we haven't paid a lot of attention to compound datatypes (presumably the equivalent to numpy.recarray
) because we have never seen them used extensively in the wild and people tend to use DataFrames for this functionality nowadays.
Do the python and R implementations actually do the same things?
They should for most things, but I have to admit we haven't paid a lot of attention to compound datatypes (presumably the equivalent to
numpy.recarray
) because we have never seen them used extensively in the wild and people tend to use DataFrames for this functionality nowadays.
I guess my question was really "does scanpy actually read whatever /uns/leiden_res14.0__rank_genes_groups/names
is?". My first guess would have been that this was an inefficiency in how rhdf5
was dealing with compound data types, but based on how long it takes h5dump
to read that same structure, it looks like it's just slow regardless. Given that, I'd be very surprised python was actually reading it, but I'm very naive about the scanpy functionality.
/uns
is read unconditionally by both Python and R. /X
is only read if backed=False
, also in both Python and R. /layers
is read unconditionally in Python, but is only read if backed=FALSE
in R.
Interesting, I wonder how come it's so much more efficient in python. As you said earlier, they all ultimately use H5Dread
.
Dear developers,
I am trying to read an h5ad file into R with
MuData
and encountered the following error:This is my command:
I am not sure what's going wrong. This dataset can be loaded by
scanpy.read_h5ad
in python. I also tried to down-sample a subset of cells of this dataset in python and save a new h5ad. The down-sampled file cannot be read in R either.The file
s_fca_biohub_head_10x.h5ad
is available from Fly cell atlas: https://cloud.flycellatlas.org/index.php/s/LAEybPc2HZnpzKs