ilia-kats / MuData

MuData-compatible storage for Bioconductor's MultiAssayExperiment
https://ilia-kats.github.io/MuData/

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") when using readH5AD #8

Open mt1022 opened 8 months ago

mt1022 commented 8 months ago

Dear developers,

I am trying to read an h5ad file into R with MuData and encountered the following error:

Error in is.factor(values) && levels(values) == c("FALSE", "TRUE") : 
  'length = 2' in coercion to 'logical(1)'

This is my command:

library(Seurat)
library(MuData)
library(tidyverse)

fca_head <- readH5AD('s_fca_biohub_head_10x.h5ad')

I am not sure what's going wrong. This dataset can be loaded by scanpy.read_h5ad in Python. I also tried down-sampling this dataset to a subset of cells in Python and saving a new h5ad; the down-sampled file cannot be read in R either.

The file s_fca_biohub_head_10x.h5ad is available from the Fly Cell Atlas: https://cloud.flycellatlas.org/index.php/s/LAEybPc2HZnpzKs

ilia-kats commented 8 months ago

I think that should work with the current master branch. Can you install the current version from Github with remotes::install_github("ilia-kats/MuData") and try again?

@gtca Any chance we can update the package in Bioc?

mt1022 commented 8 months ago

You are right. I installed the latest development version, and the down-sampled (and simplified) data can be loaded without any errors or warnings. The full dataset seems to load as well. However, although there are currently no error messages, the process is extremely slow: it has been running for about 20 minutes and has not finished at the time of writing :(. The full dataset has about 100k cells x 2k genes and can be loaded by scanpy in less than 1 min.

ilia-kats commented 8 months ago

It looks like it's spending all that time inside rhdf5's H5Dread C function. I don't think there is anything we can do about it, and I also don't know why it takes that much longer in R than in Python, since both use the same HDF5 C library underneath. Perhaps @grimbough has an idea?

[profiling screenshot]

grimbough commented 8 months ago

It looks like it's spending most of the time reading some fairly large compound datasets, e.g. /uns/leiden_res14.0__rank_genes_groups/names. I'm not too familiar with the file structure, so I'm not sure whether they need to be compound datatypes, but they look like they could just be arrays of strings, which I suspect would be quicker to read.

I'm actually surprised that the python implementation is so fast. Using the h5dump command line tool to read the largest of those datasets takes over 3 minutes:

-> % { time h5dump -d /uns/leiden_res50.0__rank_genes_groups/names s_fca_biohub_head_10x.h5ad 1> /dev/null ; }
h5dump -d /uns/leiden_res50.0__rank_genes_groups/names  > /dev/null  194,93s user 0,82s system 99% cpu 3:16,47 total

Do the python and R implementations actually do the same things?

mt1022 commented 8 months ago

The file structure can be examined in python:

import scanpy as sc
fca_head = sc.read_h5ad('s_fca_biohub_head_10x.h5ad')
fca_head.uns['leiden_res14.0__rank_genes_groups']['names']

/uns/leiden_res14.0__rank_genes_groups/names is an array of tuples of strings. I have no idea why they use such a complex data structure :(. It was probably generated by other software they used in the analysis.
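For context, scanpy presumably exposes rank_genes_groups['names'] as a NumPy structured (record) array with one named string field per cluster, which is what an HDF5 compound datatype maps to in h5py. A miniature, hypothetical version (the gene and cluster names are made up):

```python
import numpy as np

# Hypothetical miniature of uns['...__rank_genes_groups']['names']:
# each record is one rank position across all clusters, and there is
# one named string field per cluster label.
names = np.array(
    [("geneA", "geneC"), ("geneB", "geneD")],
    dtype=[("0", "U10"), ("1", "U10")],  # one string field per cluster
)

print(names["0"])         # ranked gene names for cluster "0": ['geneA' 'geneB']
print(names.dtype.names)  # the cluster labels: ('0', '1')
```

Indexing by field name recovers a plain string array per cluster, so storing the same information as separate string arrays (one per cluster) would carry the same content without the compound datatype.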

ilia-kats commented 8 months ago

> Do the python and R implementations actually do the same things?

They should for most things, but I have to admit we haven't paid a lot of attention to compound datatypes (presumably the equivalent of numpy.recarray), because we have never seen them used extensively in the wild and people tend to use DataFrames for this functionality nowadays.

grimbough commented 8 months ago

> Do the python and R implementations actually do the same things?
>
> They should for most things, but I have to admit we haven't paid a lot of attention to compound datatypes (presumably the equivalent of numpy.recarray), because we have never seen them used extensively in the wild and people tend to use DataFrames for this functionality nowadays.

I guess my question was really "does scanpy actually read whatever /uns/leiden_res14.0__rank_genes_groups/names is?". My first guess would have been that this was an inefficiency in how rhdf5 was dealing with compound datatypes, but given how long it takes h5dump to read that same structure, it looks like it's just slow regardless. Given that, I'd be very surprised if Python were actually reading it, but I'm very naive about the scanpy functionality.

ilia-kats commented 8 months ago

/uns is read unconditionally by both Python and R. /X is only read if backed=False, also in both Python and R. /layers is read unconditionally in Python, but is only read if backed=FALSE in R.
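The eager-read behaviour just described can be summarized in a small helper (a hypothetical sketch for illustration, not actual MuData or anndata code):

```python
def groups_read_eagerly(implementation: str, backed: bool) -> set:
    """Hypothetical summary of which top-level h5ad groups are read eagerly.

    Per the description above: /uns is always read; /X only when not
    backed; /layers is read unconditionally by the Python implementation
    but only when backed=FALSE by the R implementation.
    """
    groups = {"/uns"}
    if not backed:
        groups.add("/X")
    if implementation == "python" or not backed:
        groups.add("/layers")
    return groups

# Even in backed mode, both implementations touch /uns -- which is
# where the large compound datasets in this file live.
print(groups_read_eagerly("r", backed=True))  # {'/uns'}
```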

grimbough commented 8 months ago

Interesting. I wonder why it's so much more efficient in Python; as you said earlier, both ultimately use H5Dread.