waldronlab / MultiAssayExperiment

Bioconductor package for management of multi-assay data
https://waldronlab.io/MultiAssayExperiment/
69 stars 32 forks source link

Error in h5checktype(). The provided H5Identifier is not a dataset identifier. #325

Open JHYSiu opened 1 year ago

JHYSiu commented 1 year ago

Hi

Sorry, I'm unfamiliar with converting between MuData and R format.

I am unable to load my h5mu data in. I keep getting "Error in h5checktype(). The provided H5Identifier is not a dataset identifier." I'm not sure how to fix this. Thanks!

From the rhdf5::H5Fopen, it appears to still match the format from the dummy example data.

I exported my h5mu with just mdata.write("FILENAME.h5mu") sessionInfo

lwaldron commented 1 year ago

Hi @JHYSiu, could you provide us with a reproducible example, using a file/dataset that we can access? I couldn't tell what you were doing when the error occurred. Note, the MuData converters were written by the MuData team and not by us, so we may not be able to help, but still I'd like to see a dataset and code to see what I can do.

JHYSiu commented 1 year ago

Thanks! Here's a massively subsetted example.

Python export code: subset_mdata.write("example_subsetted_mudata.h5mu")

R import code: sce <- readH5MU("example_subsetted_mudata.h5mu")

R session info

sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

Matrix products: default BLAS/LAPACK: /gpfs3/apps/eb/2020b/skylake/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3.0

locale: [1] C

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages: [1] MultiAssayExperiment_1.20.0 SummarizedExperiment_1.24.0 [3] Biobase_2.54.0 GenomicRanges_1.46.1
[5] GenomeInfoDb_1.30.1 IRanges_2.28.0
[7] MatrixGenerics_1.6.0 matrixStats_1.0.0
[9] MuData_0.99.9 rhdf5_2.45.1
[11] S4Vectors_0.32.4 BiocGenerics_0.40.0
[13] Matrix_1.5-3

loaded via a namespace (and not attached): [1] lattice_0.20-44 SingleCellExperiment_1.16.0 [3] bitops_1.0-7 rhdf5filters_1.6.0
[5] grid_4.1.0 zlibbioc_1.40.0
[7] XVector_0.34.0 Rhdf5lib_1.16.0
[9] tools_4.1.0 RCurl_1.98-1.12
[11] DelayedArray_0.20.0 compiler_4.1.0
[13] GenomeInfoDbData_1.2.7

example_subsetted_mudata.zip python_piplist.txt

vjcitn commented 1 year ago

Thanks for uploading the example. I have a feeling that the MuData package is going to need some enhancements to work with your use case. Here's how I look at the situation. In the MuData package there is an example that produces an h5mu file from a TCGA MultiAssayExperiment, called miniacc.h5mu. We can use the h5ls utility (get from hdfgroup or brew etc. if you don't already have it) to look at the layout:

%> h5ls miniacc.h5mu 
mod                      Group
obs                      Group
obsm                     Group
obsmap                   Group
uns                      Group
var                      Group

That's the top level Group list. Then drill down:

%> h5ls miniacc.h5mu/mod 
Mutations                Group
RNASeq2GeneNorm          Group
RPPAArray                Group
gistict                  Group
miRNASeqGene             Group
%> h5ls miniacc.h5mu/mod/Mutations
X                        Dataset {90, 97}
obs                      Group
var                      Group
%> h5ls miniacc.h5mu/mod/RNASeq2GeneNorm
X                        Dataset {79, 198}
obs                      Group
uns                      Group
var                      Group

We can see that as we descend in the Group hierarchy we find Dataset instances named X.

For your example, which was generated using python and not MuData::writeH5MU, we see

%> h5ls example_subsetted_mudata.h5mu
mod                      Group
obs                      Group
obsm                     Group
obsmap                   Group
obsp                     Group
uns                      Group
var                      Group
varm                     Group
varmap                   Group
varp                     Group
%> h5ls example_subsetted_mudata.h5mu/mod
prot                     Group
rna                      Group
%> h5ls example_subsetted_mudata.h5mu/mod/prot
X                        Dataset {21, 145}
layers                   Group
obs                      Group
obsm                     Group
obsp                     Group
uns                      Group
var                      Group
varm                     Group
varp                     Group

I am not sure that the MuData readH5MU is suited for this structure. You should contact the MuData authors for clarification. In their DESCRIPTION I see

Package: MuData
Title: Serialization for MultiAssayExperiment Objects

implying that the round trips must start with MAE. Here's where to post your question: https://github.com/ilia-kats/MuData/issues

Finally I do not think it will be difficult to dig data out of your .h5mu file using rhdf5 and/or reticulate with h5py imported. You might post a query to support.bioconductor.org where someone may have already tackled this.

lwaldron commented 1 year ago

Thanks for looking into this @vjcitn! Helpful for me too.

mtmorgan commented 1 year ago

rhdf5::h5ls() provides an alternative to the h5ls command line tool; especially useful with dplyr::as_tibble() and friends.

For what it's worth the (under development github) anndataR package provides some useful building blocks for a 'native' R parser, and the anndata / mudata structures are very similar. Here's some code that starts down the path (the RNA experiment, obs and assays only...; this uses the 'devel' version of Bioconductor and hence rhdf5, which includes an important extension for managing HDF5 'enum' types).

remotes::install_github("scverse/anndataR")

library(rhdf5)
library(dplyr)

## helper function to get names of group
group_names <-
    function(tbl, group_name)
{
    tbl |>
    filter(.data$group %in% group_name) |>
    pull(.data$name)
}

## 
## worked example
##

file <- "example_subsetted_mudata.h5mu"

## file content
tbl <-
    h5ls(file) |>
    tibble()

## sketch content of single assay; similarly for var and uns
assay <- "/mod/rna"
layers_group <- paste0(assay, "/layers")
obs_group <- paste0(assay, "/obs")
obsm_group <- paste0(assay, "/obsm")
obsp_group <- paste0(assay, "/obsp")

## layers
layers_names <- group_names(tbl, layers_group)
assays <-
    lapply(setNames(nm = layers_names), \(name, file, group) {
        path <- paste0(group, "/", name)
        ## FIXME: dgCMatrix, or HDF5Array::TENxMatrix (on-disk)?
        anndataR:::read_h5ad_sparse_array(file, path) |>
            Matrix::t()
}, file, layers_group)

## obs
obs <-
    anndataR:::read_h5ad_data_frame(file, obs_group) |>
    tibble()

## obsm
obsm_names <- group_names(tbl, obsm_group)
obsm <-
    lapply(setNames(nm = obsm_names), \(name, file, group) {
        path <- paste0(group, "/", name)
        anndataR:::read_h5ad_dense_array(file, path)
    }, file, obsm_group)

## obsp
obsp_names <- group_names(tbl, obsp_group)
obsp <-
    lapply(setNames(nm = obsp_names), \(name, file, group) {
        path <- paste0(group, "/", name)
        ## FIXME: dgCMatrix, transposed?
        anndataR:::read_h5ad_sparse_array(file, path)
    }, file, obsp_group)

## assemble to SingleCellExperiment
sce <- SingleCellExperiment::SingleCellExperiment(
    assays,
    colData = obs,
    reducedDims = obsm,
    colPairs = obsp
)

It's interesting / unfortunate that constructing a MAE loads most of the data (even if the 'large' matrix data were to be left on-disk via TENxMatrix) into memory...

JHYSiu commented 1 year ago

Excellent thank you everyone. I will have a play around and see if I can get it to work! I'll let you know if I come up with anything robust. :)