Open JHYSiu opened 1 year ago
Hi @JHYSiu, could you provide us with a reproducible example, using a file/dataset that we can access? I couldn't tell what you were doing when the error occurred. Note, the MuData converters were written by the MuData team and not by us, so we may not be able to help, but still I'd like to see a dataset and code to see what I can do.
Thanks! Here's a massively subsetted example.
Python export code: subset_mdata.write("example_subsetted_mudata.h5mu")
R import code: sce <- readH5MU("example_subsetted_mudata.h5mu")
R session info
sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)
Matrix products: default BLAS/LAPACK: /gpfs3/apps/eb/2020b/skylake/software/FlexiBLAS/3.0.4-GCC-10.3.0/lib64/libflexiblas.so.3.0
locale: [1] C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] MultiAssayExperiment_1.20.0 SummarizedExperiment_1.24.0
[3] Biobase_2.54.0 GenomicRanges_1.46.1
[5] GenomeInfoDb_1.30.1 IRanges_2.28.0
[7] MatrixGenerics_1.6.0 matrixStats_1.0.0
[9] MuData_0.99.9 rhdf5_2.45.1
[11] S4Vectors_0.32.4 BiocGenerics_0.40.0
[13] Matrix_1.5-3
loaded via a namespace (and not attached):
[1] lattice_0.20-44 SingleCellExperiment_1.16.0
[3] bitops_1.0-7 rhdf5filters_1.6.0
[5] grid_4.1.0 zlibbioc_1.40.0
[7] XVector_0.34.0 Rhdf5lib_1.16.0
[9] tools_4.1.0 RCurl_1.98-1.12
[11] DelayedArray_0.20.0 compiler_4.1.0
[13] GenomeInfoDbData_1.2.7
Thanks for uploading the example. I have a feeling that the MuData package is going to need some enhancements to work with your use case. Here's how I look at the situation. In the MuData package there is an example that produces an h5mu file from a TCGA MultiAssayExperiment, called miniacc.h5mu. We can use the h5ls utility (get from hdfgroup or brew etc. if you don't already have it) to look at the layout:
%> h5ls miniacc.h5mu
mod Group
obs Group
obsm Group
obsmap Group
uns Group
var Group
That's the top level Group list. Then drill down:
%> h5ls miniacc.h5mu/mod
Mutations Group
RNASeq2GeneNorm Group
RPPAArray Group
gistict Group
miRNASeqGene Group
%> h5ls miniacc.h5mu/mod/Mutations
X Dataset {90, 97}
obs Group
var Group
%> h5ls miniacc.h5mu/mod/RNASeq2GeneNorm
X Dataset {79, 198}
obs Group
uns Group
var Group
We can see that as we descend in the Group hierarchy we find Dataset instances named X.
For your example, which was generated using python and not MuData::writeH5MU, we see
%> h5ls example_subsetted_mudata.h5mu
mod Group
obs Group
obsm Group
obsmap Group
obsp Group
uns Group
var Group
varm Group
varmap Group
varp Group
%> h5ls example_subsetted_mudata.h5mu/mod
prot Group
rna Group
%> h5ls example_subsetted_mudata.h5mu/mod/prot
X Dataset {21, 145}
layers Group
obs Group
obsm Group
obsp Group
uns Group
var Group
varm Group
varp Group
I am not sure that the MuData readH5MU is suited for this structure. You should contact the MuData authors for clarification. In their DESCRIPTION I see
Package: MuData
Title: Serialization for MultiAssayExperiment Objects
implying that the round trips must start with MAE. Here's where to post your question: https://github.com/ilia-kats/MuData/issues
Finally I do not think it will be difficult to dig data out of your .h5mu file using rhdf5 and/or reticulate with h5py imported. You might post a query to support.bioconductor.org where someone may have already tackled this.
Thanks for looking into this @vjcitn! Helpful for me too.
rhdf5::h5ls()
provides an alternative to the h5ls
command line tool; especially useful with dplyr::as_tibble()
and friends.
For what it's worth the (under development github) anndataR package provides some useful building blocks for a 'native' R parser, and the anndata / mudata structures are very similar. Here's some code that starts down the path (the RNA experiment, obs and assays only...; this uses the 'devel' version of Bioconductor and hence rhdf5, which includes an important extension for managing HDF5 'enum' types).
remotes::install_github("scverse/anndataR")
library(rhdf5)
library(dplyr)
## helper function to get names of group
group_names <-
function(tbl, group_name)
{
tbl |>
filter(.data$group %in% group_name) |>
pull(.data$name)
}
##
## worked example
##
file <- "example_subsetted_mudata.h5mu"
## file content
tbl <-
h5ls(file) |>
tibble()
## sketch content of single assay; similarly for var and uns
assay <- "/mod/rna"
layers_group <- paste0(assay, "/layers")
obs_group <- paste0(assay, "/obs")
obsm_group <- paste0(assay, "/obsm")
obsp_group <- paste0(assay, "/obsp")
## layers
layers_names <- group_names(tbl, layers_group)
assays <-
lapply(setNames(nm = layers_names), \(name, file, group) {
path <- paste0(group, "/", name)
## FIXME: dgCMatrix, or HDF5Array::TENxMatrix (on-disk)?
anndataR:::read_h5ad_sparse_array(file, path) |>
Matrix::t()
}, file, layers_group)
## obs
obs <-
anndataR:::read_h5ad_data_frame(file, obs_group) |>
tibble()
## obsm
obsm_names <- group_names(tbl, obsm_group)
obsm <-
lapply(setNames(nm = obsm_names), \(name, file, group) {
path <- paste0(group, "/", name)
anndataR:::read_h5ad_dense_array(file, path)
}, file, obsm_group)
## obsp
obsp_names <- group_names(tbl, obsp_group)
obsp <-
lapply(setNames(nm = obsp_names), \(name, file, group) {
path <- paste0(group, "/", name)
## FIXME: dgCMatrix, transposed?
anndataR:::read_h5ad_sparse_array(file, path)
}, file, obsp_group)
## assemble to SingleCellExperiment
sce <- SingleCellExperiment::SingleCellExperiment(
assays,
colData = obs,
reducedDims = obsm,
colPairs = obsp
)
It's interesting / unfortunate that constructing a MAE loads most of the data (even if the 'large' matrix data were to be left on-disk via TENxMatrix) into memory...
Excellent thank you everyone. I will have a play around and see if I can get it to work! I'll let you know if I come up with anything robust. :)
Hi
Sorry, I'm unfamiliar with converting between MuData and R format.
I am unable to load my h5mu data in. I keep getting "Error in h5checktype(). The provided H5Identifier is not a dataset identifier." I'm not sure how to fix this. Thanks!
From the rhdf5::H5Fopen, it appears to still match the format from the dummy example data.
I exported my h5mu with just mdata.write("FILENAME.h5mu")