mtmorgan closed this issue 3 years ago.
Works fine as far as HDF5Array::H5ADMatrix() is concerned:
file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z
# "/tmp/Rtmpvwuwxz/a5be939a99f58_a5be939a99f58.h5ad"
library(HDF5Array)
H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
# [,1] [,2] [,3] ... [,4890] [,4891]
# [1,] 0 0 0 . 0 0
# [2,] 0 0 0 . 0 0
# [3,] 0 0 0 . 0 0
# [4,] 0 0 0 . 0 0
# [5,] 0 0 0 . 0 0
# ... . . . . . .
# [30187,] 0.00000000 0.00000000 0.00000000 . 0 0
# [30188,] 0.00000000 0.00000000 0.00000000 . 0 0
# [30189,] 0.06667092 0.00000000 0.00000000 . 0 0
# [30190,] 0.00000000 0.00000000 0.00000000 . 0 0
# [30191,] 0.00000000 0.00000000 0.00000000 . 0 0
# Warning message:
# In .load_h5ad_dimnames(filepath) :
# could not find dimnames (normally stored in datasets '/var/_index' and
# '/obs/_index') in this .h5ad file
(I still need to work on the dimnames part.)
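In the meantime, one possible stopgap (a sketch only, not the package's own mechanism) is to locate the names in the file yourself and attach them to the returned object; the dataset paths below are hypothetical and should be replaced with whatever h5ls() reports for the file at hand:
library(HDF5Array)
library(rhdf5)

## Inspect the file layout to find where the row/column names actually live
h5ls(file_path)

M <- H5ADMatrix(file_path)

## '/var/index' and '/obs/index' are hypothetical dataset names; substitute
## the ones reported by h5ls() above
dimnames(M) <- list(
    as.character(h5read(file_path, "/var/index")),
    as.character(h5read(file_path, "/obs/index"))
)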
sessionInfo():
> sessionInfo()
R Under development (unstable) (2021-02-05 r79941)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10
Matrix products: default
BLAS: /home/hpages/R/R-4.1.r79941/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79941/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] HDF5Array_1.19.6 rhdf5_2.35.2 DelayedArray_0.17.9
[4] IRanges_2.25.6 S4Vectors_0.29.7 MatrixGenerics_1.3.1
[7] matrixStats_0.58.0 BiocGenerics_0.37.1 Matrix_1.3-2
[10] hca_0.99.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 pillar_1.5.0 compiler_4.1.0
[4] dbplyr_2.1.0 rhdf5filters_1.3.4 tools_4.1.0
[7] bit_4.0.4 lattice_0.20-41 jsonlite_1.7.2
[10] BiocFileCache_1.15.1 RSQLite_2.2.3 memoise_2.0.0
[13] lifecycle_1.0.0 tibble_3.1.0 pkgconfig_2.0.3
[16] rlang_0.4.10 DBI_1.1.1 cli_2.3.1
[19] rstudioapi_0.13 filelock_1.0.2 curl_4.3
[22] fastmap_1.1.0 withr_2.4.1 dplyr_1.0.4
[25] httr_1.4.2 generics_0.1.0 vctrs_0.3.6
[28] rappdirs_0.3.3 grid_4.1.0 bit64_4.0.5
[31] tidyselect_1.1.0 glue_1.4.2 R6_2.5.0
[34] fansi_0.4.2 Rhdf5lib_1.13.4 purrr_0.3.4
[37] tidyr_1.1.2 blob_1.2.1 magrittr_2.0.1
[40] ps_1.5.0 ellipsis_0.3.1 assertthat_0.2.1
[43] utf8_1.1.4 cachem_1.0.4 crayon_1.4.1
I believe this issue is discussed in #37 and #41.
FWIW, starting with HDF5Array 1.19.7, H5ADMatrix() should always be able to find the dimnames (unless the AnnData folks have another sneaky way to hide them in the file):
file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z
# "/tmp/RtmpxFYLjF/a8e7d957870a_a8e7d957870a.h5ad"
library(HDF5Array)
H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
# 24087_4#281-vento18 ... 23728_8#64-vento18
# MIR1302-2HG 0 . 0
# FAM138A 0 . 0
# FO538757.2 0 . 0
# FAM87B 0 . 0
# LINC00115 0 . 0
# ... . . .
# AP001469.3 0.00000000 . 0
# AC136352.2 0.00000000 . 0
# BX004987.1 0.06667092 . 0
# AC145212.1 0.00000000 . 0
# AC213203.2 0.00000000 . 0
Should be fixed in devel with v1.1.5
I have the same issue with bioconductor-zellkonverter 1.8.0.
@royfrancis Can you please provide some code/file to reproduce the issue? I'm hoping this exact issue is solved but you could be getting the same warning for other reasons.
Many hours later...
Now I have a docker image to reproduce the error.
docker run --rm --user "$(id -u):$(id -g)" -v ${PWD}:/zell/work royfrancis/zellkonverter:1.8.0
The h5ad file that I am converting is unfortunately huge: Extended GBmap (11GB) from here. Rename local.h5ad to file.h5ad. Then run docker:
docker run --user "$(id -u):$(id -g)" --rm -v ${PWD}:/zell/work zell:1.8
Warning message:
'X' matrix does not support transposition and has been skipped
It worked with a few other smaller files that I tried, for example files with 1295, 7999, or 141401 cells.
Maybe it's an issue with the h5ad file? 🤔
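One way to check whether the file itself is unusual is to inspect its layout without loading anything into memory, for example with rhdf5 (a sketch; the group and attribute names are whatever the file actually contains):
library(rhdf5)

## Top-level groups of an AnnData file: typically X, obs, var, raw, layers, ...
h5ls("file.h5ad", recursive = FALSE)

## Attributes of the X group, e.g. encoding-type, encoding-version, shape
h5readAttributes("file.h5ad", "X")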
To build image from scratch, place the three files below into a directory:
Dockerfile
FROM r-base:4.1.1
LABEL Description="Docker image for zellkonverter"
RUN apt-get update \
&& apt-get install --no-install-recommends -y build-essential \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& mkdir /zell \
&& mkdir /zell/scripts /zell/work
COPY env.yml convert.R /zell/scripts/
ENV CONDA_DIR=$HOME/miniconda3
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
&& /bin/bash ~/miniconda.sh -b -p $CONDA_DIR \
&& rm ~/miniconda.sh
ENV PATH=$CONDA_DIR/bin:$PATH
RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f /zell/scripts/env.yml
WORKDIR /zell/work
CMD conda run -n zell Rscript /zell/scripts/convert.R
env.yml
name: zell
channels:
- bioconda
- conda-forge
dependencies:
- python=3.7
- bioconductor-zellkonverter
- pip
- pip:
- anndata
convert.R
library(zellkonverter)
anndata <- reticulate::import("anndata")
if(!file.exists("file.h5ad")) stop(file.path(getwd(),"file.h5ad does not exist."))
message(paste0("Reading ",getwd(),"/file.h5ad..."))
adata <- anndata$read_h5ad("file.h5ad")
message("Converting ANN to SCE...")
sce <- zellkonverter::AnnData2SCE(adata)
message(paste0("Exporting ",getwd(),"/sce.rds..."))
saveRDS(sce,"sce.Rds")
message("Completed.")
Run this in that directory to build the docker image.
docker build -t zellkonverter:1.8.0 .
I have had a quick look at this. I think what is happening is that the dataset is too big and transposing the X matrix in R fails (probably because it runs out of memory). The warning message isn't very helpful in this case, so that can be improved, but maybe we should also look again at doing the transposition in Python, as I suggested in #9.
I don't know the context, but if the goal is to support Bioconductor workflows, does it make sense to use
HDF5Array::H5SparseMatrix(glioblastoma_h5ad_file, "/X")
which has done the transformation anyway? And it is more-or-less instant, since it does not actually load the entire data into memory?
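For example, a minimal sketch (assuming glioblastoma_h5ad_file points to a local copy of the .h5ad file):
library(HDF5Array)

## Map the central /X matrix lazily; nothing is read into memory at this point
X <- H5SparseMatrix(glioblastoma_h5ad_file, "/X")

## Transposition is a delayed operation on the DelayedMatrix, so it is
## essentially instant as well; blocks are only read when the data are realized
tX <- t(X)
dim(tX)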
Oops, I see a similar comment from @hpages upstream. And perhaps the problem is in AnnData2SCE(), but one should just be going for the R representation directly, without Python?
Yeah, I think this would be the correct approach for the R reader, but I'm not sure it helps with the Python reader, where we already have the data in memory. This kind of scalability is probably another reason to work on the R reader more; at the very least you avoid having two copies of everything.
Was this issue ever solved? I have the same problem:
sce = readH5AD(file, use_hdf5=TRUE)
# Warning: 'X' matrix does not support transposition and has been skipped
# ℹ Using stored X_name value 'counts'
And the resulting counts matrix is all zeroes.
Zellkonverter didn't work for my file of interest, so my solution was simply to export as text and convert into R format. The code I used is below. This solution is extremely memory intensive depending on the size of your dataset. There are probably better ways to do this.
H5AD -> CSV -> R
# python
import scanpy as sc
import numpy as np
import pandas as pd
print("Reading data...")
adata = sc.read_h5ad("file.h5ad")
print("Writing data...")
t=adata.raw.X.toarray()
pd.DataFrame(data=t, index=adata.obs_names, columns=adata.raw.var_names).to_csv('raw-counts.csv')
# comma separated csv with header, no rownames
pd.DataFrame(adata.obs).to_csv("metadata.csv")
# R
library(Seurat)
message("Reading counts...")
x <- read.csv("raw-counts.csv",header=TRUE)
rownames(x) <- x[,1]
x[,1] <- NULL
print(dim(x))
print(x[1:5,1:5])
message("Reading metadata...")
m <- read.csv("metadata.csv",header=TRUE)
rownames(m) <- m[,1]
colnames(m)[1] <- "sample"
print(dim(m))
print(head(m))
message("Writing seurat object...")
saveRDS(
CreateSeuratObject(counts=t(x),meta.data=m,project="extended",min.cells=0,min.features=0),
"seurat.Rds"
)
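As a lower-memory alternative to going through CSV, the raw counts can often be mapped directly from the .h5ad file as a sparse DelayedMatrix; a sketch, assuming the raw matrix is stored under /raw/X (check the layout first, e.g. with rhdf5::h5ls()):
library(HDF5Array)

## Lazily map the raw counts; nothing is loaded into memory yet
raw_counts <- H5SparseMatrix("file.h5ad", "/raw/X")
dim(raw_counts)

## Realize only the blocks you need, e.g. the first few rows and columns
as.matrix(raw_counts[1:5, 1:5])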
@royfrancis Thanks for your response! I ended up being able to just download the raw data I needed from a public repository in R format, skipping the h5ad step. Hope this issue will be fixed soon!
@jenellewallace It seems like you came up with a solution for your case, but for anybody else we made updates to the R reader in the v1.10.0 release which should help with these kinds of issues (and should be further improved in the next release).
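For anyone following along, the native R reader can be selected explicitly; a sketch, assuming a recent zellkonverter release where the reader argument of readH5AD() is available:
library(zellkonverter)

## Use the native R reader and keep matrices HDF5-backed, avoiding the
## Python round trip and the in-memory transposition
sce <- readH5AD("file.h5ad", use_hdf5 = TRUE, reader = "R")
sce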
Hi everyone,
I would like to jump in on this. I am working with the dataset from GSE174188 (GSE174188_CLUES1_adjusted.h5ad.gz).
In R/4.1.3 with zellkonverter 1.4.0, I was able to simply obtain the data with
lupus <- readH5AD(file=lupus_file, use_hdf5 = TRUE, raw = TRUE, verbose = TRUE)
However, after switching to R/4.2.0 and zellkonverter 1.8.0, the same code leads to the following error:
Error in validObject(.Object) :
invalid class “SummarizedExperiment” object:
'x@assays' is not parallel to 'x'
Calls: readH5AD ... new_SummarizedExperiment -> new -> initialize -> initialize -> validObject
In addition: Warning messages:
1: raw 'X' matrix does not support transposition and has been skipped
2: The names of these selected raw var columns have been modified to match R
conventions:
'feature_types-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0'
->
'feature_types.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0'
Execution halted
It turns out that the warning "raw 'X' matrix does not support transposition and has been skipped" is the underlying cause: it leaves the assay and rowData dimensions mismatched, which ultimately makes SummarizedExperiment complain.
In the light of this thread, I also tried with zellkonverter 1.11.0, but this did not resolve my issues.
> sessionInfo()
R version 4.2.0 alpha (2022-04-04 r82084)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] dplyr_1.1.2 scuttle_1.8.4
[3] SingleCellExperiment_1.20.1 SummarizedExperiment_1.28.0
[5] Biobase_2.58.0 GenomicRanges_1.50.2
[7] GenomeInfoDb_1.34.9 IRanges_2.32.0
[9] S4Vectors_0.36.2 BiocGenerics_0.44.0
[11] MatrixGenerics_1.10.0 matrixStats_0.63.0
[13] reticulate_1.28 zellkonverter_1.11.0
[15] basilisk.utils_1.10.0 basilisk_1.10.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 here_1.0.1
[3] dir.expiry_1.6.0 lattice_0.21-8
[5] png_0.1-8 rprojroot_2.0.3
[7] digest_0.6.31 utf8_1.2.3
[9] R6_2.5.1 evaluate_0.20
[11] pillar_1.9.0 sparseMatrixStats_1.10.0
[13] zlibbioc_1.44.0 rlang_1.1.0
[15] rstudioapi_0.14 Matrix_1.5-4
[17] rmarkdown_2.21 BiocParallel_1.32.6
[19] RCurl_1.98-1.12 beachmat_2.14.2
[21] HDF5Array_1.26.0 DelayedArray_0.24.0
[23] compiler_4.2.0 xfun_0.39
[25] pkgconfig_2.0.3 htmltools_0.5.5
[27] tidyselect_1.2.0 tibble_3.2.1
[29] GenomeInfoDbData_1.2.9 codetools_0.2-19
[31] fansi_1.0.4 withr_2.5.0
[33] rhdf5filters_1.10.1 bitops_1.0-7
[35] grid_4.2.0 jsonlite_1.8.4
[37] lifecycle_1.0.3 magrittr_2.0.3
[39] cli_3.6.1 XVector_0.38.0
[41] renv_0.17.3 fs_1.6.2
[43] DelayedMatrixStats_1.20.0 filelock_1.0.2
[45] generics_0.1.3 vctrs_0.6.2
[47] Rhdf5lib_1.20.0 tools_4.2.0
[49] forcats_1.0.0 glue_1.6.2
[51] parallel_4.2.0 fastmap_1.1.1
[53] yaml_2.3.7 rhdf5_2.42.1
[55] BiocManager_1.30.20 knitr_1.42
> env <- zellkonverterAnnDataEnv(version)
> env
An object of class "BasiliskEnvironment"
Slot "envname":
[1] "zellkonverterAnnDataEnv-0.8.0"
Slot "pkgname":
[1] "zellkonverter"
Slot "packages":
[1] "anndata==0.8.0" "h5py==3.6.0" "hdf5==1.12.1" "natsort==8.1.0"
[5] "numpy==1.22.3" "packaging==21.3" "pandas==1.4.2" "python==3.8.13"
[9] "scipy==1.7.3" "sqlite==3.38.2"
Slot "channels":
[1] "conda-forge"
Slot "pip":
character(0)
Slot "paths":
character(0)
Note that I am still able to load the data with older versions on the same machine, so I do not expect memory issues to be the culprit here...
One more hint: it seems that this call
raw_x <- zellkonverter:::.extract_or_skip_assay(skip_assays = skip_assays,
hdf5_backed = hdf5_backed, dims = dims, mat = adata$raw$X,
name = "raw 'X' matrix")
dim(raw_x$mat)
results in a 32K x 1.2M matrix in the old versions, but now only returns a 1999 x 1.2M matrix.
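A quick way to narrow this down (a sketch, reusing the reticulate handle for the AnnData object used in the call above; exact accessors depend on the anndata version) is to compare what Python reports for the raw matrix with what ends up on the R side:
## What Python reports for the raw matrix (cells x genes, before transposition)
adata$raw$X$shape
dim(adata$raw$var)

## What the internal helper returned on the R side
dim(raw_x$mat)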
Let me know if I can provide any more info to resolve this!
@jgilis I think this is a separate issue so I have moved it to #96
My understanding (from @hpages) is that transposed assay data matrices should be supported in HDF5 files, but after downloading Human Cell Atlas h5ad data at https://data.humancellatlas.org/explore/projects/f83165c5-e2ea-4d15-a5cf-33f3550bffde/expression-matrices?catalog=dcp2 (look for the h5ad files) I see the "'X' matrix does not support transposition and has been skipped" warning, and indeed the matrix is all zeros.
This can be reproduced programmatically, for example along the lines of the sketch below.
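A minimal sketch, assuming the .h5ad file has already been downloaded to a local path (the filename below is hypothetical):
library(zellkonverter)
library(SingleCellExperiment)

## Hypothetical local copy of the downloaded HCA .h5ad file
file <- "local.h5ad"

sce <- readH5AD(file, use_hdf5 = TRUE)
## If the issue is triggered, this emits:
## Warning: 'X' matrix does not support transposition and has been skipped

## ...and the stored assay comes back as all zeros
max(assay(sce, 1))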