theislab / zellkonverter

Conversion between scRNA-seq objects
https://theislab.github.io/zellkonverter/
Other
145 stars 27 forks source link

support transposed assay data #42

Closed mtmorgan closed 3 years ago

mtmorgan commented 3 years ago

My understanding (from @hpages) is that transposed assay data matrices should be supported in HDF files, but after downloading Human Cell Atlas h5ad data at https://data.humancellatlas.org/explore/projects/f83165c5-e2ea-4d15-a5cf-33f3550bffde/expression-matrices?catalog=dcp2 (look for the h5ad files) I see

> sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
Warning message:
In .extract_or_skip_assay(skip_assays = skip_assays, hdf5_backed = hdf5_backed,  :
  'X' matrix does not support transposition and has been skipped

and indeed the matrix is all zeros.

This can be reproduced programmatically with

BiocManager::install("Bioconductor/hca")
library("hca"); library("dplyr")

filters <- filters(
    projectId = list(is = "f83165c5-e2ea-4d15-a5cf-33f3550bffde"),
    fileFormat = list(is = "h5ad")
)
files <-
    files(filters) %>%
    head(1)

file_path <- files_download(files)

sce <- zellkonverter::readH5AD(file_path, use_hdf5 = TRUE)
hpages commented 3 years ago

Works fine as far as HDF5Array::H5ADMatrix() is concerned:

file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z 
#               "/tmp/Rtmpvwuwxz/a5be939a99f58_a5be939a99f58.h5ad" 

library(HDF5Array)

H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
#                [,1]       [,2]       [,3] ... [,4890] [,4891]
#     [1,]          0          0          0   .       0       0
#     [2,]          0          0          0   .       0       0
#     [3,]          0          0          0   .       0       0
#     [4,]          0          0          0   .       0       0
#     [5,]          0          0          0   .       0       0
#      ...          .          .          .   .       .       .
# [30187,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30188,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30189,] 0.06667092 0.00000000 0.00000000   .       0       0
# [30190,] 0.00000000 0.00000000 0.00000000   .       0       0
# [30191,] 0.00000000 0.00000000 0.00000000   .       0       0
# Warning message:
# In .load_h5ad_dimnames(filepath) :
#   could not find dimnames (normally stored in datasets '/var/_index' and
#   '/obs/_index') in this .h5ad file

(I still need to work on the dimnames part.)

sessionInfo():

> sessionInfo()
R Under development (unstable) (2021-02-05 r79941)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r79941/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79941/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] HDF5Array_1.19.6     rhdf5_2.35.2         DelayedArray_0.17.9 
 [4] IRanges_2.25.6       S4Vectors_0.29.7     MatrixGenerics_1.3.1
 [7] matrixStats_0.58.0   BiocGenerics_0.37.1  Matrix_1.3-2        
[10] hca_0.99.0          

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6           pillar_1.5.0         compiler_4.1.0      
 [4] dbplyr_2.1.0         rhdf5filters_1.3.4   tools_4.1.0         
 [7] bit_4.0.4            lattice_0.20-41      jsonlite_1.7.2      
[10] BiocFileCache_1.15.1 RSQLite_2.2.3        memoise_2.0.0       
[13] lifecycle_1.0.0      tibble_3.1.0         pkgconfig_2.0.3     
[16] rlang_0.4.10         DBI_1.1.1            cli_2.3.1           
[19] rstudioapi_0.13      filelock_1.0.2       curl_4.3            
[22] fastmap_1.1.0        withr_2.4.1          dplyr_1.0.4         
[25] httr_1.4.2           generics_0.1.0       vctrs_0.3.6         
[28] rappdirs_0.3.3       grid_4.1.0           bit64_4.0.5         
[31] tidyselect_1.1.0     glue_1.4.2           R6_2.5.0            
[34] fansi_0.4.2          Rhdf5lib_1.13.4      purrr_0.3.4         
[37] tidyr_1.1.2          blob_1.2.1           magrittr_2.0.1      
[40] ps_1.5.0             ellipsis_0.3.1       assertthat_0.2.1    
[43] utf8_1.1.4           cachem_1.0.4         crayon_1.4.1        
LTLA commented 3 years ago

I believe this issue is discussed in #37 and #41.

hpages commented 3 years ago

FWIW, starting with HDF5Array 1.19.7, H5ADMatrix() should always be able to find the dimnames (unless the AnnData folks have another sneaky way to hide them in the file):

file_path
# 6d4fedcf-857d-5fbb-9928-8b9605500a69-2020-11-20T09:03:11.285229Z 
#                 "/tmp/RtmpxFYLjF/a8e7d957870a_a8e7d957870a.h5ad" 

library(HDF5Array)

H5ADMatrix(file_path)
# <30191 x 4891> sparse matrix of class H5ADMatrix and type "double":
#             24087_4#281-vento18 ...  23728_8#64-vento18
# MIR1302-2HG                   0   .                   0
#     FAM138A                   0   .                   0
#  FO538757.2                   0   .                   0
#      FAM87B                   0   .                   0
#   LINC00115                   0   .                   0
#         ...                   .   .                   .
#  AP001469.3          0.00000000   .                   0
#  AC136352.2          0.00000000   .                   0
#  BX004987.1          0.06667092   .                   0
#  AC145212.1          0.00000000   .                   0
#  AC213203.2          0.00000000   .                   0
lazappi commented 3 years ago

Should be fixed in devel with v1.1.5

royfrancis commented 1 year ago

I have the same issue with bioconductor-zellkonverter 1.8.0.

lazappi commented 1 year ago

@royfrancis Can you please provide some code/file to reproduce the issue? I'm hoping this exact issue is solved but you could be getting the same warning for other reasons.

royfrancis commented 1 year ago

Many hours later...

Now I have a docker image to reproduce the error.

docker run --rm --user "$(id -u):$(id -g)" -v ${PWD}:/zell/work royfrancis/zellkonverter:1.8.0

The h5ad file that I am converting is unfortunately huge. Extended GBmap (11GB) from here. Rename local.h5ad to file.h5ad. Then run docker:

docker run --user "$(id -u):$(id -g)" --rm -v ${PWD}:/zell/work zell:1.8

Warning message:
'X' matrix does not support transposition and has been skipped

It worked with a few other smaller files that I tried. For example 1295 cells or 7999 cells or 141401 cells.

Maybe it's an issue with the h5ad file? 🤔


To build image from scratch, place the three files below into a directory:

Dockerfile

FROM r-base:4.1.1
LABEL Description="Docker image for zellkonverter"

RUN apt-get update \
   && apt-get install --no-install-recommends -y build-essential \
   && apt-get clean \
   && rm -rf /var/lib/apt/lists/* \
   && mkdir /zell \
   && mkdir /zell/scripts /zell/work

COPY env.yml convert.R /zell/scripts/

ENV CONDA_DIR=$HOME/miniconda3
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
  && /bin/bash ~/miniconda.sh -b -p $CONDA_DIR \
  && rm ~/miniconda.sh
ENV PATH=$CONDA_DIR/bin:$PATH

RUN conda install mamba -n base -c conda-forge
RUN mamba env create -f /zell/scripts/env.yml
WORKDIR /zell/work
CMD conda run -n zell Rscript /zell/scripts/convert.R

env.yml

name: zell
channels:
 - bioconda
 - conda-forge
dependencies:
 - python=3.7
 - bioconductor-zellkonverter
 - pip
 - pip:
    - anndata

convert.R

library(zellkonverter)
anndata <- reticulate::import("anndata")
if(!file.exists("file.h5ad")) stop(file.path(getwd(),"file.h5ad does not exist."))
message(paste0("Reading ",getwd(),"/file.h5ad..."))
adata <- anndata$read_h5ad("file.h5ad")
message("Converting ANN to SCE...")
sce <- zellkonverter::AnnData2SCE(adata)
message(paste0("Exporting ",getwd(),"/sce.rds..."))
saveRDS(sce,"sce.Rds")
message("Completed.")

Run this in that directory to build the docker image. docker build -t zellkonverter:1.8.0 .

lazappi commented 1 year ago

I have had a quick look at this. I think what is happening is that the dataset is too big and transposing the X matrix in R fails (probably because of running out of memory). The warning message isn't very helpful in this case so that can be improved but maybe we should also look again at doing the transposition in Python as I suggested here #9.

mtmorgan commented 1 year ago

I don't know the context but if the goal is to support Bioconductor work flows does it make sense to

HDF5Array::H5SparseMatrix(glioblastoma_h5ad_file, "/X")

which has done the transformation anyway? And is more-or-less instant, since it does not actually load the entire data into memory?

Oops I see a similar comment from @hpages upstream. And also perhaps that the problem is in AnnData2SCE(), but one should just be going for the R representation directly, without python?

lazappi commented 1 year ago

Yeah I think this would be the correct approach for the R reader but not sure it helps with the Python reader where we already have the data in memory. This kind of scalability is probably another reason to work on the R reader more, at the very least you avoid having two copies of everything.

jenellewallace commented 1 year ago

Was this issue ever solved? I have the same problem:

sce = readH5AD(file, use_hdf5=TRUE) Warning: 'X' matrix does not support transposition and has been skippedâ„ą Using stored X_name value 'counts'

And the resulting counts matrix is all zeroes.

royfrancis commented 1 year ago

Zellkonverter didn't work for my file of interest, so my solution was simply to export as text and convert into R format. The code I used is below. This solution is extremely memory intensive depending on the size of your dataset. There are probably better ways to do this.

H5AD -> CSV -> R

# python

import scanpy as sc
import numpy as np
import pandas as pd

print("Reading data...")
adata = sc.read_h5ad("file.h5ad")

print("Writing data...")
t=adata.raw.X.toarray()
pd.DataFrame(data=t, index=adata.obs_names, columns=adata.raw.var_names).to_csv('raw-counts.csv')
# comma separated csv with header, no rownames
pd.DataFrame(adata.obs).to_csv("metadata.csv")
# R

library(Seurat)

message("Reading counts...")
x <- read.csv("raw-counts.csv",header=TRUE)
rownames(x) <- x[,1]
x[,1] <- NULL
print(dim(x))
print(x[1:5,1:5])

message("Reading metadata...")
m <- read.csv("metadata.csv",header=TRUE)
rownames(m) <- m[,1]
colnames(m)[1] <- "sample"
print(dim(m))
print(head(m))

message("Writing seurat object...")
saveRDS(
  CreateSeuratObject(counts=t(x),meta.data=m,project="extended",min.cells=0,min.features=0),
  "seurat.Rds"
)
jenellewallace commented 1 year ago

@royfrancis Thanks for your response! I ended up being able to just download the raw data I needed from a public repository into R format so skipping the h5 step. Hope this issue will be fixed soon!

lazappi commented 1 year ago

Was this issue ever solved? I have the same problem:

sce = readH5AD(file, use_hdf5=TRUE) Warning: 'X' matrix does not support transposition and has been skippedâ„ą Using stored X_name value 'counts'

And the resulting counts matrix is all zeroes.

@jenellewallace It seems like you came up with a solution for your case but for anybody else we made updates to the R reader in the v1.10.0 release which should help with these kinds of issues (and should be further improved in the next release).

jgilis commented 1 year ago

Hi everyone,

I would like to jump in on this. I am working with the dataset from GSE174188 (GSE174188_CLUES1_adjusted.h5ad.gz).

In R/4.1.3, zellkonverter 1.4.0, I was able to simply obtain the data with

lupus <- readH5AD(file=lupus_file, use_hdf5 = TRUE, raw = TRUE, verbose = TRUE)

However, after switching to R/4.2.0, zellkoverter 1.8.0, the same code leads to following error:

Error in validObject(.Object) :
 invalid class “SummarizedExperiment” object:
  'x@assays' is not parallel to 'x'
Calls: readH5AD ... new_SummarizedExperiment -> new -> initialize -> initialize -> validObject
In addition: Warning messages:
1: raw 'X' matrix does not support transposition and has been skipped
2: The names of these selected raw var columns have been modified to match R
conventions:
'feature_types-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0'
->
'feature_types.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0'
Execution halted

It turns out that that the warning "raw 'X' matrix does not support transposition and has been skipped" is the underlying cause of the assays and rowdata dimensions being mismatched, which finally results in SummarizedExperiment complaining.

In the light of this thread, I also tried with zellkonverter 1.11.0, but this did not resolve my issues.

jgilis commented 1 year ago
> sessionInfo()
R version 4.2.0 alpha (2022-04-04 r82084)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] dplyr_1.1.2                 scuttle_1.8.4              
 [3] SingleCellExperiment_1.20.1 SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0              GenomicRanges_1.50.2       
 [7] GenomeInfoDb_1.34.9         IRanges_2.32.0             
 [9] S4Vectors_0.36.2            BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] reticulate_1.28             zellkonverter_1.11.0       
[15] basilisk.utils_1.10.0       basilisk_1.10.2            

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10               here_1.0.1               
 [3] dir.expiry_1.6.0          lattice_0.21-8           
 [5] png_0.1-8                 rprojroot_2.0.3          
 [7] digest_0.6.31             utf8_1.2.3               
 [9] R6_2.5.1                  evaluate_0.20            
[11] pillar_1.9.0              sparseMatrixStats_1.10.0 
[13] zlibbioc_1.44.0           rlang_1.1.0              
[15] rstudioapi_0.14           Matrix_1.5-4             
[17] rmarkdown_2.21            BiocParallel_1.32.6      
[19] RCurl_1.98-1.12           beachmat_2.14.2          
[21] HDF5Array_1.26.0          DelayedArray_0.24.0      
[23] compiler_4.2.0            xfun_0.39                
[25] pkgconfig_2.0.3           htmltools_0.5.5          
[27] tidyselect_1.2.0          tibble_3.2.1             
[29] GenomeInfoDbData_1.2.9    codetools_0.2-19         
[31] fansi_1.0.4               withr_2.5.0              
[33] rhdf5filters_1.10.1       bitops_1.0-7             
[35] grid_4.2.0                jsonlite_1.8.4           
[37] lifecycle_1.0.3           magrittr_2.0.3           
[39] cli_3.6.1                 XVector_0.38.0           
[41] renv_0.17.3               fs_1.6.2                 
[43] DelayedMatrixStats_1.20.0 filelock_1.0.2           
[45] generics_0.1.3            vctrs_0.6.2              
[47] Rhdf5lib_1.20.0           tools_4.2.0              
[49] forcats_1.0.0             glue_1.6.2               
[51] parallel_4.2.0            fastmap_1.1.1            
[53] yaml_2.3.7                rhdf5_2.42.1             
[55] BiocManager_1.30.20       knitr_1.42 
> env <- zellkonverterAnnDataEnv(version)
> env
An object of class "BasiliskEnvironment"
Slot "envname":
[1] "zellkonverterAnnDataEnv-0.8.0"

Slot "pkgname":
[1] "zellkonverter"

Slot "packages":
 [1] "anndata==0.8.0"  "h5py==3.6.0"     "hdf5==1.12.1"    "natsort==8.1.0" 
 [5] "numpy==1.22.3"   "packaging==21.3" "pandas==1.4.2"   "python==3.8.13" 
 [9] "scipy==1.7.3"    "sqlite==3.38.2" 

Slot "channels":
[1] "conda-forge"

Slot "pip":
character(0)

Slot "paths":
character(0)
jgilis commented 1 year ago

Note that I am still able to load the data with older versions on the same machine, so I do not expect memory issues to be the culprit here...

One more hint: it seems that this line

raw_x <- zellkonverter:::.extract_or_skip_assay(skip_assays = skip_assays, 
    hdf5_backed = hdf5_backed, dims = dims, mat = adata$raw$X, 
    name = "raw 'X' matrix")
dim(raw_x$mat)

results in a 32K x 1.2M matrix in the old versions, but now only returns a 1999 x 1.2M matrix.

Let me know if I can provide any more info to resolve this!

lazappi commented 1 year ago

@jgilis I think this is a separate issue so I have moved it to #96