Reading `raw` data and differences between Python and R reader

PeteHaitch commented 2 years ago

Thanks for making it possible to read h5ad files into R.

I had a few issues/questions after trying to get the raw count matrix from a public h5ad file. Some of these points were touched on in https://github.com/theislab/zellkonverter/issues/57 and https://github.com/theislab/zellkonverter/issues/63 but I hoped to re-visit and clarify some things. Apologies if these questions are naive or misguided; I'm not very familiar with AnnData format and the structure of the particular h5ad file I was working with doesn't seem to match with that described in https://anndata.readthedocs.io/en/latest/fileformat-prose.html. The reprex demonstrates the points but I've summarised them below:

Why is raw stuck in as an altExp rather than an assay when using readH5AD()? For the example below, the raw data has the same dimensions as the X data and is essentially the counts() data I'd expect to find in a SingleCellExperiment. Ahh, the figure in ?AnnData2SCE suggests raw may not have the same dimensions as X because the latter may have undergone filtering, so I guess this precludes readH5AD() being able to assume raw is analogous to counts(sce)?
The R reader silently fails to load raw. I know it's documented that the R reader is experimental and may produce different results to the Python reader, so I'm guessing all the ... arguments to readH5AD() don't work with the R reader? If that's the case, then adding some documentation and/or a warning/error if a user tries to go down this path would be helpful (happy to add this if my understanding is correct).
A bit tangential perhaps to zellkonverter, but is the formatting of AnnData/h5ad files a bit loose in the wild? The AnnData documentation refers to the layers element as being standard (https://anndata.readthedocs.io/en/latest/fileformat-prose.html#mappings) but this particular file doesn't have it. It means I couldn't use HDF5Array::H5ADMatrix() to explore the raw data because it expects/requires it in /layers/raw (this was raised by @ltla in https://github.com/theislab/zellkonverter/issues/57#issuecomment-944868119). Any sense of whether it's worth trying to modify HDF5Array::H5ADMatrix() to account for the potential lack of /layers group in a .h5ad file?

Thanks, Pete

suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(rhdf5))
suppressPackageStartupMessages(library(zellkonverter))

# The 'Combined samples' .h5ad file from 
# https://cellxgene.cziscience.com/collections/62e8f058-9c37-48bc-9200-e767f318a8ec
path <- "local.h5ad"

# The raw count matrix is stored in `/raw`; see
# https://github.com/dpeerlab/SCLC_atlas-HTAN/issues/3#issuecomment-1152523012
h5ls(path)
#>                    group                               name       otype  dclass
#> 0                      /                                  X   H5I_GROUP        
#> 1                     /X                               data H5I_DATASET   FLOAT
#> 2                     /X                            indices H5I_DATASET INTEGER
#> 3                     /X                             indptr H5I_DATASET INTEGER
#> 4                      /                                obs   H5I_GROUP        
#> 5                   /obs                               Cell H5I_DATASET  STRING
#> 6                   /obs                              H_knn H5I_DATASET   FLOAT
#> 7                   /obs                           RBP_frac H5I_DATASET   FLOAT
#> 8                   /obs                       __categories   H5I_GROUP        
#> 9      /obs/__categories                              assay H5I_DATASET  STRING
#> 10     /obs/__categories             assay_ontology_term_id H5I_DATASET  STRING
#> 11     /obs/__categories                              batch H5I_DATASET  STRING
#> 12     /obs/__categories                          cell_type H5I_DATASET  STRING
#> 13     /obs/__categories                   cell_type_coarse H5I_DATASET  STRING
#> 14     /obs/__categories                     cell_type_fine H5I_DATASET  STRING
#> 15     /obs/__categories                  cell_type_general H5I_DATASET  STRING
#> 16     /obs/__categories                      cell_type_med H5I_DATASET  STRING
#> 17     /obs/__categories         cell_type_ontology_term_id H5I_DATASET  STRING
#> 18     /obs/__categories                           clusters H5I_DATASET  STRING
#> 19     /obs/__categories                  development_stage H5I_DATASET  STRING
#> 20     /obs/__categories development_stage_ontology_term_id H5I_DATASET  STRING
#> 21     /obs/__categories                            disease H5I_DATASET  STRING
#> 22     /obs/__categories           disease_ontology_term_id H5I_DATASET  STRING
#> 23     /obs/__categories                          ethnicity H5I_DATASET  STRING
#> 24     /obs/__categories         ethnicity_ontology_term_id H5I_DATASET  STRING
#> 25     /obs/__categories                              histo H5I_DATASET  STRING
#> 26     /obs/__categories                           organism H5I_DATASET  STRING
#> 27     /obs/__categories          organism_ontology_term_id H5I_DATASET  STRING
#> 28     /obs/__categories                            patient H5I_DATASET  STRING
#> 29     /obs/__categories                          procedure H5I_DATASET  STRING
#> 30     /obs/__categories                                sex H5I_DATASET  STRING
#> 31     /obs/__categories               sex_ontology_term_id H5I_DATASET  STRING
#> 32     /obs/__categories                             tissue H5I_DATASET  STRING
#> 33     /obs/__categories            tissue_ontology_term_id H5I_DATASET  STRING
#> 34     /obs/__categories                          treatment H5I_DATASET  STRING
#> 35                  /obs                              assay H5I_DATASET INTEGER
#> 36                  /obs             assay_ontology_term_id H5I_DATASET INTEGER
#> 37                  /obs                              batch H5I_DATASET INTEGER
#> 38                  /obs                          cell_type H5I_DATASET INTEGER
#> 39                  /obs                   cell_type_coarse H5I_DATASET INTEGER
#> 40                  /obs                     cell_type_fine H5I_DATASET INTEGER
#> 41                  /obs                  cell_type_general H5I_DATASET INTEGER
#> 42                  /obs                      cell_type_med H5I_DATASET INTEGER
#> 43                  /obs         cell_type_ontology_term_id H5I_DATASET INTEGER
#> 44                  /obs                           clusters H5I_DATASET INTEGER
#> 45                  /obs                  development_stage H5I_DATASET INTEGER
#> 46                  /obs development_stage_ontology_term_id H5I_DATASET INTEGER
#> 47                  /obs                            disease H5I_DATASET INTEGER
#> 48                  /obs           disease_ontology_term_id H5I_DATASET INTEGER
#> 49                  /obs                          ethnicity H5I_DATASET INTEGER
#> 50                  /obs         ethnicity_ontology_term_id H5I_DATASET INTEGER
#> 51                  /obs                              histo H5I_DATASET INTEGER
#> 52                  /obs                    is_primary_data H5I_DATASET    ENUM
#> 53                  /obs                            libsize H5I_DATASET   FLOAT
#> 54                  /obs                          mito_frac H5I_DATASET   FLOAT
#> 55                  /obs                             ngenes H5I_DATASET INTEGER
#> 56                  /obs                           organism H5I_DATASET INTEGER
#> 57                  /obs          organism_ontology_term_id H5I_DATASET INTEGER
#> 58                  /obs                            patient H5I_DATASET INTEGER
#> 59                  /obs                          procedure H5I_DATASET INTEGER
#> 60                  /obs                                sex H5I_DATASET INTEGER
#> 61                  /obs               sex_ontology_term_id H5I_DATASET INTEGER
#> 62                  /obs                             tissue H5I_DATASET INTEGER
#> 63                  /obs            tissue_ontology_term_id H5I_DATASET INTEGER
#> 64                  /obs                          treatment H5I_DATASET INTEGER
#> 65                     /                               obsm   H5I_GROUP        
#> 66                 /obsm                              X_pca H5I_DATASET   FLOAT
#> 67                 /obsm                             X_umap H5I_DATASET   FLOAT
#> 68                     /                                raw   H5I_GROUP        
#> 69                  /raw                                  X   H5I_GROUP        
#> 70                /raw/X                               data H5I_DATASET   FLOAT
#> 71                /raw/X                            indices H5I_DATASET INTEGER
#> 72                /raw/X                             indptr H5I_DATASET INTEGER
#> 73                  /raw                                var   H5I_GROUP        
#> 74              /raw/var                       __categories   H5I_GROUP        
#> 75 /raw/var/__categories                    feature_biotype H5I_DATASET  STRING
#> 76              /raw/var                    feature_biotype H5I_DATASET INTEGER
#> 77              /raw/var                           gene_ids H5I_DATASET  STRING
#> 78                     /                                uns   H5I_GROUP        
#> 79                  /uns                    X_normalization H5I_DATASET  STRING
#> 80                  /uns                  default_embedding H5I_DATASET  STRING
#> 81                  /uns                     schema_version H5I_DATASET  STRING
#> 82                  /uns                              title H5I_DATASET  STRING
#> 83                     /                                var   H5I_GROUP        
#> 84                  /var                       __categories   H5I_GROUP        
#> 85     /var/__categories                    feature_biotype H5I_DATASET  STRING
#> 86     /var/__categories                       feature_name H5I_DATASET  STRING
#> 87     /var/__categories                  feature_reference H5I_DATASET  STRING
#> 88                  /var                    feature_biotype H5I_DATASET INTEGER
#> 89                  /var                feature_is_filtered H5I_DATASET    ENUM
#> 90                  /var                       feature_name H5I_DATASET INTEGER
#> 91                  /var                  feature_reference H5I_DATASET INTEGER
#> 92                  /var                           gene_ids H5I_DATASET  STRING
#>            dim
#> 0             
#> 1    300868962
#> 2    300868962
#> 3       147138
#> 4             
#> 5       147137
#> 6       147137
#> 7       147137
#> 8             
#> 9            3
#> 10           3
#> 11          52
#> 12          11
#> 13           4
#> 14          24
#> 15           3
#> 16          16
#> 17          11
#> 18          56
#> 19          26
#> 20          26
#> 21           3
#> 22           3
#> 23           3
#> 24           3
#> 25           3
#> 26           1
#> 27           1
#> 28          43
#> 29           3
#> 30           2
#> 31           2
#> 32           8
#> 33           8
#> 34           9
#> 35      147137
#> 36      147137
#> 37      147137
#> 38      147137
#> 39      147137
#> 40      147137
#> 41      147137
#> 42      147137
#> 43      147137
#> 44      147137
#> 45      147137
#> 46      147137
#> 47      147137
#> 48      147137
#> 49      147137
#> 50      147137
#> 51      147137
#> 52      147137
#> 53      147137
#> 54      147137
#> 55      147137
#> 56      147137
#> 57      147137
#> 58      147137
#> 59      147137
#> 60      147137
#> 61      147137
#> 62      147137
#> 63      147137
#> 64      147137
#> 65            
#> 66 50 x 147137
#> 67  2 x 147137
#> 68            
#> 69            
#> 70   300868962
#> 71   300868962
#> 72      147138
#> 73            
#> 74            
#> 75           1
#> 76       22447
#> 77       22447
#> 78            
#> 79       ( 0 )
#> 80       ( 0 )
#> 81       ( 0 )
#> 82       ( 0 )
#> 83            
#> 84            
#> 85           1
#> 86       22447
#> 87           1
#> 88       22447
#> 89       22447
#> 90       22447
#> 91       22447
#> 92       22447

# Why is raw stuck in as an altExp rather than an assay?
sce_python_raw_altexp <- readH5AD(
  file = path,
  X_name = NULL,
  use_hdf5 = TRUE,
  reader = "python",
  raw = TRUE)
sce_python_raw_altexp
#> class: SingleCellExperiment 
#> dim: 22447 147137 
#> metadata(4): X_normalization default_embedding schema_version title
#> assays(1): X
#> rownames(22447): ENSG00000121410 ENSG00000148584 ... ENSG00000074755
#>   ENSG00000036549
#> rowData names(4): feature_biotype feature_is_filtered feature_name
#>   feature_reference
#> colnames(147137): RU1311A_T_1_165945547864806 RU1215_192110488599350
#>   ... RU1181B_236168014327141 RU1145_120772933872502
#> colData names(32): ngenes libsize ... ethnicity development_stage
#> reducedDimNames(2): X_pca X_umap
#> mainExpName: NULL
#> altExpNames(1): raw
unname(assay(altExp(sce_python_raw_altexp)))
#> <22447 x 147137> sparse matrix of class H5SparseMatrix and type "double":
#>               [,1]      [,2]      [,3] ... [,147136] [,147137]
#>     [1,]         0         1         0   .         1         0
#>     [2,]         0         0         0   .         1         0
#>     [3,]         0         0        15   .         0         0
#>     [4,]         0         0         0   .         0         0
#>     [5,]         0         0         0   .         0         0
#>      ...         .         .         .   .         .         .
#> [22443,]         0         0         0   .         0         0
#> [22444,]         0         0         0   .         0         0
#> [22445,]         0         0         1   .         0         0
#> [22446,]         0         0         0   .         1         0
#> [22447,]         0         0         0   .         2         0

# The R reader silently fails to load the raw count matrix.
sce_R_raw_altexp <- readH5AD(
  file = path,
  X_name = NULL,
  use_hdf5 = TRUE,
  reader = "R",
  raw = TRUE)
sce_R_raw_altexp
#> class: SingleCellExperiment 
#> dim: 22447 147137 
#> metadata(4): X_normalization default_embedding schema_version title
#> assays(1): X
#> rownames: NULL
#> rowData names(5): feature_biotype feature_is_filtered feature_name
#>   feature_reference gene_ids
#> colnames: NULL
#> colData names(33): Cell H_knn ... tissue_ontology_term_id treatment
#> reducedDimNames(2): X_pca X_umap
#> mainExpName: NULL
#> altExpNames(0):
unname(assay(altExp(sce_R_raw_altexp)))
#> Error in h(simpleError(msg, call)): error in evaluating the argument 'obj' in selecting a method for function 'unname': error in evaluating the argument 'x' in selecting a method for function 'assay': no available entries for 'altExp(<SingleCellExperiment>, ...)'

# Aside: The structure of this h5ad file means we can't use 
#        HDF5Array::H5ADMatrix() because it expects/requires it in 
#        `/layers/raw`; see 
#        https://github.com/theislab/zellkonverter/issues/57#issuecomment-944868119
HDF5Array::H5ADMatrix(path, "raw")
#> Error in H5ADMatrixSeed(filepath, layer = layer): HDF5 object "/layers/raw" does not exist in this HDF5 file.

^{Created on 2022-06-17 by the reprex package (v2.0.1)}

Session info

``` r sessionInfo() #> R version 4.2.0 (2022-04-22) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: CentOS Linux 7 (Core) #> #> Matrix products: default #> BLAS: /stornext/System/data/apps/R/R-4.2.0/lib64/R/lib/libRblas.so #> LAPACK: /stornext/System/data/apps/R/R-4.2.0/lib64/R/lib/libRlapack.so #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] stats graphics utils stats4 methods base #> #> other attached packages: #> [1] zellkonverter_1.6.1 rhdf5_2.40.0 #> [3] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1 #> [5] Biobase_2.56.0 GenomicRanges_1.48.0 #> [7] GenomeInfoDb_1.32.2 IRanges_2.30.0 #> [9] S4Vectors_0.34.0 BiocGenerics_0.42.0 #> [11] MatrixGenerics_1.8.0 matrixStats_0.62.0 #> #> loaded via a namespace (and not attached): #> [1] Rcpp_1.0.8.3 compiler_4.2.0 highr_0.9 #> [4] XVector_0.36.0 basilisk.utils_1.8.0 rhdf5filters_1.8.0 #> [7] bitops_1.0-7 tools_4.2.0 grDevices_4.2.0 #> [10] zlibbioc_1.42.0 digest_0.6.29 jsonlite_1.8.0 #> [13] evaluate_0.15 lattice_0.20-45 png_0.1-7 #> [16] rlang_1.0.2 reprex_2.0.1 Matrix_1.4-1 #> [19] dir.expiry_1.4.0 DelayedArray_0.22.0 cli_3.3.0 #> [22] rstudioapi_0.13 filelock_1.0.2 parallel_4.2.0 #> [25] yaml_2.3.5 xfun_0.31 fastmap_1.1.0 #> [28] GenomeInfoDbData_1.2.8 withr_2.5.0 stringr_1.4.0 #> [31] knitr_1.39 fs_1.5.2 datasets_4.2.0 #> [34] rprojroot_2.0.3 grid_4.2.0 here_1.0.1 #> [37] reticulate_1.25 glue_1.6.2 HDF5Array_1.24.0 #> [40] basilisk_1.8.0 rmarkdown_2.14 Rhdf5lib_1.18.2 #> [43] magrittr_2.0.3 htmltools_0.5.2 stringi_1.7.6 #> [46] RCurl_1.98-1.6 ```

lazappi commented 2 years ago

Hi @PeteHaitch

I will try to answer your questions but definitely happy to discuss further if needed.

Why is raw stuck in as an altExp rather than an assay when using readH5AD()? For the example below, the raw data has the same dimensions as the X data and is essentially the counts() data I'd expect to find in a SingleCellExperiment. Ahh, the figure in ?AnnData2SCE suggests raw may not have the same dimensions as X because the latter may have undergone filtering, so I guess this precludes readH5AD() being able to assume raw is analogous to counts(sce)?

Yeah we can't assume the dimensions match. The .raw slot is basically a reduced AnnData with some slots missing. When subsetting the main AnnData it is subsetted by observations (cells) but not by variables (features). For that reason altExp seemed to be the best fit.

The use case for .raw is supposed to be something along the lines of storing the original data before filtering genes/selecting highly variable genes etc. so you can get the full dataset back later if needed. I think it's confusing for users and try to discourage its use but some people like it and there are a lot of examples out there that use it.

The R reader silently fails to load raw. I know it's documented that the R reader is experimental and may produce different results to the Python reader, so I'm guessing all the ... arguments to readH5AD() don't work with the R reader? If that's the case, then adding some documentation and/or a warning/error if a user tries to go down this path would be helpful (happy to add this if my understanding is correct).

Yeah the R reader is definitely underdeveloped (hoping to get some funding to work on that 🤞🏻). When it was added it was roughly equivalent but I have done a fair bit of work on the Python side which hasn't been carried over. That could definitely be better documented though.

A bit tangential perhaps to zellkonverter, but is the formatting of AnnData/h5ad files a bit loose in the wild? The AnnData documentation refers to the layers element as being standard (anndata.readthedocs.io/en/latest/fileformat-prose.html#mappings) but this particular file doesn't have it. It means I couldn't use HDF5Array::H5ADMatrix() to explore the raw data because it expects/requires it in /layers/raw (this was raised by @LTLA in Read alternative data with AnnData2SCE #57 (comment)). Any sense of whether it's worth trying to modify HDF5Array::H5ADMatrix() to account for the potential lack of /layers group in a .h5ad file?

Do you know what version of Python anndata was used to write the file? I think how layers are stored might be one of the things that was changed in v0.8.0. Maybe @ivirshup could jump in and help clarify this?

I don't know a lot about the internals of HDF5 files TBH. If there is a case that HDF5Array::H5ADMatrix() can't handle we should look at contributing something to address that.

ivirshup commented 2 years ago

Do you know what version of Python anndata was used to write the file?

That looks like it's from the 0.7.x release series at least.

You can see the docs for that format under the 0.7.8 docs (which I just made publicly visible): https://anndata.readthedocs.io/en/0.7.8/fileformat-prose.html

PeteHaitch commented 2 years ago

@lazappi : Thanks, Luke. Fingers crossed for the funding (sorry for the slow reply, COVID got me). @ivirshup: Thanks for making the docs public. So should a 'valid' anndata/.h5ad object from that version have a /layers group or is it optional?

lazappi commented 2 years ago

@ivirshup Any input on the layers question? ☝🏻

amoyguang1 commented 1 year ago

Following this topic, i have another questionas well.

My task is to convert anndata into sce. Image i have 20k genes in anndata, and 2k were slected as high varibale genes, When i convert adata into sce directly, i have 2k genes with both raw count in assay 'counts' and normalised count in assay 'X'. iF I would like to get the 20k genes, i use adata.raw.to_adata, to get them and save it as another anndata. However, when it is converted to sce, only normalised count are found but not raw count.

What shall i do here to get both counts and normalised counts for the 20k genes?

lazappi commented 1 year ago

@amoyguang1 Please open a separate issue for this

theislab / zellkonverter

Reading `raw` data and differences between Python and R reader #67