Open PeteHaitch opened 2 years ago
Hi @PeteHaitch
I will try to answer your questions but definitely happy to discuss further if needed.
- Why is `raw` stuck in as an altExp rather than an assay when using `readH5AD()`? For the example below, the `raw` data has the same dimensions as the `X` data and is essentially the `counts()` data I'd expect to find in a SingleCellExperiment. Ahh, the figure in `?AnnData2SCE` suggests `raw` may not have the same dimensions as `X` because the latter may have undergone filtering, so I guess this precludes `readH5AD()` being able to assume `raw` is analogous to `counts(sce)`?
Yeah we can't assume the dimensions match. The `.raw` slot is basically a reduced `AnnData` with some slots missing. When the main AnnData is subset, `.raw` is subset by observations (cells) but not by variables (features). For that reason `altExp` seemed to be the best fit.
The use case for `.raw` is supposed to be something along the lines of storing the original data before filtering genes/selecting highly variable genes etc. so you can get the full dataset back later if needed. I think it's confusing for users and try to discourage its use, but some people like it and there are a lot of examples out there that use it.
- The R reader silently fails to load `raw`. I know it's documented that the R reader is experimental and may produce different results to the Python reader, so I'm guessing all the `...` arguments to `readH5AD()` don't work with the R reader? If that's the case, then adding some documentation and/or a warning/error if a user tries to go down this path would be helpful (happy to add this if my understanding is correct).
Yeah the R reader is definitely underdeveloped (hoping to get some funding to work on that 🤞🏻). When it was added it was roughly equivalent but I have done a fair bit of work on the Python side which hasn't been carried over. That could definitely be better documented though.
- A bit tangential perhaps to zellkonverter, but is the formatting of AnnData/`h5ad` files a bit loose in the wild? The AnnData documentation refers to the `layers` element as being standard (anndata.readthedocs.io/en/latest/fileformat-prose.html#mappings) but this particular file doesn't have it. It means I couldn't use `HDF5Array::H5ADMatrix()` to explore the `raw` data because it expects/requires it in `/layers/raw` (this was raised by @LTLA in Read alternative data with AnnData2SCE #57 (comment)). Any sense of whether it's worth trying to modify `HDF5Array::H5ADMatrix()` to account for the potential lack of `/layers` group in a `.h5ad` file?
Do you know what version of Python anndata was used to write the file? I think how layers are stored might be one of the things that was changed in `v0.8.0`. Maybe @ivirshup could jump in and help clarify this?
I don't know a lot about the internals of HDF5 files TBH. If there is a case that `HDF5Array::H5ADMatrix()` can't handle we should look at contributing something to address that.
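For anyone wanting to check whether a given file has a top-level `layers` group before handing it to a reader, here is a rough sketch using `h5py` (assumed to be installed). The helper names `top_level_entries` and `has_layers_group` are hypothetical, and the file written below is a hand-built toy HDF5 file, not a valid `.h5ad`; it says nothing about what any particular anndata version writes.

```python
import os
import tempfile

import h5py


def top_level_entries(path):
    """Return the sorted names of the top-level groups/datasets in an HDF5 file."""
    with h5py.File(path, "r") as f:
        return sorted(f.keys())


def has_layers_group(path):
    """Return True if the file has a top-level group called 'layers'."""
    with h5py.File(path, "r") as f:
        return "layers" in f and isinstance(f["layers"], h5py.Group)


# Hand-built toy file (NOT a real .h5ad): it has 'raw' but no 'layers'.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.h5")
with h5py.File(toy_path, "w") as f:
    f.create_dataset("X", data=[[1.0, 2.0], [3.0, 4.0]])
    f.create_group("obs")
    f.create_group("var")
    f.create_group("raw").create_dataset("X", data=[[1.0, 2.0, 0.0], [3.0, 4.0, 5.0]])

print(top_level_entries(toy_path))
print(has_layers_group(toy_path))
```

The same two helpers can be pointed at a real `.h5ad` path to see which groups are present in that file.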
Do you know what version of Python anndata was used to write the file?
That looks like it's from the 0.7.x release series at least.
You can see the docs for that format under the 0.7.8 docs (which I just made publicly visible): https://anndata.readthedocs.io/en/0.7.8/fileformat-prose.html
@lazappi : Thanks, Luke. Fingers crossed for the funding (sorry for the slow reply, COVID got me).
@ivirshup: Thanks for making the docs public. So should a 'valid' anndata/`.h5ad` object from that version have a `/layers` group or is it optional?
@ivirshup Any input on the layers question? ☝🏻
Following this topic, I have another question as well.
My task is to convert AnnData into SCE. Imagine I have 20k genes in the AnnData object, and 2k were selected as highly variable genes. When I convert `adata` into an SCE directly, I get 2k genes with both raw counts in assay 'counts' and normalised counts in assay 'X'. If I would like to get the 20k genes, I use `adata.raw.to_adata()` to get them and save the result as another AnnData object. However, when that is converted to an SCE, only the normalised counts are found, not the raw counts.
What should I do here to get both counts and normalised counts for the 20k genes?
@amoyguang1 Please open a separate issue for this
Thanks for making it possible to read `h5ad` files into R.
I had a few issues/questions after trying to get the raw count matrix from a public `h5ad` file. Some of these points were touched on in https://github.com/theislab/zellkonverter/issues/57 and https://github.com/theislab/zellkonverter/issues/63 but I hoped to re-visit and clarify some things. Apologies if these questions are naive or misguided; I'm not very familiar with AnnData format and the structure of the particular `h5ad` file I was working with doesn't seem to match with that described in https://anndata.readthedocs.io/en/latest/fileformat-prose.html. The reprex demonstrates the points but I've summarised them below:
- Why is `raw` stuck in as an altExp rather than an assay when using `readH5AD()`? For the example below, the `raw` data has the same dimensions as the `X` data and is essentially the `counts()` data I'd expect to find in a SingleCellExperiment. Ahh, the figure in `?AnnData2SCE` suggests `raw` may not have the same dimensions as `X` because the latter may have undergone filtering, so I guess this precludes `readH5AD()` being able to assume `raw` is analogous to `counts(sce)`?
- The R reader silently fails to load `raw`. I know it's documented that the R reader is experimental and may produce different results to the Python reader, so I'm guessing all the `...` arguments to `readH5AD()` don't work with the R reader? If that's the case, then adding some documentation and/or a warning/error if a user tries to go down this path would be helpful (happy to add this if my understanding is correct).
- A bit tangential perhaps to zellkonverter, but is the formatting of AnnData/`h5ad` files a bit loose in the wild? The AnnData documentation refers to the `layers` element as being standard (https://anndata.readthedocs.io/en/latest/fileformat-prose.html#mappings) but this particular file doesn't have it. It means I couldn't use `HDF5Array::H5ADMatrix()` to explore the `raw` data because it expects/requires it in `/layers/raw` (this was raised by @ltla in https://github.com/theislab/zellkonverter/issues/57#issuecomment-944868119). Any sense of whether it's worth trying to modify `HDF5Array::H5ADMatrix()` to account for the potential lack of `/layers` group in a `.h5ad` file?
Thanks, Pete
Created on 2022-06-17 by the reprex package (v2.0.1)
Session info
``` r
sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#>
#> Matrix products: default
#> BLAS:   /stornext/System/data/apps/R/R-4.2.0/lib64/R/lib/libRblas.so
#> LAPACK: /stornext/System/data/apps/R/R-4.2.0/lib64/R/lib/libRlapack.so
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats     graphics  utils     stats4    methods   base
#>
#> other attached packages:
#>  [1] zellkonverter_1.6.1         rhdf5_2.40.0
#>  [3] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
#>  [5] Biobase_2.56.0              GenomicRanges_1.48.0
#>  [7] GenomeInfoDb_1.32.2         IRanges_2.30.0
#>  [9] S4Vectors_0.34.0            BiocGenerics_0.42.0
#> [11] MatrixGenerics_1.8.0        matrixStats_0.62.0
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3           compiler_4.2.0         highr_0.9
#>  [4] XVector_0.36.0         basilisk.utils_1.8.0   rhdf5filters_1.8.0
#>  [7] bitops_1.0-7           tools_4.2.0            grDevices_4.2.0
#> [10] zlibbioc_1.42.0        digest_0.6.29          jsonlite_1.8.0
#> [13] evaluate_0.15          lattice_0.20-45        png_0.1-7
#> [16] rlang_1.0.2            reprex_2.0.1           Matrix_1.4-1
#> [19] dir.expiry_1.4.0       DelayedArray_0.22.0    cli_3.3.0
#> [22] rstudioapi_0.13        filelock_1.0.2         parallel_4.2.0
#> [25] yaml_2.3.5             xfun_0.31              fastmap_1.1.0
#> [28] GenomeInfoDbData_1.2.8 withr_2.5.0            stringr_1.4.0
#> [31] knitr_1.39             fs_1.5.2               datasets_4.2.0
#> [34] rprojroot_2.0.3        grid_4.2.0             here_1.0.1
#> [37] reticulate_1.25        glue_1.6.2             HDF5Array_1.24.0
#> [40] basilisk_1.8.0         rmarkdown_2.14         Rhdf5lib_1.18.2
#> [43] magrittr_2.0.3         htmltools_0.5.2        stringi_1.7.6
#> [46] RCurl_1.98-1.6
```