readH5AD issue with large dataset

MarcElosua commented 2 years ago

Hi @lazappi,

Thank you so much for putting together such an amazing tool! I've used readH5AD in the past without any issues, and the examples still run fine for me. The problem appears when I try to read the raw .h5 file from the cellranger output. When I set use_hdf5 = TRUE, I get the following error, which I don't understand:

sce <- "{cr_sn}/jobs/out_merged/multiplexed1/outs/raw_feature_bc_matrix.h5" %>%
  glue() %>%
  here() %>%
  readH5AD(file = ., use_hdf5 = TRUE)

Error in py_get_attr_impl(x, name, silent) : KeyError: "Unable to open object (object 'X' doesn't exist)"

Furthermore, when I run it with the default parameters I get the following error:

sce <- "{cr_sn}/jobs/out_merged/multiplexed1/outs/raw_feature_bc_matrix/" %>%
  glue() %>%
  here() %>%
  read10xCounts()

Error in py_call_impl(callable, dots$args, dots$keywords) : TypeError: __init__() got an unexpected keyword argument 'matrix'

Do you have any clue on what may be going on?

sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] SCrafty_0.1.0 glue_1.6.2 here_1.0.1
[4] patchwork_1.1.1 zellkonverter_1.6.3 DropletUtils_1.16.0
[7] scater_1.24.0 scran_1.24.0 scuttle_1.6.2
[10] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1 Biobase_2.56.0
[13] GenomicRanges_1.48.0 GenomeInfoDb_1.32.2 IRanges_2.30.0
[16] S4Vectors_0.34.0 BiocGenerics_0.42.0 MatrixGenerics_1.8.1
[19] matrixStats_0.62.0 scales_1.2.0 forcats_0.5.1
[22] stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
[25] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8
[28] ggplot2_3.3.6 tidyverse_1.3.2

loaded via a namespace (and not attached):
[1] googledrive_2.0.0 ggbeeswarm_0.6.0 colorspace_2.0-3
[4] ellipsis_0.3.2 rprojroot_2.0.3 bluster_1.6.0
[7] XVector_0.36.0 BiocNeighbors_1.14.0 fs_1.5.2
[10] rstudioapi_0.13 bit64_4.0.5 ggrepel_0.9.1
[13] fansi_1.0.3 lubridate_1.8.0 xml2_1.3.3
[16] R.methodsS3_1.8.2 codetools_0.2-18 sparseMatrixStats_1.8.0
[19] knitr_1.39 jsonlite_1.8.0 broom_1.0.0
[22] cluster_2.1.2 dbplyr_2.2.1 png_0.1-7
[25] R.oo_1.25.0 HDF5Array_1.24.1 compiler_4.2.1
[28] httr_1.4.3 basilisk_1.8.0 dqrng_0.3.0
[31] backports_1.4.1 assertthat_0.2.1 Matrix_1.3-4
[34] fastmap_1.1.0 gargle_1.2.0 limma_3.52.2
[37] cli_3.3.0 BiocSingular_1.12.0 htmltools_0.5.3
[40] tools_4.2.1 rsvd_1.0.5 igraph_1.3.4
[43] gtable_0.3.0 GenomeInfoDbData_1.2.8 Rcpp_1.0.9
[46] cellranger_1.1.0 rhdf5filters_1.8.0 vctrs_0.4.1
[49] DelayedMatrixStats_1.18.0 xfun_0.31 beachmat_2.12.0
[52] rvest_1.0.2 lifecycle_1.0.1 irlba_2.3.5
[55] statmod_1.4.36 googlesheets4_1.0.0 edgeR_3.38.1
[58] basilisk.utils_1.8.0 zlibbioc_1.42.0 vroom_1.5.7
[61] hms_1.1.1 parallel_4.2.1 rhdf5_2.40.0
[64] yaml_2.3.5 reticulate_1.25 gridExtra_2.3
[67] stringi_1.7.8 ScaledMatrix_1.4.0 filelock_1.0.2
[70] BiocParallel_1.30.3 rlang_1.0.4 pkgconfig_2.0.3
[73] bitops_1.0-7 evaluate_0.15 lattice_0.20-45
[76] Rhdf5lib_1.18.2 bit_4.0.4 cowplot_1.1.1
[79] tidyselect_1.1.2 magrittr_2.0.3 R6_2.5.1
[82] generics_0.1.3 metapod_1.4.0 DelayedArray_0.22.0
[85] DBI_1.1.3 pillar_1.8.0 haven_2.5.0
[88] withr_2.5.0 RCurl_1.98-1.8 dir.expiry_1.4.0
[91] modelr_0.1.8 crayon_1.5.1 utf8_1.2.2
[94] tzdb_0.3.0 rmarkdown_2.14 viridis_0.6.2
[97] locfit_1.5-9.6 grid_4.2.1 readxl_1.4.0
[100] reprex_2.0.1 digest_0.6.29 R.utils_2.12.0
[103] munsell_0.5.0 viridisLite_0.4.0 beeswarm_0.4.0
[106] vipor_0.4.5

PeteHaitch commented 2 years ago

I think you might be confused by the various file formats:

1. HDF5 (.h5) is a general file format. Different tools may produce HDF5 files with different structures. Within Bioconductor, the tools for reading these are in the rhdf5 package (e.g., rhdf5::h5read()).
2. CellRanger produces some HDF5 files that have a CellRanger-specific structure. Within Bioconductor, the preferred tools for reading these are in the DropletUtils package (e.g., DropletUtils::read10xCounts()) or the HDF5Array package (e.g., HDF5Array::TENxMatrix()). The choice depends a bit on what you want to do downstream, but for most analyses of scRNA-seq data you will probably want DropletUtils::read10xCounts().
3. H5AD (.h5ad) files are HDF5 files with a specific structure, originally designed for storing the Python-based AnnData data structure. Within Bioconductor, the preferred tool for reading these is in the zellkonverter package (e.g., zellkonverter::readH5AD()).
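
If you are unsure which flavour of HDF5 file you have, listing its top-level objects with rhdf5::h5ls() is usually enough to tell. The sketch below is only an illustration: guess_h5_flavour() is a hypothetical helper (not part of any package), the two toy files are built on the fly, and the rule it applies is an assumption (H5AD files carry "X", "obs" and "var" at the top level, while recent CellRanger matrix files carry a "matrix" group; older CellRanger versions name that group after the genome instead). It also matches the two errors above, which complain about a missing 'X' and an unexpected 'matrix'.

```r
library(rhdf5)

# Hypothetical helper: guess the flavour of an HDF5 file from the names of
# its top-level objects. The naming rules are assumptions, see the text above.
guess_h5_flavour <- function(path) {
  top <- h5ls(path, recursive = FALSE)$name
  if (all(c("X", "obs", "var") %in% top)) {
    "h5ad"
  } else if ("matrix" %in% top) {
    "cellranger"
  } else {
    "unknown"
  }
}

# Toy file that mimics the top-level layout of an H5AD file.
h5ad_like <- tempfile(fileext = ".h5ad")
h5createFile(h5ad_like)
h5write(matrix(0, 2, 2), h5ad_like, "X")
h5createGroup(h5ad_like, "obs")
h5createGroup(h5ad_like, "var")

# Toy file that mimics the top-level layout of a CellRanger matrix file.
tenx_like <- tempfile(fileext = ".h5")
h5createFile(tenx_like)
h5createGroup(tenx_like, "matrix")
h5write(c("AAAC-1", "AAAG-1"), tenx_like, "matrix/barcodes")

guess_h5_flavour(h5ad_like)
guess_h5_flavour(tenx_like)
```

On a real file, calling h5ls() directly on the path shows the same information in full, which is often the quickest sanity check before picking a reader.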

It appears you want to read the CellRanger outputs in R (these aren't H5AD files), for which you should use an option from (2), i.e., something like:

# NOTE: Skipping all the piping-stuff.
library(DropletUtils)
library(here)
sce <- read10xCounts(here(cr_sn, "jobs/out_merged/multiplexed1/outs/raw_feature_bc_matrix.h5"))

Note that DropletUtils::read10xCounts() requires the path to the .h5 file (not the directory containing it) when using the CellRanger-HDF5 files as input (see help("read10xCounts", "DropletUtils") for details).
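
To make the file-versus-directory distinction concrete, here is a minimal, self-contained sketch that writes a toy count matrix in both CellRanger-style layouts with DropletUtils::write10xCounts() and reads each one back with read10xCounts(). The toy matrix, gene IDs, barcodes and temporary paths are made up for illustration; with real data you would only need the read10xCounts() calls.

```r
library(DropletUtils)

# Made-up toy counts: 4 genes x 3 cells, stored as a sparse matrix.
counts_mat <- Matrix::Matrix(
  matrix(c(0, 2, 0, 1,
           3, 0, 0, 0,
           0, 0, 5, 1), nrow = 4),
  sparse = TRUE
)
rownames(counts_mat) <- paste0("GENE", 1:4)
colnames(counts_mat) <- paste0("CELL", 1:3, "-1")

# Layout 1: a single CellRanger-style HDF5 file.
h5_path <- tempfile(fileext = ".h5")
write10xCounts(h5_path, counts_mat, type = "HDF5", version = "3")

# Layout 2: a directory holding the matrix, features and barcodes files.
dir_path <- tempfile()
write10xCounts(dir_path, counts_mat, type = "sparse", version = "3")

# Pass the path to the .h5 file itself for the HDF5 layout ...
sce_h5 <- read10xCounts(h5_path)

# ... and the path to the directory for the sparse layout.
sce_dir <- read10xCounts(dir_path)

dim(sce_h5)
dim(sce_dir)
```

Both calls should give a SingleCellExperiment with the same dimensions and counts; the HDF5 route keeps the counts on disk as a delayed matrix, which is the main practical difference for large datasets.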

MarcElosua commented 2 years ago

@PeteHaitch Thank you so much for the explanation! I wasn't aware of the different structures within HDF5 files!

theislab / zellkonverter

readH5AD issue with large dataset #69