Issue loading H5AD distributed by Theis lab

GabrielHoffman commented 2 years ago

Here is an error report for a special case, but I figure it is part of a more general issue...

I look a quick look at the Human Lung Cell Atlas from the Theis group, but I had an issue loading it with zellkonverter:

URL: https://beta.fastgenomics.org/datasets/detail-dataset-427f1eee6dd44f50bae1ab13f0f3c6a9#Files

File: HLCA_v1_extended_no_counts.h5ad

Error:

> library(zellkonverter)
> 
> file = "HLCA_v1_extended_no_counts.h5ad"
> sce = readH5AD(file, use_hdf5=TRUE)   
Error in FUN(X[[i]], ...) : 
  each list element in the supplied 'dimnames' must be NULL or a
  character vector

There error is the same for zellkonverter 1.7.0 + R 4.2.0 and zellkonverter 1.6.0 + R 4.1.0.

Traceback:

> traceback()
17: stop(wmsg("each list element in the supplied 'dimnames' ", "must be NULL or a character vector"))
16: FUN(X[[i]], ...)
15: lapply(setNames(seq_len(ndim), names(dimnames)), function(along) {
        dn <- dimnames[[along]]
        if (is.null(dn)) 
            return(dn)
        if (!(is.vector(dn) && is.atomic(dn) || is.factor(dn))) 
            stop(wmsg("each list element in the supplied 'dimnames' ", 
                "must be NULL or a character vector"))
        if (!is.character(dn)) 
            dn <- as.character(dn)
        if (length(dn) != dim[[along]]) 
            stop(wmsg("length of 'dimnames[[", along, "]]' ", "(", 
                length(dn), ") must equal the ", "array extent (", 
                dim[[along]], ")"))
        dn
    })
14: lapply(setNames(seq_len(ndim), names(dimnames)), function(along) {
        dn <- dimnames[[along]]
        if (is.null(dn)) 
            return(dn)
        if (!(is.vector(dn) && is.atomic(dn) || is.factor(dn))) 
            stop(wmsg("each list element in the supplied 'dimnames' ", 
                "must be NULL or a character vector"))
        if (!is.character(dn)) 
            dn <- as.character(dn)
        if (length(dn) != dim[[along]]) 
            stop(wmsg("length of 'dimnames[[", along, "]]' ", "(", 
                length(dn), ") must equal the ", "array extent (", 
                dim[[along]], ")"))
        dn
    })
13: normarg_dimnames(dimnames, seed_dim)
12: new_DelayedSetDimnames(x@seed, dimnames)
11: stash_DelayedSetDimnames(x, value)
10: `dimnames<-`(`*tmp*`, value = dn)
9: `dimnames<-`(`*tmp*`, value = dn)
8: `rownames<-`(`*tmp*`, value = adata$var_names$to_list())
7: AnnData2SCE(adata, X_name = X_name, hdf5_backed = backed, verbose = verbose, 
       ...)
6: (function (file, X_name = NULL, backed = FALSE, verbose = NULL, 
       ...) 
   {
       .ui_info("Using the {.field Python} reader")
       anndata <- import("anndata")
       .ui_step("Reading {.file {.trim_path(file)}}", msg_done = "Read {.file {.trim_path(file)}}", 
           spinner = TRUE)
       adata <- anndata$read_h5ad(file, backed = if (backed) 
           "r"
       else FALSE)
       cli::cli_progress_done()
       AnnData2SCE(adata, X_name = X_name, hdf5_backed = backed, 
           verbose = verbose, ...)
   })(file = "/sc/arion/projects/CommonMind/hoffman/scRNAseq_data/HLCA/HLCA_v1_extended_no_counts.h5ad", 
       X_name = NULL, backed = TRUE, verbose = NULL)
5: do.call(.basilisk.fun, .basilisk.args)
4: evalq(do.call(.basilisk.fun, .basilisk.args), envir = proc, enclos = proc)
3: evalq(do.call(.basilisk.fun, .basilisk.args), envir = proc, enclos = proc)
2: basiliskRun(env = env, fun = .H5ADreader, file = file, X_name = X_name, 
       backed = use_hdf5, verbose = verbose, ...)
1: readH5AD(file, use_hdf5 = TRUE)

```r sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /hpc/packages/minerva-centos7/intel/parallel_studio_xe_2019/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] zellkonverter_1.7.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.8 XVector_0.32.0 [3] GenomicRanges_1.44.0 BiocGenerics_0.38.0 [5] zlibbioc_1.38.0 IRanges_2.26.0 [7] here_1.0.1 lattice_0.20-44 [9] GenomeInfoDb_1.28.4 tools_4.1.0 [11] SummarizedExperiment_1.22.0 parallel_4.1.0 [13] grid_4.1.0 rhdf5_2.36.0 [15] Biobase_2.52.0 png_0.1-7 [17] HDF5Array_1.20.0 cli_3.2.0 [19] basilisk_1.4.0 matrixStats_0.62.0 [21] rprojroot_2.0.2 Matrix_1.4-0 [23] GenomeInfoDbData_1.2.6 dir.expiry_1.0.0 [25] Rhdf5lib_1.14.0 rhdf5filters_1.4.0 [27] S4Vectors_0.30.2 bitops_1.0-7 [29] basilisk.utils_1.4.0 RCurl_1.98-1.3 [31] SingleCellExperiment_1.14.1 DelayedArray_0.18.0 [33] compiler_4.1.0 filelock_1.0.2 [35] MatrixGenerics_1.4.3 stats4_4.1.0 [37] jsonlite_1.8.0 reticulate_1.24 ``` <\details>

lazappi commented 2 years ago

This turned out to be a fairly simple fix so should be working now in release (v1.6.2) and devel (v1.7.2)

GabrielHoffman commented 2 years ago

The new version doesn't crash, but it also doesn't import the data. The data appears to have 0 rows:

> sce = readH5AD(file, use_hdf5=TRUE)  
Warning message:
The names of these selected obs columns have been modified to match R conventions: '3'_or_5'' -> 'X3._or_5.' 

> sce
class: SingleCellExperiment 
dim: 0 2232536 
metadata(1): ann_level_5_colors
assays(1): X
rownames(0):
rowData names(0):
colnames(2232536): TTAGGACTCTGCCCTA-WSSS8062679_meyer
  CCTATTAGTATAATGG_IPF1_tsukui ... ACGGCCATCCACTGGG_9_liao
  P3_6_TGAGCCGAGTGTCCCG
colData names(82): sample study_long ...
  ext_transf_ann_level_5_thresholded ext_transf_uncert_level_5
reducedDimNames(2): X_scanvi_emb X_umap
mainExpName: NULL
altExpNames(0):

> assay(sce,1)
<0 x 2232536> sparse matrix of class DelayedMatrix and type "double"

```r sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /hpc/packages/minerva-centos7/intel/parallel_studio_xe_2019/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices datasets utils [8] methods base other attached packages: [1] SingleCellExperiment_1.14.1 SummarizedExperiment_1.22.0 [3] Biobase_2.52.0 GenomicRanges_1.44.0 [5] GenomeInfoDb_1.28.4 IRanges_2.26.0 [7] S4Vectors_0.30.2 BiocGenerics_0.38.0 [9] MatrixGenerics_1.4.3 matrixStats_0.62.0 [11] zellkonverter_1.7.2 devtools_2.4.1 [13] usethis_2.0.1 loaded via a namespace (and not attached): [1] reticulate_1.24 remotes_2.4.0 HDF5Array_1.20.0 [4] purrr_0.3.4 lattice_0.20-44 rhdf5_2.36.0 [7] basilisk.utils_1.4.0 testthat_3.0.2 rlang_1.0.2 [10] pkgbuild_1.2.0 glue_1.6.2 withr_2.5.0 [13] sessioninfo_1.1.1 GenomeInfoDbData_1.2.6 lifecycle_1.0.1 [16] zlibbioc_1.38.0 memoise_2.0.0 callr_3.7.0 [19] fastmap_1.1.0 ps_1.6.0 curl_4.3.2 [22] Rcpp_1.0.8 filelock_1.0.2 cachem_1.0.6 [25] DelayedArray_0.18.0 desc_1.3.0 pkgload_1.2.1 [28] jsonlite_1.8.0 XVector_0.32.0 fs_1.5.0 [31] basilisk_1.4.0 dir.expiry_1.0.0 png_0.1-7 [34] processx_3.5.2 rprojroot_2.0.2 grid_4.1.0 [37] rhdf5filters_1.4.0 here_1.0.1 cli_3.2.0 [40] tools_4.1.0 bitops_1.0-7 magrittr_2.0.2 [43] RCurl_1.98-1.3 crayon_1.5.0 ellipsis_0.3.2 [46] Matrix_1.4-0 prettyunits_1.1.1 Rhdf5lib_1.14.0 [49] R6_2.5.1 compiler_4.1.0 ```

lazappi commented 2 years ago

Ah, that's definitely not what I intended. I clearly didn't look closely enough at the output. I think I know what the issue is so should be a quick fix. Thanks for letting me know!

lazappi commented 2 years ago

Actually I think this was correct. The file you are using doesn't contain any expression data so the dimensions are 2232536 cells by 0 features. This is what I get if I just load it in Python:

AnnData object with n_obs × n_vars = 2232536 × 0
    obs: 'sample', 'study_long', 'study', 'last_author_PI', 'subject_ID', 'sex', 'ethnicity', 'smoking_status', 'BMI', 'condition', 'subject_type', 'sample_type', 'single_cell_platform', "3'_or_5'", 'sequencing_platform', 'cell_ranger_version', 'fresh_or_frozen', 'dataset', 'anatomical_region_level_1', 'anatomical_region_level_2', 'anatomical_region_level_3', 'anatomical_region_highest_res', 'age', 'ann_highest_res', 'n_genes', 'size_factors', 'log10_total_counts', 'mito_frac', 'ribo_frac', 'original_ann_level_1', 'original_ann_level_2', 'original_ann_level_3', 'original_ann_level_4', 'original_ann_level_5', 'original_ann_nonharmonized', 'scanvi_label', 'leiden_1', 'leiden_2', 'leiden_3', 'anatomical_region_ccf_score', 'entropy_study_leiden_3', 'entropy_dataset_leiden_3', 'entropy_subject_ID_leiden_3', 'entropy_original_ann_level_1_leiden_3', 'entropy_original_ann_level_2_clean_leiden_3', 'entropy_original_ann_level_3_clean_leiden_3', 'entropy_original_ann_level_4_clean_leiden_3', 'entropy_original_ann_level_5_clean_leiden_3', 'leiden_4', 'reannotation_type', 'leiden_5', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'ann_coarse_for_GWAS_and_modeling', 'core_ann_level_1', 'core_ann_level_2', 'core_ann_level_3', 'core_ann_level_4', 'core_ann_level_5', 'cells_or_nuclei', 'HLCA_core_or_extension', 'anatomical_region_coarse_unharmonized', 'anatomical_region_detailed_unharmonized', 'ext_transf_ann_level_1', 'ext_transf_ann_level_1_thresholded', 'ext_transf_uncert_level_1', 'ext_transf_ann_level_2', 'ext_transf_ann_level_2_thresholded', 'ext_transf_uncert_level_2', 'ext_transf_ann_level_3', 'ext_transf_ann_level_3_thresholded', 'ext_transf_uncert_level_3', 'ext_transf_ann_level_4', 'ext_transf_ann_level_4_thresholded', 'ext_transf_uncert_level_4', 'ext_transf_ann_level_5', 'ext_transf_ann_level_5_thresholded', 'ext_transf_uncert_level_5'
    uns: 'ann_level_5_colors'
    obsm: 'X_scanvi_emb', 'X_umap'

There is another version of the dataset that contains expression data if you need that.

theislab / zellkonverter

Issue loading H5AD distributed by Theis lab #65

URL: https://beta.fastgenomics.org/datasets/detail-dataset-427f1eee6dd44f50bae1ab13f0f3c6a9#Files

File: HLCA_v1_extended_no_counts.h5ad