stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
327 stars 88 forks source link

TSSEnrichment Issue: 'x' must have 1 or more non-missing values #1506

Closed hcnapier closed 11 months ago

hcnapier commented 1 year ago

Discussed in https://github.com/stuart-lab/signac/discussions/1502

Originally posted by **hcnapier** October 4, 2023 Hello, I'm trying to run analysis on a 10x scATAC dataset using this code: ``` library(Signac) library(Seurat) library(GenomeInfoDb) library(EnsDb.Mmusculus.v79) library(ggplot2) library(patchwork) library(hdf5r) library(biovizBase) library(stringr) set.seed(1234) counts <- Read10X_h5("filtered_peak_bc_matrix.h5") metadata <- read.csv(file = "singlecell.csv", header = T, row.names = 1) pc_assay <- CreateChromatinAssay( counts = counts, sep = c(":", "-"), genome = "mm10", fragments = "fragments.tsv.gz", min.cells = 1 ) pc <- CreateSeuratObject( counts = pc_assay, assay = "peaks", project = "ATAC", meta.data = metadata ) annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79) head(annotations) seqlevelsStyle(annotations) <- 'UCSC' head(annotations) Annotation(pc) <- annotations pc <- TSSEnrichment(pc) ``` But when I get to TSSEnrichment I get this output: ``` Extracting TSS positions Extracting fragments at TSSs |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=29s Computing TSS enrichment score Error in ecdf(x = object$TSS.enrichment) : 'x' must have 1 or more non-missing values ``` I have checked to make sure the chrom name format is the same, as mentioned in [issue #834](https://github.com/stuart-lab/signac/issues/834): ``` > head(Fragments(pc)[[1]]) chrom start end barcode readCount 1 chr1 3000074 3000267 CCTGCTATCAAACCAC-1 2 2 chr1 3000139 3000209 GAGACTTGTAGAAAGG-1 2 3 chr1 3000144 3000386 GCGAGAAGTCTGGGAA-1 1 4 chr1 3000218 3000264 TGTGACAGTACGCAAG-1 1 5 chr1 3000225 3000442 ACAAACCAGCATGATA-1 1 6 chr1 3000225 3000580 GGAGTAGCAGGTCTGC-1 1 ``` ``` > Annotation(pc) GRanges object with 1763965 ranges and 5 metadata columns: seqnames ranges strand | tx_id gene_name gene_id gene_biotype | ENSMUSE00001236884 chr3 3508030-3508332 + | ENSMUST00000108393 Hnf4g ENSMUSG00000017688 protein_coding ENSMUSE00000676606 chr3 3634150-3634347 + | ENSMUST00000108394 Hnf4g ENSMUSG00000017688 protein_coding ENSMUSE00001345708 chr3 3638059-3638230 + | ENSMUST00000108393 Hnf4g ENSMUSG00000017688 protein_coding ENSMUSE00001345708 chr3 3638059-3638230 + | ENSMUST00000108394 Hnf4g ENSMUSG00000017688 protein_coding ENSMUSE00000149313 chr3 3641223-3641317 + | ENSMUST00000108393 Hnf4g ENSMUSG00000017688 protein_coding ... ... ... ... . ... ... ... ... ENSMUST00000082414 chrM 10167-11544 + | ENSMUST00000082414 mt-Nd4 ENSMUSG00000064363 protein_coding ENSMUST00000082418 chrM 11742-13565 + | ENSMUST00000082418 mt-Nd5 ENSMUSG00000064367 protein_coding ENSMUST00000082419 chrM 13552-14070 - | ENSMUST00000082419 mt-Nd6 ENSMUSG00000064368 protein_coding ENSMUST00000082421 chrM 14145-15288 + | ENSMUST00000082421 mt-Cytb ENSMUSG00000064370 protein_coding ENSMUST00000084013 chrM 9877-10173 + | ENSMUST00000084013 mt-Nd4l ENSMUSG00000065947 protein_coding type ENSMUSE00001236884 exon ENSMUSE00000676606 exon ENSMUSE00001345708 exon ENSMUSE00001345708 exon ENSMUSE00000149313 exon ... ... ENSMUST00000082414 cds ENSMUST00000082418 cds ENSMUST00000082419 cds ENSMUST00000082421 cds ENSMUST00000084013 cds ------- seqinfo: 22 sequences (1 circular) from mm10 genome ``` Additionally, I'm using GenomeInfoDb version 1.32.4, which I don't think should present issues. My first thought was that my data quality is poor and all of the cells with TSS enrichment are being sorted out, but that doesn't appear to be the case based on two tests: 1. I created a separate test seurat object with only cells that were filtered out in preprocessing: ``` raw_bx <- raw_counts@Dimnames[2][[1]] filtered_bx <- filtered_counts@Dimnames[2][[1]] filt_out_bx <- setdiff(raw_bx, filtered_bx) head(filt_out_bx) filt_out_counts <- raw_counts[,filt_out_bx] filt_out_assay <- CreateChromatinAssay( counts = counts, sep = c(":", "-"), genome = "mm10", fragments = "fragments.tsv.gz", min.cells = 1, validate.fragments = T, max.lines = NULL, tolerance = 0 ) filt_out <- CreateSeuratObject( counts = filt_out_assay, assay = "peaks", project = "ATAC", meta.data = metadata ) annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79) seqlevelsStyle(annotations) <- 'UCSC' head(annotations) Annotation(filt_out) <- annotations ``` Then I ran TSSEnrichment on this object and got the same error. 2. I ran a quick bedtools intersect with my total peaks.bed file and the tss.bed file from the 10X mm10-atac reference (the same reference I used for alignment in cellranger-atac count) and found many overlaps. ``` > bedtools intersect -a tss.bed -b peaks.bed | head ***** WARNING: File peaks.bed has inconsistent naming convention for record: GL456233.1 38740 39653 chr1 3466586 3466587 ENSMUSG00000089699 . + chr1 3671497 3671498 ENSMUSG00000051951 . - chr1 4785709 4785710 ENSMUSG00000033845 . - chr1 4785738 4785739 ENSMUSG00000033845 . - chr1 4807822 4807823 ENSMUSG00000025903 . + chr1 4807829 4807830 ENSMUSG00000025903 . + chr1 4807895 4807896 ENSMUSG00000025903 . + chr1 4807910 4807911 ENSMUSG00000025903 . + chr1 4857813 4857814 ENSMUSG00000033813 . + chr1 4858037 4858038 ENSMUSG00000033813 . + ``` Because I'm seeing TSS overlaps here, I believe that I should be seeing some TSS enrichment for one of my two mutually exclusive groups (filtered out or filtered in). Does anyone know where this issue is coming from or how I might be able to solve it? Any thoughts would be greatly appreciated! Here is my session info: ``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Ventura 13.4 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_1.1.1 stringr_1.5.0 biovizBase_1.44.0 hdf5r_1.3.8 [5] patchwork_1.1.2 ggplot2_3.4.2 EnsDb.Mmusculus.v79_2.99.0 ensembldb_2.20.2 [9] AnnotationFilter_1.20.0 GenomicFeatures_1.48.4 AnnotationDbi_1.58.0 Biobase_2.56.0 [13] SeuratObject_4.1.3 Seurat_4.3.0.1 Signac_1.10.0 GenomicRanges_1.48.0 [17] GenomeInfoDb_1.32.4 IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0 loaded via a namespace (and not attached): [1] utf8_1.2.2 spatstat.explore_3.0-5 reticulate_1.27 tidyselect_1.2.0 [5] RSQLite_2.2.17 htmlwidgets_1.5.4 grid_4.2.1 BiocParallel_1.30.3 [9] Rtsne_0.16 munsell_0.5.0 codetools_0.2-18 ica_1.0-3 [13] interp_1.1-3 future_1.30.0 miniUI_0.1.1.1 withr_2.5.0 [17] spatstat.random_3.0-1 colorspace_2.0-3 progressr_0.13.0 filelock_1.0.2 [21] knitr_1.40 rstudioapi_0.14 ROCR_1.0-11 tensor_1.5 [25] listenv_0.9.0 MatrixGenerics_1.8.1 GenomeInfoDbData_1.2.8 polyclip_1.10-0 [29] bit64_4.0.5 parallelly_1.33.0 vctrs_0.6.1 generics_0.1.3 [33] xfun_0.33 BiocFileCache_2.4.0 R6_2.5.1 bitops_1.0-7 [37] spatstat.utils_3.0-1 cachem_1.0.6 DelayedArray_0.22.0 assertthat_0.2.1 [41] promises_1.2.0.1 BiocIO_1.6.0 scales_1.2.1 nnet_7.3-17 [45] gtable_0.3.1 globals_0.16.2 goftest_1.2-3 rlang_1.1.0 [49] RcppRoll_0.3.0 splines_4.2.1 rtracklayer_1.56.1 lazyeval_0.2.2 [53] dichromat_2.0-0.1 checkmate_2.1.0 spatstat.geom_3.0-3 yaml_2.3.5 [57] reshape2_1.4.4 abind_1.4-5 backports_1.4.1 httpuv_1.6.6 [61] Hmisc_4.7-2 tools_4.2.1 ellipsis_0.3.2 RColorBrewer_1.1-3 [65] ggridges_0.5.4 Rcpp_1.0.9 plyr_1.8.7 base64enc_0.1-3 [69] progress_1.2.2 zlibbioc_1.42.0 purrr_0.3.4 RCurl_1.98-1.8 [73] prettyunits_1.1.1 rpart_4.1.16 deldir_1.0-6 pbapply_1.6-0 [77] cowplot_1.1.1 zoo_1.8-11 SummarizedExperiment_1.26.1 ggrepel_0.9.1 [81] cluster_2.1.4 magrittr_2.0.3 data.table_1.14.2 scattermore_0.8 [85] lmtest_0.9-40 RANN_2.6.1 ProtGenerics_1.28.0 fitdistrplus_1.1-8 [89] matrixStats_0.62.0 pkgload_1.3.0 hms_1.1.2 mime_0.12 [93] xtable_1.8-4 XML_3.99-0.10 jpeg_0.1-10 gridExtra_2.3 [97] compiler_4.2.1 biomaRt_2.52.0 tibble_3.2.1 KernSmooth_2.23-20 [101] crayon_1.5.1 htmltools_0.5.3 later_1.3.0 Formula_1.2-4 [105] tidyr_1.2.1 DBI_1.1.3 dbplyr_2.2.1 MASS_7.3-58.1 [109] rappdirs_0.3.3 Matrix_1.5-1 cli_3.4.1 parallel_4.2.1 [113] igraph_1.3.5 pkgconfig_2.0.3 GenomicAlignments_1.32.1 foreign_0.8-82 [117] sp_1.5-0 plotly_4.10.1 spatstat.sparse_3.0-0 xml2_1.3.3 [121] XVector_0.36.0 VariantAnnotation_1.42.1 digest_0.6.29 sctransform_0.3.5 [125] RcppAnnoy_0.0.20 spatstat.data_3.0-0 Biostrings_2.64.1 leiden_0.4.3 [129] fastmatch_1.1-3 htmlTable_2.4.1 uwot_0.1.14 restfulr_0.0.15 [133] curl_4.3.2 shiny_1.7.3 Rsamtools_2.12.0 rjson_0.2.21 [137] lifecycle_1.0.3 nlme_3.1-159 jsonlite_1.8.0 BSgenome_1.64.0 [141] viridisLite_0.4.1 fansi_1.0.3 pillar_1.8.1 lattice_0.20-45 [145] KEGGREST_1.36.3 fastmap_1.1.0 httr_1.4.4 survival_3.4-0 [149] glue_1.6.2 png_0.1-7 bit_4.0.4 stringi_1.7.8 [153] blob_1.2.3 latticeExtra_0.6-30 memoise_2.0.1 irlba_2.3.5 [157] future.apply_1.10.0 ```
timoast commented 1 year ago

Hi @hcnapier, not too sure what's causing the problem here. Can you show the output of traceback() run after you get the error?

hcnapier commented 1 year ago

Here's the output of traceback():

4: stop("'x' must have 1 or more non-missing values") 3: ecdf(x = object$TSS.enrichment) 2: TSSFast(object = object, assay = assay, tss.positions = tss.positions, process_n = process_n, verbose = verbose) 1: TSSEnrichment(pc)

timoast commented 11 months ago

Would you mind sharing the object that causes this issue, and I can take a closer look? You can email a link to stuartt@gis.a-star.edu.sg

hcnapier commented 11 months ago

I actually just found out that this problem was due to a problem with demultiplexing when the bcl files were converted to fastqs. I re-ran the demultiplexing using a different software and the issue resolved itself. Thanks for your willingness to help!