stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
330 stars 88 forks source link

Dimensions of feature matrix do not match length of peaks granges #1754

Closed jenellewallace closed 3 months ago

jenellewallace commented 4 months ago

For my dataset I've found that the dimensions of my feature matrix do not match the length of the granges object I used to make it. I cannot figure out why based on the documentation - is any filtering done at this step behind the scenes? I did notice that I had some ranges in my granges object that were exact duplicates (I created the peak set with some custom code because I am interested in consensus peaks across multiple species and didn't realize I had that issue) which I thought might cause this problem, but I removed the duplicates and still the numbers don't quite match. Thanks for any insight!

mcols(all_peaks)$range_id <- paste0(seqnames(all_peaks), "-", start(all_peaks), "-", end(all_peaks))
range_id_table <- table(mcols(all_peaks)$range_id)
duplicate_range_ids <- names(range_id_table[range_id_table > 1])
all_duplicate_indices <- which(mcols(all_peaks)$range_id %in% duplicate_range_ids)
all_peaks_cleaned = all_peaks[-all_duplicate_indices]
macs2_counts <- FeatureMatrix(fragments = frags_list[1],features = all_peaks_cleaned,cells = cells[1:25]);
Extracting reads overlapping genomic regions
> dim(macs2_counts)
[1] 1089998      25
> length(all_peaks_cleaned)
[1] 1090084
> length(unique(all_peaks_cleaned$range_id))
[1] 1090084
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] grid      stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] future_1.33.2                     biovizBase_1.52.0                 BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.72.0                   BiocIO_1.14.0                    
 [6] Biostrings_2.72.1                 XVector_0.44.0                    EnsDb.Hsapiens.v86_2.99.0         ensembldb_2.28.0                  AnnotationFilter_1.28.0          
[11] GenomicFeatures_1.56.0            AnnotationDbi_1.66.0              lubridate_1.9.3                   forcats_1.0.0                     purrr_1.0.2                      
[16] readr_2.1.5                       tibble_3.2.1                      tidyverse_2.0.0                   Matrix_1.7-0                      SingleCellExperiment_1.26.0      
[21] SummarizedExperiment_1.34.0       Biobase_2.64.0                    MatrixGenerics_1.16.0             matrixStats_1.3.0                 stringr_1.5.1                    
[26] data.table_1.15.4                 tidyr_1.3.1                       progress_1.2.3                    Gviz_1.48.0                       BiocParallel_1.38.0              
[31] ggplot2_3.5.1                     rtracklayer_1.64.0                GenomicRanges_1.56.1              GenomeInfoDb_1.40.1               IRanges_2.38.0                   
[36] S4Vectors_0.42.0                  BiocGenerics_0.50.0               dplyr_1.1.4                       Signac_1.13.9003                  Seurat_5.1.0                     
[41] SeuratObject_5.0.2                sp_2.1-4                         

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.36.0      spatstat.sparse_3.1-0    bitops_1.0-7             httr_1.4.7               RColorBrewer_1.1-3       tools_4.4.0             
  [7] sctransform_0.4.1        backports_1.5.0          utf8_1.2.4               R6_2.5.1                 lazyeval_0.2.2           uwot_0.2.2              
 [13] withr_3.0.0              prettyunits_1.2.0        gridExtra_2.3            progressr_0.14.0         cli_3.6.3                spatstat.explore_3.2-7  
 [19] fastDummies_1.7.3        spatstat.data_3.1-2      ggridges_0.5.6           pbapply_1.7-2            Rsamtools_2.20.0         foreign_0.8-86          
 [25] dichromat_2.0-0.1        parallelly_1.37.1        rstudioapi_0.16.0        RSQLite_2.3.7            generics_0.1.3           ica_1.0-3               
 [31] spatstat.random_3.2-3    interp_1.1-6             fansi_1.0.6              abind_1.4-5              lifecycle_1.0.4          yaml_2.3.8              
 [37] SparseArray_1.4.8        BiocFileCache_2.12.0     Rtsne_0.17               blob_1.2.4               promises_1.3.0           crayon_1.5.3            
 [43] miniUI_0.1.1.1           lattice_0.22-6           cowplot_1.1.3            KEGGREST_1.44.1          pillar_1.9.0             knitr_1.47              
 [49] rjson_0.2.21             future.apply_1.11.2      codetools_0.2-20         fastmatch_1.1-4          leiden_0.4.3.1           glue_1.7.0              
 [55] vctrs_0.6.5              png_0.1-8                spam_2.10-0              gtable_0.3.5             cachem_1.1.0             xfun_0.45               
 [61] S4Arrays_1.4.1           mime_0.12                survival_3.7-0           RcppRoll_0.3.0           fitdistrplus_1.1-11      ROCR_1.0-11             
 [67] nlme_3.1-165             bit64_4.0.5              filelock_1.0.3           RcppAnnoy_0.0.22         irlba_2.3.5.1            KernSmooth_2.23-24      
 [73] rpart_4.1.23             colorspace_2.1-0         DBI_1.2.3                Hmisc_5.1-3              nnet_7.3-19              tidyselect_1.2.1        
 [79] bit_4.0.5                compiler_4.4.0           curl_5.2.1               httr2_1.0.1              htmlTable_2.4.2          xml2_1.3.6              
 [85] DelayedArray_0.30.1      plotly_4.10.4            checkmate_2.3.1          scales_1.3.0             lmtest_0.9-40            rappdirs_0.3.3          
 [91] digest_0.6.36            goftest_1.2-3            spatstat.utils_3.0-5     rmarkdown_2.27           htmltools_0.5.8.1        pkgconfig_2.0.3         
 [97] jpeg_0.1-10              base64enc_0.1-3          dbplyr_2.5.0             fastmap_1.2.0            rlang_1.1.4              htmlwidgets_1.6.4       
[103] UCSC.utils_1.0.0         shiny_1.8.1.1            zoo_1.8-12               jsonlite_1.8.8           VariantAnnotation_1.50.0 RCurl_1.98-1.14         
[109] magrittr_2.0.3           Formula_1.2-5            GenomeInfoDbData_1.2.12  dotCall64_1.1-1          patchwork_1.2.0          munsell_0.5.1           
[115] Rcpp_1.0.12              reticulate_1.38.0        stringi_1.8.4            zlibbioc_1.50.0          MASS_7.3-60.0.1          plyr_1.8.9              
[121] parallel_4.4.0           listenv_0.9.1            ggrepel_0.9.5            deldir_2.0-4             splines_4.4.0            tensor_1.5              
[127] hms_1.1.3                igraph_2.0.3             spatstat.geom_3.2-9      RcppHNSW_0.6.0           reshape2_1.4.4           biomaRt_2.60.1          
[133] XML_3.99-0.17            evaluate_0.24.0          latticeExtra_0.6-30      renv_1.0.7               BiocManager_1.30.23      tzdb_0.4.0              
[139] httpuv_1.6.15            RANN_2.6.1               polyclip_1.10-6          scattermore_1.2          xtable_1.8-4             restfulr_0.0.15         
[145] RSpectra_0.16-1          later_1.3.2              viridisLite_0.4.2        memoise_2.0.1            GenomicAlignments_1.40.0 cluster_2.1.6           
[151] timechange_0.3.0         globals_0.16.3 
jenellewallace commented 4 months ago

This is the list of peaks that were excluded:

> excluded_peaks = setdiff(all_peaks_cleaned$range_id, rownames(macs2_counts))
> excluded_peaks
 [1] "chr1_2244X2270_random-16876-17376"   "chr1_1791X1813_random-31640-32140"   "chr1_2295X2322_random-31028-31528"   "chr1_2244X2270_random-2265-2765"    
 [5] "chr1_2244X2270_random-7894-8394"     "chr1_2244X2270_random-13297-13797"   "chr1_2244X2270_random-20176-20676"   "chr1_1791X1813_random-24407-24907"  
 [9] "chr1_1791X1813_random-20165-20665"   "chr1_2244X2270_random-4330-4830"     "chr1_2244X2270_random-8561-9061"     "chr1_2244X2270_random-19298-19798"  
[13] "chr5_2859X2896_random-139511-140011" "chr5_2859X2896_random-133900-134400" "chr5_2859X2896_random-88847-89347"   "chr5_2859X2896_random-72947-73447"  
[17] "chr5_2859X2896_random-64104-64604"   "chr5_2859X2896_random-47989-48489"   "chr5_2859X2896_random-47161-47661"   "chr5_2859X2896_random-12460-12960"  
[21] "chr5_1808X1830_random-2653-3153"     "chr5_1808X1830_random-2064-2564"     "chr5_2859X2896_random-113066-113566" "chr8_2057X2081_random-13865-14365"  
[25] "chr8_2057X2081_random-10998-11498"   "chr8_2057X2081_random-7945-8445"     "chr8_2057X2081_random-11764-12264"   "chr8_2057X2081_random-12431-12931"  
[29] "chr7_1604X1624_random-8651-9151"     "chr7_1604X1624_random-12606-13106"   "chr7_1604X1624_random-13988-14488"   "chr7_1604X1624_random-11363-11863"  
[33] "chr7_1604X1624_random-25349-25849"   "chr7_1604X1624_random-13176-13676"   "chr7_1604X1624_random-11939-12439"   "chr7_1604X1624_random-16427-16927"  
[37] "chr7_1604X1624_random-7978-8478"     "chr7_1604X1624_random-6789-7289"     "chr7_1604X1624_random-10118-10618"   "chr7_1604X1624_random-17397-17897"  
[41] "chr7_1604X1624_random-28097-28597"   "chr7_1604X1624_random-7361-7861"     "chr1_2244X2270_random-3711-4211"     "chr1_2244X2270_random-7590-8090"    
[45] "chr1_2244X2270_random-14545-15045"   "chr1_2244X2270_random-17086-17586"   "chr1_2244X2270_random-22125-22625"   "chr1_2244X2270_random-24775-25275"  
[49] "chr1_1791X1813_random-24507-25007"   "chr1_2244X2270_random-19250-19750"   "chr1_2244X2270_random-8614-9114"     "chr1_2295X2322_random-30844-31344"  
[53] "chr1_1791X1813_random-20241-20741"   "chr1_2244X2270_random-13173-13673"   "chr1_2244X2270_random-1978-2478"     "chr1_1791X1813_random-33068-33568"  
[57] "chr1_2244X2270_random-20182-20682"   "chr1_2244X2270_random-4440-4940"     "chr1_2244X2270_random-12644-13144"   "chr1_2244X2270_random-2532-3032"    
[61] "chr5_2859X2896_random-131444-131944" "chr5_2859X2896_random-83068-83568"   "chr5_2859X2896_random-29745-30245"   "chr5_2859X2896_random-27762-28262"  
[65] "chr5_2859X2896_random-113086-113586" "chr4_1536X1556_random-10954-11454"   "chr4_1536X1556_random-9316-9816"     "chr4_1536X1556_random-13230-13730"  
[69] "chr8_2057X2081_random-10705-11205"   "chr8_2057X2081_random-8037-8537"     "chr8_2057X2081_random-11464-11964"   "chr8_2057X2081_random-12372-12872"  
[73] "chr17_2196X2221_random-27780-28280"  "chr7_1604X1624_random-15644-16144"   "chr7_1604X1624_random-25508-26008"   "chr7_1604X1624_random-17347-17847"  
[77] "chr7_1604X1624_random-16369-16869"   "chr7_1604X1624_random-8002-8502"     "chr7_1604X1624_random-14194-14694"   "chr7_1604X1624_random-6933-7433"    
[81] "chr7_1604X1624_random-10913-11413"   "chr7_1604X1624_random-9045-9545"     "chr7_1604X1624_random-13230-13730"   "chr7_1604X1624_random-11744-12244"  
[85] "chr7_1604X1624_random-10038-10538"   "chr7_1604X1624_random-12506-13006"  

So it appears that the peaks that could not be parsed correctly (underscore in the chromosome name) were excluded. If I want to keep these peaks (in this example, these peaks are on standard chromosomes in other species in my dataset), how would I adjust the chromosome naming?

timoast commented 3 months ago

Any peaks that are on chromosomes that are not present in the fragment file will not be included in the resulting matrix. We could include these in the matrix and fill with zeros, but I am hesitant to do that in case it leads to misleading results when for some technical reason a chromosome is not present in a fragment file. In my opinion the current behaviour is better, as it reflects the reality: these peaks are simply not present in the data, so we don't provide any quantification for them. I agree that we can be clearer about this in the documentation for FeatureMatrix, and I will update it accordingly

jenellewallace commented 3 months ago

Makes sense, thank you!!