stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
328 stars 88 forks source link

FeatureMatrix returns empty matrix after merging atac objects #1228

Closed bio-la closed 2 years ago

bio-la commented 2 years ago

hi, thanks for the great tool! similarly to what was reported in #205 #224 i found that FeatureMatrix returns an empty count matrix after merging the peaks from 3 atac runs.

I am using the Teaseq dataset which provides, for each sample atac modality, an h5 with Peaks and GEX, a metadata file, and a compressed fragment file, with no index.

I generated the index using tabix, and i follow the merging vignette

I went down the rabbit hole until GetCellsInRegion which calls reads <- scanTabix(file = tabix, param = region) for chunks of 1900 regions at a time, finding that reads returns character(0) at every assayed region. I am not sure how to interpret or to fix this. The cell barcodes in the fragment file are definitely the ones assayed in both Peaks and GEX (only, the fragment file is indexed using a hash of the classic 10x barcode, so have to play around with the metadata file )

here is the code i am using

library(Seurat)
library(Signac)
library(GenomicRanges)
setwd("~/Documents/devel/data_test.dir/GSE158013_TEASEQ_RAW/")
sample1 <- Read10X_h5("GSM5123951_X066MP0C1W3_cellranger-arc_filtered_feature_bc_matrix.h5")
sample2 <- Read10X_h5("GSM5123952_X066-MP0C1W4_leukopak_perm-cells_tea_200M_cellranger-arc_filtered_feature_bc_matrix.h5")
sample3 <- Read10X_h5("GSM5123953_X066-MP0C1W5_leukopak_perm-cells_tea_200M_cellranger-arc_filtered_feature_bc_matrix.h5")

getPeaks <- function (data) {
  chr<-sapply(strsplit(rownames(data),":"),"[[",1)
  start<-sapply(strsplit(sapply(strsplit(rownames(data),":"),"[[",2),"-"),"[[",1)
  end<-sapply(strsplit(sapply(strsplit(rownames(data),":"),"[[",2),"-"),"[[",2)
  df<- data.frame(chromosome=chr, start=start, end=end)
  gr<- makeGRangesFromDataFrame(df)
  return (gr)
}

gr.s1 <- getPeaks(data= sample1$Peaks)
gr.s2 <- getPeaks(data= sample2$Peaks)
gr.s3 <- getPeaks(data= sample3$Peaks)

# Create a unified set of peaks to quantify in each dataset
combined.peaks <- reduce(x = c(gr.s1,gr.s2, gr.s3))

# Filter out bad peaks based on length
peakwidths <- width(combined.peaks)
hist(peakwidths, breaks = 100)
length(combined.peaks) # 110111
length(combined.peaks[peakwidths  < 10000 & peakwidths > 20])
combined.peaks = combined.peaks[peakwidths  < 10000 & peakwidths > 20]

#read metadata

s1meta <- read.table("GSM5123951_X066-MP0C1W3_atac_filtered_metadata.csv.gz", sep=",", header=T)

length(unique(s1meta$barcodes))
dim(s1meta)
#7250   22
frags.s1 <- CreateFragmentObject(
  path = "GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_fragments.tsv.gz",
  cells = s1meta$barcodes, 
  validate.fragments = T,
  max.lines = NULL,
  tolerance = 0 
)
s1.counts <-  FeatureMatrix(
  fragments = frags.s1,
  features = combined.peaks,
  cells = s1meta$barcodes
)
range(rowSums(s1.counts))
# 0 0 

Thanks for your help!

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1  IRanges_2.28.0       S4Vectors_0.32.4     BiocGenerics_0.40.0  sp_1.5-0            
 [7] SeuratObject_4.1.2   Seurat_4.1.0         Signac_1.8.0         magrittr_2.0.3       datapasta_3.1.0      forcats_0.5.1       
[13] stringr_1.4.0        dplyr_1.0.10         purrr_0.3.4          readr_2.1.2          tidyr_1.2.1          tibble_3.1.8        
[19] ggplot2_3.3.6        tidyverse_1.3.1     

loaded via a namespace (and not attached):
  [1] readxl_1.4.0           backports_1.4.1        fastmatch_1.1-3        plyr_1.8.7             igraph_1.3.0          
  [6] lazyeval_0.2.2         splines_4.1.2          BiocParallel_1.28.3    listenv_0.8.0          scattermore_0.8       
 [11] digest_0.6.29          htmltools_0.5.2        fansi_1.0.3            tensor_1.5             cluster_2.1.3         
 [16] ROCR_1.0-11            tzdb_0.3.0             globals_0.16.1         Biostrings_2.62.0      modelr_0.1.8          
 [21] matrixStats_0.62.0     spatstat.sparse_2.1-0  colorspace_2.0-3       rvest_1.0.2            ggrepel_0.9.1         
 [26] haven_2.4.3            crayon_1.5.1           RCurl_1.98-1.6         jsonlite_1.8.0         progressr_0.11.0      
 [31] spatstat.data_2.1-4    survival_3.3-1         zoo_1.8-9              glue_1.6.2             polyclip_1.10-0       
 [36] gtable_0.3.1           zlibbioc_1.40.0        XVector_0.34.0         leiden_0.3.9           future.apply_1.9.1    
 [41] abind_1.4-5            scales_1.2.1           DBI_1.1.2              spatstat.random_2.2-0  miniUI_0.1.1.1        
 [46] Rcpp_1.0.9             viridisLite_0.4.1      xtable_1.8-4           reticulate_1.24        spatstat.core_2.4-2   
 [51] bit_4.0.4              htmlwidgets_1.5.4      httr_1.4.2             RColorBrewer_1.1-3     ellipsis_0.3.2        
 [56] ica_1.0-2              pkgconfig_2.0.3        uwot_0.1.11            deldir_1.0-6           dbplyr_2.1.1          
 [61] utf8_1.2.2             tidyselect_1.1.2       rlang_1.0.6            reshape2_1.4.4         later_1.3.0           
 [66] munsell_0.5.0          cellranger_1.1.0       tools_4.1.2            cli_3.4.1              generics_0.1.3        
 [71] broom_0.7.12           ggridges_0.5.3         fastmap_1.1.0          goftest_1.2-3          bit64_4.0.5           
 [76] fs_1.5.2               fitdistrplus_1.1-8     RANN_2.6.1             nlme_3.1-157           pbapply_1.5-0         
 [81] future_1.28.0          mime_0.12              RcppRoll_0.3.0         xml2_1.3.3             hdf5r_1.3.5           
 [86] compiler_4.1.2         rstudioapi_0.13        plotly_4.10.0          png_0.1-7              spatstat.utils_2.3-0  
 [91] reprex_2.0.1           stringi_1.7.8          rgeos_0.5-9            lattice_0.20-45        Matrix_1.5-1          
 [96] vctrs_0.4.1            pillar_1.8.1           lifecycle_1.0.2        spatstat.geom_2.4-0    lmtest_0.9-40         
[101] RcppAnnoy_0.0.19       data.table_1.14.2      cowplot_1.1.1          bitops_1.0-7           irlba_2.3.5           
[106] httpuv_1.6.5           patchwork_1.1.2        R6_2.5.1               promises_1.2.0.1       KernSmooth_2.23-20    
[111] gridExtra_2.3          parallelly_1.32.1      codetools_0.2-18       MASS_7.3-56            assertthat_0.2.1      
[116] withr_2.5.0            sctransform_0.3.3      Rsamtools_2.10.0       GenomeInfoDbData_1.2.7 mgcv_1.8-40           
[121] parallel_4.1.2         hms_1.1.1              rpart_4.1.16           grid_4.1.2             Rtsne_0.15            
[126] shiny_1.7.1            lubridate_1.8.0       
timoast commented 2 years ago

the fragment file is indexed using a hash of the classic 10x barcode, so have to play around with the metadata file

Can you clarify what you mean exactly?

I downloaded this data from GEO and was not able to reproduce this issue, my matrix contained many non-zero counts. However, the metadata file on GEO is called GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_metadata.csv.gz, did you modify the original file? If so please include all steps that you ran.

FYI, the function getPeaks here can be replaced with StringToGRanges() in Signac.

bio-la commented 2 years ago

thanks for your quick reply! this is a snapshot of the fragment file

chr1    10067   10197   69c31d5e3ef411ebb8b442010a19c80f    6
chr1    10067   10407   69c783b23ef411ebb8b442010a19c80f    4
chr1    10073   10203   69c5efa23ef411ebb8b442010a19c80f    12
chr1    10078   10321   69c27dfe3ef411ebb8b442010a19c80f    4
chr1    10084   10192   69c1a5003ef411ebb8b442010a19c80f    2
chr1    10085   10221   69c093183ef411ebb8b442010a19c80f    1
chr1    10085   10279   69c05e023ef411ebb8b442010a19c80f    2
chr1    10085   10309   69c3e4783ef411ebb8b442010a19c80f    2
chr1    10090   10272   69c9f9b23ef411ebb8b442010a19c80f    2
chr1    10091   10278   69c198ee3ef411ebb8b442010a19c80f    2

I only changed the name of the file. I am puzzled, did you run the same code i posted? i can't produce a count matrix for any of the downloaded samples (i didn't modify anything of the other two ) can you share the code you used that worked for you? Thanks!

timoast commented 2 years ago

Here is a simplified example that produces a count matrix using the data from GEO:

library(Seurat)
library(Signac)
library(GenomicRanges)

setwd("~/issue_1228/")
s <- Read10X_h5("GSM5123952_X066-MP0C1W4_leukopak_perm-cells_tea_200M_cellranger-arc_filtered_feature_bc_matrix.h5")
s.meta <- read.table("GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_metadata.csv.gz", sep=",", header=T)
gr <- StringToGRanges(rownames(s$Peaks), sep = c(":", "-"))

frags.s <- CreateFragmentObject(
  path = "GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_fragments.tsv.gz",
  cells = s.meta$barcodes
)
s.counts <-  FeatureMatrix(
  fragments = frags.s,
  features = head(gr, 1000),
  cells = s.meta$barcodes
)
range(rowSums(s.counts))
# 0 22934

I'm still unclear on what you mean by this:

the fragment file is indexed using a hash of the classic 10x barcode, so have to play around with the metadata file

bio-la commented 2 years ago

thanks, I have tried running this code with the original metadata file but i still get 0 counts. what version of Signac/Seurat/Granges are you running? (also, there is a typo and you are reading Read10X_h5("GSM5123952") but metadata and fragments from GSM5123951?)

Sorry for the poor choice of words. What I meant is that column 4 of the fragment file usually has the cellbarcode in the standard 10x format (or at least that's what i would expect), here the name of the cell resembles a hashed string. the cell 69c31d5e3ef411ebb8b442010a19c80f is CTAGGCGGTAGCTGCG-1

chr1    10067   10197   69c31d5e3ef411ebb8b442010a19c80f    6

by play around i mean use s.meta$barcodes instead of calling the colnames on the peak matrix

timoast commented 2 years ago

(also, there is a typo and you are reading Read10X_h5("GSM5123952") but metadata and fragments from GSM5123951?)

That was a typo yes, but the count matrix is only use here to define the genomic ranges being quantified so it doesn't matter.

Here's my session info:

R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stxKidney.SeuratData_0.1.0        pbmcsca.SeuratData_3.0.0          pbmcref.SeuratData_1.0.0         
 [4] pbmcMultiome.SeuratData_0.1.2     pbmc3k.SeuratData_3.1.4           panc8.SeuratData_3.0.2           
 [7] ifnb.SeuratData_3.1.0             hcabm40k.SeuratData_3.0.0         bmcite.SeuratData_0.3.0          
[10] SeuratData_0.2.1                  ggplot2_3.3.6                     BSgenome.Hsapiens.UCSC.hg38_1.4.4
[13] BSgenome_1.62.0                   rtracklayer_1.54.0                Biostrings_2.62.0                
[16] XVector_0.34.0                    EnsDb.Hsapiens.v86_2.99.0         ensembldb_2.18.2                 
[19] AnnotationFilter_1.18.0           GenomicFeatures_1.46.1            AnnotationDbi_1.56.2             
[22] Biobase_2.54.0                    GenomicRanges_1.46.1              GenomeInfoDb_1.30.1              
[25] IRanges_2.28.0                    S4Vectors_0.32.4                  BiocGenerics_0.40.0              
[28] sp_1.5-0                          SeuratObject_4.1.2                Seurat_4.1.1.9006                
[31] Signac_1.8.0                      testthat_3.1.4                   

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                  reticulate_1.26             tidyselect_1.1.2            RSQLite_2.2.16             
  [5] htmlwidgets_1.5.4           grid_4.1.3                  BiocParallel_1.28.3         Rtsne_0.16                 
  [9] munsell_0.5.0               codetools_0.2-18            ica_1.0-3                   future_1.28.0              
 [13] miniUI_0.1.1.1              withr_2.5.0                 spatstat.random_2.2-0       colorspace_2.0-3           
 [17] progressr_0.11.0            filelock_1.0.2              knitr_1.40                  rstudioapi_0.14            
 [21] ROCR_1.0-11                 tensor_1.5                  listenv_0.8.0               MatrixGenerics_1.6.0       
 [25] GenomeInfoDbData_1.2.7      polyclip_1.10-0             bit64_4.0.5                 farver_2.1.1               
 [29] parallelly_1.32.1           vctrs_0.4.1                 generics_0.1.3              xfun_0.33                  
 [33] BiocFileCache_2.2.0         R6_2.5.1                    hdf5r_1.3.5                 bitops_1.0-7               
 [37] spatstat.utils_2.3-1        cachem_1.0.6                DelayedArray_0.20.0         assertthat_0.2.1           
 [41] promises_1.2.0.1            BiocIO_1.4.0                scales_1.2.1                rgeos_0.5-9                
 [45] gtable_0.3.1                globals_0.16.1              goftest_1.2-3               rlang_1.0.5                
 [49] RcppRoll_0.3.0              splines_4.1.3               lazyeval_0.2.2              spatstat.geom_2.4-0        
 [53] yaml_2.3.5                  reshape2_1.4.4              abind_1.4-5                 httpuv_1.6.6               
 [57] tools_4.1.3                 ellipsis_0.3.2              spatstat.core_2.4-4         RColorBrewer_1.1-3         
 [61] ggridges_0.5.3              Rcpp_1.0.9                  plyr_1.8.7                  progress_1.2.2             
 [65] zlibbioc_1.40.0             purrr_0.3.4                 RCurl_1.98-1.8              prettyunits_1.1.1          
 [69] rpart_4.1.16                deldir_1.0-6                pbapply_1.5-0               cowplot_1.1.1              
 [73] zoo_1.8-11                  SummarizedExperiment_1.24.0 ggrepel_0.9.1               cluster_2.1.3              
 [77] magrittr_2.0.3              data.table_1.14.2           scattermore_0.8             lmtest_0.9-40              
 [81] RANN_2.6.1                  ProtGenerics_1.26.0         fitdistrplus_1.1-8          matrixStats_0.62.0         
 [85] pkgload_1.3.0               evaluate_0.16               hms_1.1.2                   patchwork_1.1.2            
 [89] mime_0.12                   xtable_1.8-4                XML_3.99-0.10               gridExtra_2.3              
 [93] compiler_4.1.3              biomaRt_2.50.0              tibble_3.1.8                KernSmooth_2.23-20         
 [97] crayon_1.5.1                htmltools_0.5.3             mgcv_1.8-40                 later_1.3.0                
[101] tidyr_1.2.0                 DBI_1.1.3                   tweenr_2.0.1                dbplyr_2.2.1               
[105] MASS_7.3-56                 rappdirs_0.3.3              Matrix_1.5-1                brio_1.1.2                 
[109] cli_3.3.0                   parallel_4.1.3              igraph_1.3.4                pkgconfig_2.0.3            
[113] GenomicAlignments_1.30.0    plotly_4.10.0               spatstat.sparse_2.1-1       xml2_1.3.3                 
[117] stringr_1.4.1               digest_0.6.29               sctransform_0.3.4           RcppAnnoy_0.0.19           
[121] spatstat.data_2.2-0         rmarkdown_2.15              leiden_0.4.3                fastmatch_1.1-3            
[125] uwot_0.1.14                 restfulr_0.0.15             curl_4.3.2                  shiny_1.7.2                
[129] Rsamtools_2.10.0            rjson_0.2.21                lifecycle_1.0.1             nlme_3.1-159               
[133] jsonlite_1.8.0              viridisLite_0.4.1           fansi_1.0.3                 pillar_1.8.1               
[137] lattice_0.20-45             KEGGREST_1.34.0             fastmap_1.1.0               httr_1.4.4                 
[141] survival_3.4-0              glue_1.6.2                  png_0.1-7                   bit_4.0.4                  
[145] ggforce_0.3.4               stringi_1.7.8               blob_1.2.2                  memoise_2.0.1              
[149] dplyr_1.0.10                irlba_2.3.5                 future.apply_1.9.1
bio-la commented 2 years ago

apart from your Seurat_4.1.1.9006 i don't see discrepancies in the tool versions. I am running R 4.1.2 vs your 4.1.3 but i don't think this is a major issue. I am at a loss for an explanation here. I will try to create a minimal conda env with these precise packages versions (Signac, Granges, Seurat) and R version and check again.

One last thing, can you post the command you used to index the fragment file? I don't get any error with my tbi indices, just to leave no stone unturned...

thank you very much for you patience and support with this.

timoast commented 2 years ago

tabix -p bed <filename>

You could try re-downloading the files from GEO

bio-la commented 2 years ago

downloaded files anew and recreated the tbi indices. No issue this time! I must have corrupted some of those moving from one directory to another.

thank you very much again for your help, I will close the issue now.