stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
328 stars 88 forks source link

Clarification about CallPeaks and group.by argument #1569

Closed AmosFong1 closed 9 months ago

AmosFong1 commented 10 months ago

Hello, I had a question about how CallPeaks handles merging of MACS2 peak calls grouped by metadata.

In your paper under DNA accessibility data processing writes: "Peak calling was performed for each cell type using the CallPeaks function in Signac, with group.by = ‘celltype’ to call peaks on each predicted cell type separately and combine the resulting peak calls across all cell types."

When you combine the resulting peak calls, do overlapping peaks get merged by genomic ranges reduce, disjoin or some other method?

When I ran CallPeaks(obj, group.by = "variable") on two Seurat objects, where one object is subset for one class of metadata "variable", and the other object contains the full dataset (multiple classes of metadata "variable)", I got two sets of peaks. When I converted the granges object to features, and ran intersection() between the two peaks sets, only ~ 80% of peaks derived from object were exact overlaps of the peaks derived from object 2.

I want to make sure that this is normal.

Thank you!

timoast commented 10 months ago

Hi, please see the documentation for CallPeaks under the parameter combine.peaks (TRUE by default):

Controls whether peak calls from different groups of cells are combined using GenomicRanges::reduce when calling peaks for different groups of cells (group.by parameter). If FALSE, a list of GRanges object will be returned. Note that metadata fields such as the p-value, q-value, and fold-change information for each peak will be lost if combining peaks.

AmosFong1 commented 10 months ago

Hi @timoast, thank you for the prompt reply!

I have a few follow-up questions about CallPeaks.

  1. If there is a list of fragment files associated with my object from merging multiple ChromatinAssays, can I still use group.by to subset the MACS2 calling? I merged multiple Seurat objects upstream, which resulted in a number of Fragment files.
  2. How does Merge + JoinLayers() affect the ChromatinAssay?
  3. I used FindOverlaps on the two sets of genomic ranges derived from the peak calling described above. I found that all the overlapping peaks derived from the subset Seurat object overlapped with genomic ranges derived from the original Seurat object (as expected, given genomicranges::reduce). However, there remain ~8000 extra peaks derived from the original Seurat object, that do not overlap the subset Seurat object, yet show non-zero counts in cells associated with the subset Seurat object? Can you provide some insight into why this may be the case?
timoast commented 10 months ago

If there is a list of fragment files associated with my object from merging multiple ChromatinAssays, can I still use group.by to subset the MACS2 calling? I merged multiple Seurat objects upstream, which resulted in a number of Fragment files.

Yes

How does Merge + JoinLayers() affect the ChromatinAssay?

JoinLayers() is not currently supported in ChromatinAssay objects

I used FindOverlaps on the two sets of genomic ranges derived from the peak calling described above. I found that all the overlapping peaks derived from the subset Seurat object overlapped with genomic ranges derived from the original Seurat object (as expected, given genomicranges::reduce). However, there remain ~8000 extra peaks derived from the original Seurat object, that do not overlap the subset Seurat object, yet show non-zero counts in cells associated with the subset Seurat object? Can you provide some insight into why this may be the case?

If I understand correctly, you have two objects and there are 8000 peaks called in one that are not in the subset object? My guess is that there were not enough fragments in these regions to identify them in the peak calling.

AmosFong1 commented 10 months ago

Hi @timoast,

Yes you have understood my third question exactly. What's strange, is that I used the group.by argument, which if I am understanding correctly, should run MACS2 for the subset object (lets say identity class 1), using the same fragments subsetted from the original object (lets say identity class 1, identity class 2). Is this concerning? I can tell that the correct fragments are being used in both cases because when I run Call peaks, it outputs keeping 0 cell barcodes for all fragment files that are irrelevant for the identity class.

timoast commented 10 months ago

There is a slight difference when setting group.by. When you use the group.by parameter the fragment files are split to only contain the cell barcodes present in that group. When you do not set the group.by parameter, no filtering is applied and the existing fragment files are used. If there are cells that you removed from the dataset (eg, through QC filtering) that have fragments present in the fragment file, these fragments will still be used in the peak calling. To fully remove them you'd need to create a new fragment file that only contains the fragments from the cells you keep.

AmosFong1 commented 10 months ago

@timoast What happens if there are redundant cell barcodes in between fragment files? Will group.by know how to look for sample specific barcodes in each individual fragment file?

I think this is the case in my dataset since I have merged 20 different sequencing runs (of course the barcodes have been made unique upon merging, but fragments remain the same 16 bp long string).

timoast commented 10 months ago

It's ok if there are the same cell barcodes across multiple fragment files, we take care of that in Signac by storing a dictionary of cell names in the fragment file to names in the object, so if you rename cells you don't need to change the fragment file

AmosFong1 commented 9 months ago

Hi @timoast, thank you for the prompt reply!

I have a few follow-up questions about CallPeaks.

  1. If there is a list of fragment files associated with my object from merging multiple ChromatinAssays, can I still use group.by to subset the MACS2 calling? I merged multiple Seurat objects upstream, which resulted in a number of Fragment files.
  2. How does Merge + JoinLayers() affect the ChromatinAssay?
  3. I used FindOverlaps on the two sets of genomic ranges derived from the peak calling described above. I found that all the overlapping peaks derived from the subset Seurat object overlapped with genomic ranges derived from the original Seurat object (as expected, given genomicranges::reduce). However, there remain ~8000 extra peaks derived from the original Seurat object, that do not overlap the subset Seurat object, yet show non-zero counts in cells associated with the subset Seurat object? Can you provide some insight into why this may be the case?

Hi @timoast, I am still confused about this discrepancy between # peaks detected. For both runs of Callpeaks I had used the same group.by argument, and for the subset object I subset for all of one class from the same group.by argument used in Callpeaks. Shouldn't the number of non-zero peaks in both the subset and original object be the same?

timoast commented 9 months ago

Can you show the full code you're running and output of sessionInfo()?

AmosFong1 commented 9 months ago

Hi Tim,

Thanks for your patience.

Below is the code:

# load packages
library(Azimuth)
library(BiocParallel)
library(BSgenome.Hsapiens.UCSC.hg38)
library(CellChat)
library(chromVAR)
library(circlize)
library(ComplexHeatmap)
library(DoubletFinder)
library(EnsDb.Hsapiens.v86)
library(GAMBLR.data)
library(ggalluvial)
library(ggbeeswarm)
library(ggpubr)
library(ggrepel)
library(grid)
library(harmony)
library(HCATonsilData)
library(httpgd)
library(JASPAR2024)
library(jsonlite)
library(languageserver)
library(Matrix)
library(patchwork)
library(RColorBrewer)
library(reticulate)
library(RSQLite)
library(rstatix)
library(scales)
library(Seurat)
library(SeuratData)
library(SeuratDisk)
library(SeuratObject)
library(Signac)
library(SLOcatoR)
library(SoupX)
library(TFBSTools)
library(tidyverse)
library(UpSetR)

# set project directory
project_dir <- "/projects/dscott_prj/amfong/Multiome_DZ/"

# set results directory
results_dir <- "/projects/dscott_scratch/amfong/Multiome_DZ/"

# load filepaths
filepaths <- read.table(paste0(project_dir, "data/metadata/filepaths_Multiome_DZ.csv"), sep = ",", header = TRUE)

# define functions
filter <- dplyr::filter
Matrix <- Matrix::Matrix
select <- dplyr::select

###### PART 1

# load seurat object
seurat_object <- readRDS(paste0(project_dir, "data/Rdata/Multiome_DZ/seurat_object_merged.rds"))

# set default assay
DefaultAssay(seurat_object) <- "ATAC"

# filter for cells that are either tumors or control
seurat_object <- subset(seurat_object, tumor_call | orig.ident %in% c("CLC02413", "CLC02414", "CLC02415", "CLC02506"))

# call peaks using macs2
peaks <- CallPeaks(seurat_object, macs2.path = "/home/amfong/miniconda3/envs/Multiome/bin/macs2", group.by = "orig.ident")

# remove features not found on BSgenome.Hsapiens.UCSC.hg38 standard chromosomes
peaks <- keepStandardChromosomes(peaks, pruning.mode = "coarse")

# remove features in genomic blacklist regions
peaks <- subsetByOverlaps(peaks, ranges = blacklist_hg38_unified, invert = TRUE)

# construct feature by cell matrix from genomic fragments file
macs2_counts <- FeatureMatrix(
    fragments = Fragments(seurat_object),
    features = peaks,
    cells = colnames(seurat_object)
)

# get annotations for hg38
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

# convert chromosome annotations to UCSC convention
seqlevels(annotations) <- paste0("chr", seqlevels(annotations))

# create chromatin assay using the macs2 counts
seurat_object[["peaks"]] <- CreateChromatinAssay(
    counts = macs2_counts,
    fragments = Fragments(seurat_object),
    annotation = annotations
)

# save seurat object
saveRDS(seurat_object, file = paste0(project_dir, "data/Rdata/scratch/seurat_object_tumors_merged_full.rds"))

###### PART 2

# load seurat object
seurat_object <- readRDS(paste0(project_dir, "data/Rdata/Multiome_DZ/seurat_object_merged.rds"))

# filter for cells that are tumors and are not from control samples
vals <- unique(seurat_object$orig.ident)
include <- vals[!vals %in% c("CLC02413", "CLC02414", "CLC02415", "CLC02506")]
seurat_object <- subset(seurat_object, tumor_call & orig.ident %in% include)

# set default assay
DefaultAssay(seurat_object) <- "ATAC"

# call peaks using macs2
peaks <- CallPeaks(seurat_object, macs2.path = "/home/amfong/miniconda3/envs/Multiome/bin/macs2", group.by = "orig.ident")

# remove features not found on BSgenome.Hsapiens.UCSC.hg38 standard chromosomes
peaks <- keepStandardChromosomes(peaks, pruning.mode = "coarse")

# remove features overlapping genomic blacklist regions
peaks <- subsetByOverlaps(peaks, ranges = blacklist_hg38_unified, invert = TRUE)

# create feature by cell matrix from genomic fragments file
macs2_counts <- FeatureMatrix(
    fragments = Fragments(seurat_object),
    features = peaks,
    cells = colnames(seurat_object)
)

# get annotations for hg38
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

# convert chromosome annotations to UCSC convention
seqlevels(annotations) <- paste0("chr", seqlevels(annotations))

# create chromatin assay
seurat_object[["peaks"]] <- CreateChromatinAssay(
    counts = macs2_counts,
    fragments = Fragments(seurat_object),
    annotation = annotations
)

# save results
saveRDS(seurat_object, file = paste0(project_dir, "data/Rdata/scratch/seurat_object_tumors_merged_subset.rds"))

# load datasets
full <- readRDS(paste0(project_dir, "data/Rdata/scratch/seurat_object_tumors_merged_full.rds"))
subset <- readRDS(paste0(project_dir, "data/Rdata/scratch/seurat_object_tumors_merged_subset.rds"))

# get feature matrix
full_matrix <- full[["peaks"]]$data
subset_matrix <- subset[["peaks"]]$data

# check nrow
nrow(full_matrix) # 232612
nrow(subset_matrix) # 226174

# convert features to dataframe
full_features <- str_split_fixed(rownames(full_matrix), "-", 3) %>% as.data.frame()
subset_features <- str_split_fixed(rownames(subset_matrix), "-", 3) %>% as.data.frame()

# convert dataframe to granges
query <- makeGRangesFromDataFrame(subset_features, seqnames.field = "V1", start.field = "V2", end.field = "V3")
subject <- makeGRangesFromDataFrame(full_features, seqnames.field = "V1", start.field = "V2", end.field = "V3")

# find overlaps
hits <- findOverlaps(query, subject, minoverlap = 100) # 226174 hits

# test if all peaks that are not hits in query, are 0 in cells that can be found in query
test <- full_matrix[-subjectHits(hits), ] # note that some of queryHits map to the same subjectHits... 8526 features here
test[, colnames(subset_matrix)] %>% rowSums() # all of these are non-zero!!! 

Below is the output of session info.

> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/amfong/miniconda3/envs/Multiome/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Vancouver
tzcode source: system (glibc)

attached base packages:
[1] grid      stats4    stats     graphics  grDevices datasets  utils    
[8] methods   base     

other attached packages:
 [1] UpSetR_1.4.0                      lubridate_1.9.3                  
 [3] forcats_1.0.0                     stringr_1.5.0                    
 [5] purrr_1.0.2                       readr_2.1.4                      
 [7] tidyr_1.3.0                       tibble_3.2.1                     
 [9] tidyverse_2.0.0                   TFBSTools_1.40.0                 
[11] SoupX_1.6.2                       SLOcatoR_0.0.0.9000              
[13] Signac_1.12.0                     SeuratDisk_0.0.0.9021            
[15] tonsilref.SeuratData_1.0.0        SeuratData_0.2.2.9001            
[17] Seurat_5.0.0                      SeuratObject_5.0.0               
[19] sp_2.1-1                          scales_1.2.1                     
[21] rstatix_0.7.2                     RSQLite_2.3.4                    
[23] reticulate_1.34.0                 RColorBrewer_1.1-3               
[25] patchwork_1.1.3                   Matrix_1.6-1.1                   
[27] languageserver_0.3.16             jsonlite_1.8.7                   
[29] JASPAR2024_0.99.6                 BiocFileCache_2.10.1             
[31] dbplyr_2.4.0                      httpgd_1.3.1                     
[33] HCATonsilData_1.0.0               harmony_1.1.0                    
[35] Rcpp_1.0.11                       ggrepel_0.9.4                    
[37] ggpubr_0.6.0                      ggbeeswarm_0.7.2                 
[39] ggalluvial_0.12.5                 GAMBLR.data_0.1                  
[41] EnsDb.Hsapiens.v86_2.99.0         ensembldb_2.26.0                 
[43] AnnotationFilter_1.26.0           GenomicFeatures_1.54.1           
[45] AnnotationDbi_1.64.0              DoubletFinder_2.0.4              
[47] ComplexHeatmap_2.18.0             circlize_0.4.15                  
[49] chromVAR_1.24.0                   CellChat_2.1.1                   
[51] Biobase_2.62.0                    ggplot2_3.4.4                    
[53] igraph_1.5.1                      dplyr_1.1.3                      
[55] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.70.0                  
[57] rtracklayer_1.62.0                BiocIO_1.12.0                    
[59] Biostrings_2.70.1                 XVector_0.42.0                   
[61] GenomicRanges_1.54.1              GenomeInfoDb_1.38.0              
[63] IRanges_2.36.0                    S4Vectors_0.40.1                 
[65] BiocGenerics_0.48.0               BiocParallel_1.36.0              
[67] Azimuth_0.5.0                     shinyBS_0.61.1                   

loaded via a namespace (and not attached):
  [1] ica_1.0-3                     plotly_4.10.3                
  [3] zlibbioc_1.48.0               tidyselect_1.2.0             
  [5] bit_4.0.5                     doParallel_1.0.17            
  [7] clue_0.3-65                   lattice_0.22-5               
  [9] rjson_0.2.21                  blob_1.2.4                   
 [11] rngtools_1.5.2                S4Arrays_1.2.0               
 [13] parallel_4.3.2                caret_6.0-94                 
 [15] seqLogo_1.68.0                png_0.1-8                    
 [17] cli_3.6.1                     registry_0.5-1               
 [19] ProtGenerics_1.34.0           goftest_1.2-3                
 [21] gargle_1.5.2                  BiocNeighbors_1.20.1         
 [23] ggnetwork_0.5.12              uwot_0.1.16                  
 [25] curl_5.1.0                    mime_0.12                    
 [27] evaluate_0.23                 leiden_0.4.3                 
 [29] stringi_1.7.12                pROC_1.18.5                  
 [31] backports_1.4.1               XML_3.99-0.14                
 [33] httpuv_1.6.12                 magrittr_2.0.3               
 [35] rappdirs_0.3.3                splines_4.3.2                
 [37] RcppRoll_0.3.0                prodlim_2023.08.28           
 [39] DT_0.30                       sctransform_0.4.1            
 [41] DBI_1.1.3                     HDF5Array_1.30.0             
 [43] jquerylib_0.1.4               withr_2.5.2                  
 [45] class_7.3-22                  systemfonts_1.0.5            
 [47] rprojroot_2.0.3               lmtest_0.9-40                
 [49] BiocManager_1.30.22           htmlwidgets_1.6.2            
 [51] fs_1.6.3                      biomaRt_2.58.0               
 [53] SingleCellExperiment_1.24.0   statnet.common_4.9.0         
 [55] SparseArray_1.2.0             cellranger_1.1.0             
 [57] MatrixGenerics_1.14.0         annotate_1.80.0              
 [59] zoo_1.8-12                    JASPAR2020_0.99.10           
 [61] knitr_1.45                    network_1.18.2               
 [63] timechange_0.2.0              TFMPvalue_0.0.9              
 [65] foreach_1.5.2                 fansi_1.0.5                  
 [67] caTools_1.18.2                data.table_1.14.8            
 [69] timeDate_4022.108             rhdf5_2.46.0                 
 [71] R.oo_1.25.0                   poweRlaw_0.70.6              
 [73] RSpectra_0.16-1               irlba_2.3.5.1                
 [75] fastDummies_1.7.3             ellipsis_0.3.2               
 [77] lazyeval_0.2.2                yaml_2.3.7                   
 [79] survival_3.5-7                SpatialExperiment_1.12.0     
 [81] scattermore_1.2               BiocVersion_3.18.0           
 [83] crayon_1.5.2                  RcppAnnoy_0.0.21             
 [85] progressr_0.14.0              later_1.3.1                  
 [87] ggridges_0.5.4                codetools_0.2-19             
 [89] base64enc_0.1-3               GlobalOptions_0.1.2          
 [91] KEGGREST_1.42.0               Rtsne_0.16                   
 [93] shape_1.4.6                   Rsamtools_2.18.0             
 [95] filelock_1.0.2                pkgconfig_2.0.3              
 [97] xml2_1.3.5                    GenomicAlignments_1.38.0     
 [99] spatstat.sparse_3.0-3         viridisLite_0.4.2            
[101] gridBase_0.4-7                xtable_1.8-4                 
[103] car_3.1-2                     plyr_1.8.9                   
[105] httr_1.4.7                    tools_4.3.2                  
[107] globals_0.16.2                hardhat_1.3.0                
[109] beeswarm_0.4.0                broom_1.0.5                  
[111] nlme_3.1-163                  hdf5r_1.3.8                  
[113] ExperimentHub_2.10.0          shinyjs_2.1.0                
[115] digest_0.6.33                 tzdb_0.4.0                   
[117] reshape2_1.4.4                ModelMetrics_1.2.2.2         
[119] DirichletMultinomial_1.44.0   rpart_4.1.21                 
[121] glue_1.6.2                    cachem_1.0.8                 
[123] polyclip_1.10-6               generics_0.1.3               
[125] googledrive_2.1.1             presto_1.0.0                 
[127] parallelly_1.36.0             here_1.0.1                   
[129] RcppHNSW_0.5.0                carData_3.0-5                
[131] pbapply_1.7-2                 SummarizedExperiment_1.32.0  
[133] spam_2.10-0                   utf8_1.2.4                   
[135] gower_1.0.1                   gtools_3.9.4                 
[137] ggsignif_0.6.4                gridExtra_2.3                
[139] shiny_1.7.5.1                 lava_1.7.2.1                 
[141] GenomeInfoDbData_1.2.11       R.utils_2.12.2               
[143] rhdf5filters_1.14.1           RCurl_1.98-1.12              
[145] memoise_2.0.1                 rmarkdown_2.25               
[147] R.methodsS3_1.8.2             googlesheets4_1.1.1          
[149] future_1.33.0                 svglite_2.1.2                
[151] RANN_2.6.1                    renv_1.0.3                   
[153] spatstat.data_3.0-3           cluster_2.1.4                
[155] spatstat.utils_3.0-4          hms_1.1.3                    
[157] fitdistrplus_1.1-11           munsell_0.5.0                
[159] cowplot_1.1.1                 colorspace_2.1-0             
[161] FNN_1.1.3.2                   rlang_1.1.1                  
[163] ipred_0.9-14                  dotCall64_1.1-0              
[165] shinydashboard_0.7.2          xfun_0.41                    
[167] coda_0.19-4                   sna_2.7-2                    
[169] CNEr_1.38.0                   recipes_1.0.8                
[171] iterators_1.0.14              matrixStats_1.0.0            
[173] abind_1.4-5                   interactiveDisplayBase_1.40.0
[175] Rhdf5lib_1.24.0               bitops_1.0-7                 
[177] ps_1.7.5                      promises_1.2.1               
[179] DelayedArray_0.28.0           GO.db_3.18.0                 
[181] compiler_4.3.2                prettyunits_1.2.0            
[183] listenv_0.9.0                 AnnotationHub_3.10.0         
[185] tensor_1.5                    MASS_7.3-60                  
[187] progress_1.2.2                spatstat.random_3.2-1        
[189] R6_2.5.1                      fastmap_1.1.1                
[191] fastmatch_1.1-4               vipor_0.4.7                  
[193] ROCR_1.0-11                   nnet_7.3-19                  
[195] gtable_0.3.4                  KernSmooth_2.23-22           
[197] miniUI_0.1.1.1                deldir_1.0-9                 
[199] htmltools_0.5.6.1             bit64_4.0.5                  
[201] spatstat.explore_3.2-5        lifecycle_1.0.3              
[203] processx_3.8.2                callr_3.7.3                  
[205] restfulr_0.0.15               sass_0.4.7                   
[207] vctrs_0.6.4                   spatstat.geom_3.2-7          
[209] NMF_0.26                      future.apply_1.11.0          
[211] pracma_2.4.4                  bslib_0.5.1                  
[213] pillar_1.9.0                  magick_2.8.1                 
[215] GetoptLong_1.0.5   
timoast commented 9 months ago

It looks like you're using different subsets of cells here:

tumor_call | orig.ident %in% c("CLC02413", "CLC02414", "CLC02415", "CLC02506")

vs

tumor_call & orig.ident %in% include
AmosFong1 commented 9 months ago

Hi Tim, it looks confusing here but in tumor_call | orig.ident %in% I am filtering my dataset for only tumors or my control samples. In the tumor_call & orig.ident %in% include I am filtering my dataset for only tumors, and removing the control samples. The remaining samples which are not in c("CLC02413", "CLC02414", "CLC02415", "CLC02506") are the same cells in both datasets. I even used count() to count the number of each orig.ident, and confirmed same cell numbers.

timoast commented 9 months ago

So the set of cells is different, I guess you're saying that the peak calls for the tumor cell subsets are slightly different when the object contains other cells? The peak calls will get combined across the multiple cell groups (see the combine.peaks parameter) so that might explain slight differences.

AmosFong1 commented 9 months ago

Hi @timoast I agree that this probably explains why the peak calls are different, I was just having trouble wrapping my head how it would work out. I first noticed it when I got slightly different DARs, and DA results for chromVAR motifs. Anyways, thank you for your help!