Closed hjack123 closed 4 years ago
Yes, I agree this is an important problem and am working on a way to do cluster-aware peak calling, but this won't be ready anytime soon.
In the mean time, you can split the bam file by cluster and perform peak calling on each file separately, then combine the peaks. I have a package for extracting groups of cells from a bam file here: https://github.com/timoast/sinto
What do you recommend using when re-calling peaks?
Here's a little snippet of code that I wrote. It requires a few things.
Perhaps in the future this could be implemented more elegantly :)
seurat <- readRDS(file = "seurat.RDS") # A seurat with called clusters.
split_BAMS <- function(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample"){
for (n in seq(num_clusters)){
cells <- colnames(Seurat[,Seurat@meta.data$seurat_clusters == as.character(n-1)]) # 0-indexed clusers.
# If you have run the cellranger aggr pipeline, you may need to combine replicates or samples.
# This is accomplished by running on the corrected names in replicates.
# If not, you would only use r1 and not merge, etc.
r1 <- subset(cells, word(cells, start = 2, sep = "-") == "1") # Replicate 1
r2 <- subset(cells, word(cells, start = 2, sep = "-") == "2") # Replicate 2
r2 <- gsub(pattern = "-2", replacement = "-1", r2, fixed = T) # Fix names of Replicate 2
write.table(r1, quote = F,
file = paste0(name, "_Cluster_",n-1,"_R1.txt"), row.names = F, col.names = F)
write.table(r2, quote = F,
file = paste0(name, "_Cluster_",n-1,"_R2.txt"), row.names = F, col.names = F)
system(paste0("subset-bam -b /cellranger-atac-count-Replicate1/outs/possorted_bam.bam -c ./",
name, "_Cluster_", n-1 ,"_R1.txt --cores ",cores," -o /data/R1/outs/bam_subset/",name,"_Cluster_", n-1,"_R1.bam"))
system(paste0("subset-bam -b /cellranger-atac-count-Replicate2/outs/possorted_bam.bam -c ./",
name, "_Cluster_", n-1 ,"_R2.txt --cores ",cores," -o /data/R2/outs/bam_subset/",name,"_Cluster_", n-1,"_R2.bam"))
}
}
split_BAMS(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample")
# If you don't have replicates, no need to merge.
# Update the genome size in MACS2 call!
Merge_and_MACS <- function(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample"){
for (n in seq(num_clusters)){
system(paste0("samtools merge -@ ",cores," /Combined/outs/bam_subset/",
name, "_Cluster_",n-1,".bam ", "/data/R1/outs/bam_subset/",name,"_Cluster_", n-1,"_R1.bam",
" /data/R2/outs/bam_subset/",name,"_Cluster_", n-1,"_R2.bam"))
system(paste0("samtools sort -@ ", cores," /Combined/outs/bam_subset/",
name, "_Cluster_",n-1,".bam > /Combined/outs/bam_subset/",
name, "_Cluster_",n-1,"_sorted.bam"))
system(paste0("samtools index -@ ",cores, " /Combined/outs/bam_subset/",
name, "_Cluster_",n-1,"_sorted.bam"))
system(paste0("macs2 callpeak -t /Combined/outs/bam_subset/",
name, "_Cluster_",n-1,"_sorted.bam", " -n ",name,"_Cluster_",n-1," -f BAM --nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR -g 1218492533"))}}
Merge_and_MACS(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample")
Thanks @Austin-s-h
You could also split the bam file into all the different clusters in one pass using sinto, which would be faster.
Longer-term, I'm interested in building tools to call peaks from the fragment file directly. This would avoid the time-consuming step of splitting the bam file altogether.
Tim,
Your set up for Sinto is wonderful and works great for MACs2. I was wondering how/if we should normalize samples for the number of reads per barcode for cluster peak calling after using Sinto as cells with higher levels of sequencing could disproportionately influence peak calling? With the sequencing depth influencing the number of cells you need to reach bulk levels of peak calling, how do you recommend approaching the differences in cell number between each cluster and its effects on peak calling?
Do you think it would be possible to reintroduce the called peaks from MAC2s for individual clusters combined into a new gene barcode matrix for processing by Signac? This would help the resolution of the differential peak calling, cell typing, visualization purposes, and downstream analysis.
Thanks for your help!
@DrDrLitte I don't think there's much you can do to account for differences in number of cells between samples, it's a limitation that very small clusters may be underpowered to detect those cluster-specific peaks accurately. It's probably better to try to be less stringent in the peak calling to minimize the number of peaks that are missed in those cases.
My suggested workflow, with the caveat that I have not looked into the different options for peak calling in great detail yet, would be the following:
GenomicRanges::reduce
Signac::FeatureMatrix
Thanks Tim! I'll give it a shot!
This is now available on the develop branch (see here for installation instructions).
The CallPeaks()
function can be used to run MACS2 on either the full dataset (default), or separately for different groups of cells and then combine the separate peak calls (by setting the group.by
argument)
for multiome analysis (ATAC + expression), how to assign ATAC peaks to specific clusters?
Hi Tim, Would you consider adding a function to call peaks for each cluster? To see some peaks (from rare cell types) might have been missed from calling from the bulk? Thanks..