stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
331 stars 88 forks source link

calling peaks on each cluster #74

Closed hjack123 closed 4 years ago

hjack123 commented 4 years ago

Hi Tim, Would you consider adding a function to call peaks for each cluster? To see some peaks (from rare cell types) might have been missed from calling from the bulk? Thanks..

timoast commented 4 years ago

Yes, I agree this is an important problem and am working on a way to do cluster-aware peak calling, but this won't be ready anytime soon.

In the mean time, you can split the bam file by cluster and perform peak calling on each file separately, then combine the peaks. I have a package for extracting groups of cells from a bam file here: https://github.com/timoast/sinto

Connorr0 commented 4 years ago

What do you recommend using when re-calling peaks?

timoast commented 4 years ago

I haven't compared between methods, but you could try MACS2 or Genrich

Austin-s-h commented 4 years ago

Here's a little snippet of code that I wrote. It requires a few things.

  1. 10X cellranger-count or cellranger aggr as your input.
  2. The tool from 10X genomics called 'subset-bam' either on your $PATH or referenced absolutely. (This could be replaced by the sinto tool)
  3. MACS2 installed and on your $PATH
  4. Some minor tweaks for file locations depending on your situation.
  5. Create empty folders for the output.

Perhaps in the future this could be implemented more elegantly :)

seurat <- readRDS(file = "seurat.RDS") # A seurat with called clusters.

split_BAMS <- function(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample"){
  for (n in seq(num_clusters)){
    cells <- colnames(Seurat[,Seurat@meta.data$seurat_clusters == as.character(n-1)]) # 0-indexed clusers.
    # If you have run the cellranger aggr pipeline, you may need to combine replicates or samples.
    # This is accomplished by running on the corrected names in replicates.
    # If not, you would only use r1 and not merge, etc.
    r1 <- subset(cells, word(cells, start = 2, sep = "-") == "1") # Replicate 1
    r2 <- subset(cells, word(cells, start = 2, sep = "-") == "2") # Replicate 2
    r2 <- gsub(pattern = "-2", replacement = "-1", r2, fixed = T) # Fix names of Replicate 2
    write.table(r1, quote = F,
                file = paste0(name, "_Cluster_",n-1,"_R1.txt"), row.names = F, col.names = F)
    write.table(r2, quote = F,
                file = paste0(name, "_Cluster_",n-1,"_R2.txt"), row.names = F, col.names = F)
    system(paste0("subset-bam -b /cellranger-atac-count-Replicate1/outs/possorted_bam.bam -c ./",
                  name, "_Cluster_", n-1 ,"_R1.txt --cores ",cores," -o /data/R1/outs/bam_subset/",name,"_Cluster_", n-1,"_R1.bam"))
    system(paste0("subset-bam -b /cellranger-atac-count-Replicate2/outs/possorted_bam.bam -c ./",
                  name, "_Cluster_", n-1 ,"_R2.txt --cores ",cores," -o /data/R2/outs/bam_subset/",name,"_Cluster_", n-1,"_R2.bam"))
  }
}
split_BAMS(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample")

# If you don't have replicates, no need to merge.
# Update the genome size in MACS2 call!
Merge_and_MACS <- function(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample"){
  for (n in seq(num_clusters)){
    system(paste0("samtools merge -@ ",cores," /Combined/outs/bam_subset/",
                  name, "_Cluster_",n-1,".bam ", "/data/R1/outs/bam_subset/",name,"_Cluster_", n-1,"_R1.bam",
                  " /data/R2/outs/bam_subset/",name,"_Cluster_", n-1,"_R2.bam"))
    system(paste0("samtools sort -@ ", cores," /Combined/outs/bam_subset/",
                  name, "_Cluster_",n-1,".bam > /Combined/outs/bam_subset/",
                  name, "_Cluster_",n-1,"_sorted.bam"))
    system(paste0("samtools index -@ ",cores, " /Combined/outs/bam_subset/",
                  name, "_Cluster_",n-1,"_sorted.bam"))
    system(paste0("macs2 callpeak -t /Combined/outs/bam_subset/",
                  name, "_Cluster_",n-1,"_sorted.bam", " -n ",name,"_Cluster_",n-1," -f BAM --nomodel --shift 100 --ext 200 --qval 5e-2 -B --SPMR -g 1218492533"))}}

Merge_and_MACS(Seurat = seurat, num_clusters = 8, cores = 64, name = "Sample")
timoast commented 4 years ago

Thanks @Austin-s-h

You could also split the bam file into all the different clusters in one pass using sinto, which would be faster.

Longer-term, I'm interested in building tools to call peaks from the fragment file directly. This would avoid the time-consuming step of splitting the bam file altogether.

DrDrLitte commented 4 years ago

Tim,

Your set up for Sinto is wonderful and works great for MACs2. I was wondering how/if we should normalize samples for the number of reads per barcode for cluster peak calling after using Sinto as cells with higher levels of sequencing could disproportionately influence peak calling? With the sequencing depth influencing the number of cells you need to reach bulk levels of peak calling, how do you recommend approaching the differences in cell number between each cluster and its effects on peak calling?

Do you think it would be possible to reintroduce the called peaks from MAC2s for individual clusters combined into a new gene barcode matrix for processing by Signac? This would help the resolution of the differential peak calling, cell typing, visualization purposes, and downstream analysis.

Thanks for your help!

timoast commented 4 years ago

@DrDrLitte I don't think there's much you can do to account for differences in number of cells between samples, it's a limitation that very small clusters may be underpowered to detect those cluster-specific peaks accurately. It's probably better to try to be less stringent in the peak calling to minimize the number of peaks that are missed in those cases.

My suggested workflow, with the caveat that I have not looked into the different options for peak calling in great detail yet, would be the following:

  1. Cluster cells based on bulk-level peak calling or genome bins
  2. Split bam file by cluster using sinto
  3. Call peaks on each cluster using genrich with the recommended setting for bulk ATAC-seq
  4. Unify peaks from each cluster using GenomicRanges::reduce
  5. Quantify the unified set of peaks for all cells using Signac::FeatureMatrix
  6. Proceed with standard analysis (LSI, new clustering, annotate cell types, etc)
DrDrLitte commented 4 years ago

Thanks Tim! I'll give it a shot!

timoast commented 4 years ago

This is now available on the develop branch (see here for installation instructions).

The CallPeaks() function can be used to run MACS2 on either the full dataset (default), or separately for different groups of cells and then combine the separate peak calls (by setting the group.by argument)

slives-lab commented 1 year ago

for multiome analysis (ATAC + expression), how to assign ATAC peaks to specific clusters?