stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
328 stars 88 forks source link

peak calling #261

Closed guoyanzhao closed 4 years ago

guoyanzhao commented 4 years ago

Hello, I'm new to scATAC analysis. I watched your presentation in the recent Illumina event and would like to use signac in our analysis. You mentioned that signac does a better job in calling peaks. However I could not find example code to do that in your website. It seems you are using CellRanger peak count matrix and the fragment file as input to construct the chromatin assay and seurat object instead of using the peaks called by your method. Maybe I'm missing something important. could you please explain? Thank you very much! Guoyan

timoast commented 4 years ago

Hi, the new peak calling functionality is currently available on the develop branch (not in the current CRAN release). See the website for instructions for installing the develop branch.

The function is CallPeaks(). I don't have examples on the website yet but it's quite straightforward to run, and you can look at the function documentation. By default it will run MACS2 on all cells combined and return a set of genomic ranges, which you can then quantify using FeatureMatrix(). Alternatively, you can call peaks for different groups of cells separately by using the group.by parameter. This will run MACS2 separately on each group of cells, then combine the peak calls using GenomicRanges::reduce().

guoyanzhao commented 4 years ago

Hi Tim, Thank you very much for your prompt response! I'm wondering weather you redo cell calling after using your new peak calling method. Right now you are using the peak-cell matrix from CellRanger output as your input, correct? I feel there are some serious  issues with 10X method of peak calling and cell calling (please see below for explanation). Do you mind share your experience on how to improve the cell calling step? This is how 10X  peak calling algorithm works:The window size, the odds-ratio and the distance to merge peaks were selected to maximize the area under the receiver-operator curve when the peaks are compared to a high-quality list of DNase hypersensitive sites for GM12878 from ENCODE, and also to produce satisfactory clustering metrics as well as cell type identification in a set of peripheral blood mononuclear cells (PBMC) libraries. The problem: For other cell types and tissues the parameters may not work because they will have open chromatin regions very different from the cell types that the parameters were optimized on. 

This is how 10X cell calling algorithm works: Having determined peaks prior to this, we use the number of fragments that overlap any peak regions, for each barcode, to separate the signal from noise. This works better in practice as compared to naively using the number of fragments per barcode.  The problem: The cell calling step depends on the peak calling results. if the peak calling result is not optimal then the cell calling results will be bad. 

Thank you very much! Best,Guoyan

On Monday, September 28, 2020, 05:56:46 PM CDT, Tim Stuart notifications@github.com wrote:

Hi, the new peak calling functionality is currently available on the develop branch (not in the current CRAN release). See the website for instructions for installing the develop branch.

The function is CallPeaks(). I don't have examples on the website yet but it's quite straightforward to run, and you can look at the function documentation. By default it will run MACS2 on all cells combined and return a set of genomic ranges, which you can then quantify using FeatureMatrix(). Alternatively, you can call peaks for different groups of cells separately by using the group.by parameter. This will run MACS2 separately on each group of cells, then combine the peak calls using GenomicRanges::reduce().

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

timoast commented 4 years ago

To find the total number of fragments per cell (not the number of fragments in peaks, but the total), you can run CountFragments(). You can then use those counts to set an initial cutoff of cells to include in the peak quantification using FeatureMatrix(). We don't have an automated inflection-point-based method of calling cells in Signac, but I've found this is only really needed for automated analysis. You can look at the distribution of counts and decide where to draw a reasonable total fragment cutoff for your dataset.

Rough example:

totals <- CountFragments(fragments = <path.to.fragfile>)
hist(log10(totals$frequency_count), breaks = 100)
cells.use <- totals[totals$frequency_count > 1000, "CB"]
counts <- FeatureMatrix(fragments = <fragment.object>, features = <peaks>, cells = cells.use)
guoyanzhao commented 4 years ago

Thank you very much Tim! I'll give it a try.  On Tuesday, September 29, 2020, 09:34:14 AM CDT, Tim Stuart notifications@github.com wrote:

To find the total number of fragments per cell (not the number of fragments in peaks, but the total), you can run CountFragments(). You can then use those counts to set an initial cutoff of cells to include in the peak quantification using FeatureMatrix(). We don't have an automated inflection-point-based method of calling cells in Signac, but I've found this is only really needed for automated analysis. You can look at the distribution of counts and decide where to draw a reasonable total fragment cutoff for your dataset.

Rough example: totals <- CountFragments(fragments = ) hist(log10(totals$frequency_count), breaks = 100) cells.use <- totals[totals$frequency_count > 1000, "CB"] counts <- FeatureMatrix(fragments = , features = , cells = cells.use) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.