jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

More reads, less peaks #33

Closed daikisato12 closed 4 years ago

daikisato12 commented 4 years ago

Hi there,

We're now trying to see saturation pattern to check how deep we should sequence by downsampling bam files by sambamba. However, in spite of my understanding that we would get more peaks as we have more reads (as discussed in previous issues #10 #11 ), we observed the opposite result in our data. Do you have any idea why we get these results? or is there any options that I should have specified?

Here is a command that I used. Genrich -t sample1.sorted_byreadname.bam -o sample1.narrowPeak -m 20 -x -j -y -r -e chrM -v

I hope some results below would be any help to answer this question.

BAM records analyzed: 3263697 Unmapped: 1972923 MAPQ < 20: 274860 Paired alignments: 1008858 secondary alns: 304 "orphan" alns: 4648 Warning! Unpaired alignments: 7056 secondary alns: 112 PCR duplicates -- Paired aln sets: 502105 duplicates: 180046 (35.9%) Discordant aln sets: 1141 duplicates: 49 (4.3%) Singleton aln sets: 4662 duplicates: 1088 (23.3%) Fragments analyzed: 327817 Full fragments: 322059 Half fragments: 5758 (from unpaired alns) ATAC-seq cut sites: 649876 (expanded to length 100bp) control file #0 not provided - Background pileup value: 0.073978 Peak-calling parameters: Genome length: 786308905bp Significance threshold: -log(p) > 2.000 Min. AUC: 200.000 Max. gap between sites: 100bp Peaks identified: 122396 (34801703bp)

And apparently we get higher background pileup value as we use more reads, which would be the reason of this symptom, but is there any way to keep this value stable?

Proportion of used reads Number of peaks Background pileup value 0.1 122396 0.073978 0.2 42237 0.141181 0.3 30031 0.203330 0.4 26876 0.260777 0.5 24528 0.313687 0.6 23754 0.362484 0.7 23183 0.407876 0.8 22516 0.450062 0.9 21793 0.489142 1 21787 0.525398

jsh58 commented 4 years ago

Thanks for the question, and for including the log files. Here are a few thoughts:

daikisato12 commented 4 years ago

Thanks John!

Although the problem regarding "orphan" alns: ** Warning! ** was solved by changing aligner to Bowtie2 from BWA(-MEM), I still have similar issue like below even after removing duplicate reads (samtools markdup) before peak calling.

The length of peaks, not just the number, is probably more relevant here

You may be right, because we can see some rise in total length of peaks as we use more reads at some point though I don't know what this means.

If the background gets low enough (such as was seen in #29), a single read (cut site) can be considered significant. That was likely the case in your 10% downsampling example.

To be honest, I have no idea about quality assessment of ATAC-seq reads. Simply because I have seen the saturation analysis before, I wanted to see that in our data set too. So do you think we can not get reliable results in our case that a single read (cut site) can be considered significant?

Proportion Peaks Length_of_peaks Background_pileup 0.1 161167 33970521 0.047955 0.2 63120 22959931 0.094933 0.3 40851 18423198 0.141231 0.4 26392 8131498 0.186956 0.5 23598 8582471 0.232592 0.6 21494 8944253 0.277774 0.7 19983 9290546 0.322837 0.8 19161 6937310 0.367794 0.9 18905 7405495 0.412583 1 18674 7857703 0.457352

Best, Daiki

jsh58 commented 4 years ago

So do you think we can not get reliable results in our case that a single read (cut site) can be considered significant?

If your background level is so low that a single observation achieves significance, I would not consider that a reasonable result.

Simply because I have seen the saturation analysis before, I wanted to see that in our data set too.

The analysis may have been more appropriate for the other dataset. Again, your background level is already so low (0.457) that subsetting is unlikely to be informative.

daikisato12 commented 4 years ago

Thank you so much! I would explore more our data.

Best reagards,

Daiki