smithlabcode / preseq

Software for predicting library complexity and genome coverage in high-throughput sequencing.
https://preseq.readthedocs.io
GNU General Public License v3.0

Duplication and gc_extrap #51

Open nickp60 opened 3 years ago

nickp60 commented 3 years ago

I wanted to note a discrepancy between sequencing data with and without duplicates. I know this is mentioned in the manuscript, issue #31, and issue #24, but from what I can tell, preseq is strongly affected by duplication, particularly as the duplication rate increases. I unfortunately can't share the data presented below, but the experiment consisted of creating a single library from a single-cell amplification. This library was sequenced on both a MiniSeq and a NovaSeq, each to around 5M reads. Notably, the NovaSeq has a known higher duplication rate. I aligned with BWA-MEM to the human genome hg38, and converted both the raw alignment and the deduplicated alignment with bam2mr. In our hands, the MiniSeq run has a ~0.4% duplication rate, and the NovaSeq run has a ~8% duplication rate.

This is the result of preseq's gc_extrap on the miniseq/novaseq raw/deduplicated data.

[image: gc_extrap extrapolation curves for the MiniSeq/NovaSeq raw and deduplicated libraries]

Again, I am sorry that I can't share the data (at least publicly, we might be able to with an NDA with my company). Have you run into anything like this before? It seems contrary to the figure presented in the original paper, so perhaps this is some weird characteristic of novaseq data. I have seen this now across about 20 of these miniseq/novaseq pairs.

Thanks, any input would be appreciated!

vincent-hanlon commented 3 years ago

I've found that Preseq complexity estimates with gc_extrap are only reliable with a >5% duplication rate and with >20,000 reads per library (roughly speaking). Maybe this could explain the discrepancy you're seeing?

The text below is from an email to the developers I sent last summer about this. It explains why I think gc_extrap doesn't work well when libraries aren't sequenced very deeply:

"This morning I noticed something odd that I hope you can comment on. I was using preseq to forecast the expected duplication rates for a set of libraries with deeper sequencing. As a sanity check, I used preseq to calculate the expected duplication rate with the current amount of sequencing, and compared it with the observed duplication rate. I noticed that these two duplication rates matched only if the observed duplication rate was >5% and if the number of aligned reads was >20,000 (See photo of Excel summary, below). Any idea why this might be?

"Some more detail:​ I used the number of aligned reads for each library and the read length to get a value for TOTAL_BASES (the dependent variable for preseq). I found the value for EXPECTED_COVERED_BASES on the complexity curve corresponding to the value for TOTAL_BASES. I then calculated the preseq duplication rate as (TOTAL_BASES - EXPECTED_COVERED_BASES) / TOTAL_BASES. The preseq duplication rate and the observed duplication rate correspond perfectly for most libraries, except for in the cases outlined above."

(I'm a user of Preseq, not a developer)

[image: Excel summary comparing preseq-predicted and observed duplication rates across libraries]

andrewdavidsmith commented 2 years ago

@nickp60 @vincent-hanlon Preseq will not work well if a library is sequenced too shallowly relative to its complexity. The total number of observations required is low, but there needs to be enough information about duplication. This is a theoretical problem: imagine there were no duplicates at all; then the only logical conclusion is that continued sampling would result in everything being a new observation. If the duplication rate is lower than expected under perfectly uniform sampling, similar problems arise. The bootstrap resampling should provide an indication of whether the result is reliable. I know it's been a while, but if you could send me your counts histogram by email, I can try to see if there's a better way for preseq to report when the estimates are potentially unreliable. Thanks.
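The theoretical point above can be illustrated with a toy simulation (this is not preseq's actual estimator). Under a birthday-problem approximation, sampling n reads uniformly from a library of N distinct molecules yields roughly n²/(2N) duplicates, so a naive complexity estimate is N̂ = n²/(2 · duplicates). When a library is sequenced so shallowly that almost no duplicates are observed, the estimate is undefined or wildly unstable:

```python
import random

def naive_complexity(n_reads, n_duplicates):
    """Birthday-problem estimate of library complexity: N ~ n^2 / (2d)."""
    if n_duplicates == 0:
        return float("inf")  # no duplicates: complexity looks unbounded
    return n_reads ** 2 / (2 * n_duplicates)

def count_duplicates(n_reads, library_size, seed=0):
    """Sample reads uniformly from a library and count duplicate draws."""
    rng = random.Random(seed)
    draws = [rng.randrange(library_size) for _ in range(n_reads)]
    return n_reads - len(set(draws))

# Shallow sequencing of a 10M-molecule library: duplicates are likely 0,
# so the estimate is infinite (or enormous if 1-2 duplicates happen to occur).
shallow = count_duplicates(1_000, 10_000_000)
print(naive_complexity(1_000, shallow))

# Deeper sequencing of the same library: hundreds of duplicates,
# and the estimate lands near the true complexity of 10M.
deep = count_duplicates(100_000, 10_000_000)
print(naive_complexity(100_000, deep))
```

Preseq's rational-function extrapolation is far more sophisticated than this, but the underlying limitation is the same: with too little duplication information, the tail of the complexity curve is unconstrained.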