smithlabcode / preseq

Software for predicting library complexity and genome coverage in high-throughput sequencing.
https://preseq.readthedocs.io
GNU General Public License v3.0

Best practices with low-input library with high rate of PCR duplicates #66

Closed: jamilla-az closed this issue 1 year ago

jamilla-az commented 1 year ago

I am trying to use Preseq to estimate unique sequence richness with pop_size on a low-input library that has been heavily amplified by PCR. I am concerned that the high rate of PCR duplicates could lead to an inaccurate estimate. Is there anything I should do differently for this kind of library to make the estimates more accurate, or is this not something to be concerned about?

My current approach is to use all of the reads, without removing PCR or optical duplicates, for the estimation.

Thank you for your help!

andrewdavidsmith commented 1 year ago

I don't think I understand the question fully, but if you are uncertain about the accuracy of pop_size, I suggest using lc_extrap directly and looking at the final extrapolation numbers. Here is how that would look using some example data from the repo. I built the preseq binary in a build dir from the root of the cloned repo, then ran the following commands:

$ ./build/preseq lc_extrap -e 1000000000 -H data/SRR1301329_1M_hist.txt | tail -1
999000000.0     146525544.8     131942692.8     205805533.9
$ ./build/preseq lc_extrap -e 10000000000 -H data/SRR1301329_1M_hist.txt | tail -1
9999000000.0    168902221.6     149779484.3     256122314.3
$ ./build/preseq lc_extrap -e 100000000000 -H data/SRR1301329_1M_hist.txt | tail -1
99999000000.0   171518994.8     151829984.2     262543417.5

As you can see, after 1000000000 "reads" the final estimate is 146525544 unique. This increases after extrapolating an additional 10x, and then again after another 10x, reaching 171518994 unique. But I can be fairly confident this is essentially the "population size" because I can look at how much it has converged:

$ ./build/preseq lc_extrap -e 100000000000 -H data/SRR1301329_1M_hist.txt | tail -10
99990000000.0   171518968.2 151829963.5 262543351.7
99991000000.0   171518971.2 151829965.8 262543359.0
99992000000.0   171518974.1 151829968.1 262543366.3
99993000000.0   171518977.1 151829970.4 262543373.6
99994000000.0   171518980.0 151829972.7 262543381.0
99995000000.0   171518983.0 151829975.0 262543388.3
99996000000.0   171518985.9 151829977.3 262543395.6
99997000000.0   171518988.9 151829979.6 262543402.9
99998000000.0   171518991.8 151829981.9 262543410.2
99999000000.0   171518994.8 151829984.2 262543417.5
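
For a rough sense of how much the estimate has converged, here is a quick back-of-the-envelope check (not a preseq feature, just awk arithmetic on the final estimates from the three runs above):

$ # relative increase going from -e 1000000000 to -e 10000000000
$ awk 'BEGIN { printf "%.2f%%\n", 100*(168902221.6-146525544.8)/146525544.8 }'
15.27%
$ # relative increase going from -e 10000000000 to -e 100000000000
$ awk 'BEGIN { printf "%.2f%%\n", 100*(171518994.8-168902221.6)/168902221.6 }'
1.55%

The first 10x adds over 15% more unique molecules, but the next 10x adds well under 2%, which is why the final value can be read as essentially the asymptotic population size.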

Over the years I've found this approach to be robust across a broader set of circumstances than using pop_size directly. pop_size is more accurate in some "typical" situations, but when it is not, using lc_extrap as demonstrated above often works very well.

jamilla-az commented 1 year ago

Thank you for your suggestion. Apologies for being unclear; my question is mostly about whether one needs to control for, or somehow remove, PCR duplicates (different amplicons of the same DNA fragment, as I understand it) prior to using Preseq for estimation.

andrewdavidsmith commented 1 year ago

I see. No, you absolutely must not remove the duplicates prior to using preseq. The point of preseq is to model the duplication rates to make inferences about the population of molecules, including those unsequenced.
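
To make that concrete, here is a minimal sketch of the intended workflow (the file name is a placeholder; -B/-bam and -o/-output are lc_extrap options, though option names can vary between preseq versions, so check preseq lc_extrap --help):

$ # coordinate-sorted BAM with PCR duplicates NOT removed
$ preseq lc_extrap -B -o yield_estimates.txt sorted_with_duplicates.bam

Running lc_extrap on a deduplicated BAM (for example, one produced with samtools markdup -r) would throw away exactly the duplicate counts that preseq fits its model to.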

jamilla-az commented 1 year ago

Thanks, I appreciate you answering what is, in retrospect, a silly question 😅 I was hung up on technical vs biological duplicates (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html) and whether Preseq is meant to work with only biological duplicates, rather than technical ones.