Closed jamilla-az closed 1 year ago
I don't think I understand the question fully, but if you are uncertain about the accuracy of pop_size
I suggest to use lc_extrap
directly and look at the final extrapolation numbers. Here is how that would look using some example data from the repo. I build the preseq
binary in a build
dir from the root of the cloned repo. Then I did the following commands:
$ ./build/preseq lc_extrap -e 1000000000 -H data/SRR1301329_1M_hist.txt | tail -1
999000000.0 146525544.8 131942692.8 205805533.9
$ ./build/preseq lc_extrap -e 10000000000 -H data/SRR1301329_1M_hist.txt | tail -1
9999000000.0 168902221.6 149779484.3 256122314.3
$ ./build/preseq lc_extrap -e 100000000000 -H data/SRR1301329_1M_hist.txt | tail -1
99999000000.0 171518994.8 151829984.2 262543417.5
as you can see, after 1000000000
"reads" the final estimate is 146525544
unique. This increases after extrapolating an additional 10x, and then again after another 10x reaching 171518994
unique. But I can be fairly confident this is essentially the "population size" because I can look at how much it has converged:
$ ./build/preseq lc_extrap -e 100000000000 -H data/SRR1301329_1M_hist.txt | tail -10
99990000000.0 171518968.2 151829963.5 262543351.7
99991000000.0 171518971.2 151829965.8 262543359.0
99992000000.0 171518974.1 151829968.1 262543366.3
99993000000.0 171518977.1 151829970.4 262543373.6
99994000000.0 171518980.0 151829972.7 262543381.0
99995000000.0 171518983.0 151829975.0 262543388.3
99996000000.0 171518985.9 151829977.3 262543395.6
99997000000.0 171518988.9 151829979.6 262543402.9
99998000000.0 171518991.8 151829981.9 262543410.2
99999000000.0 171518994.8 151829984.2 262543417.5
Over the years I've found this approach to be robust across a broader set of circumstances than using pop_size
directly. The pop_size
is more accurate in some "typical" situations, but when it is not accurate, using lc_extrap
in the way I demonstrated above often works very well.
Thank you for your suggestion. Apologies for being unclear, my question is mostly about whether one needs to control for or somehow remove PCR duplicates (different amplicons of the same DNA fragment as I understand it) prior to using Preseq for estimation.
I see. No, you absolutely must not remove the duplicates prior to using preseq. The point of preseq is to model the duplication rates to make inferences about the population of molecules, including those unsequenced.
Thanks, I appreciate you answering what is in retrospect a silly question 😅 I was hung up on technical vs biological duplicates (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/8%20Duplicate%20Sequences.html) and if Preseq is meant to work with only biological duplicates, rather than technical ones.
I am trying to use Preseq to estimate the unique sequence richness using
pop_size
on a low input library that has been highly amplified by PCR. I am concerned that the high rate of PCR duplicates could be leading to a wrong estimation. I was wondering if there is anything I need to implement for this particular library to make the estimates more accurate or if this isn’t something to be concerned about.My current approach is to use all of the reads without removal of PCR or optical duplicates for estimation.
Thank you for your help!