smithlabcode / preseq

Software for predicting library complexity and genome coverage in high-throughput sequencing.
https://preseq.readthedocs.io
GNU General Public License v3.0

Support for fastq? [Feature Request] #36

Closed MatthewRalston closed 6 years ago

MatthewRalston commented 6 years ago

Hi Tim, love your paper.

We ran into issues similar to #28 after converting a FASTQ file to unaligned SAM/BAM with BBMap. I understand that coordinate-based pileup is the more efficient option for the UMI, but it limits the applicability of preseq across NGS applications. The obvious alternative, storing the sequences along with their counts in a hashmap, could be prohibitively expensive.

Alternatively, you could 2-bit encode each read, in the manner of the Python library kPAL. You could then generate the counts vector from unique sequences instead of positions, which would be valuable for highly uniform libraries like 16S and other amplicon sequencing, where mapping is less relevant, accurate, or possible. The modeling/rarefaction process would probably stay much the same.
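
To make the suggestion concrete, here is a minimal sketch of keying reads by a 2-bit packed integer and tallying duplicates per unique sequence. The helper names `encode_2bit` and `duplicate_counts` are hypothetical; they are not part of preseq or kPAL. Reads containing ambiguous bases are simply skipped, which is an assumption made for illustration. Python integers are not a compact representation in themselves, so this only illustrates the keying idea, not the memory savings.

```python
from collections import Counter

# Hypothetical mapping for illustration: 2 bits per unambiguous base.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(seq):
    """Pack an ACGT string into one integer, 2 bits per base.

    Returns None if the read contains any other character (e.g. N).
    A leading sentinel bit keeps reads of different lengths distinct,
    so "A" and "AA" do not collapse to the same key.
    """
    value = 1  # sentinel bit
    for base in seq.upper():
        code = _CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def duplicate_counts(reads):
    """Return the number of copies of each unique read, largest first."""
    counts = Counter()
    for seq in reads:
        key = encode_2bit(seq)
        if key is not None:  # skip reads with ambiguous bases
            counts[key] += 1
    return sorted(counts.values(), reverse=True)
```

Calling `duplicate_counts` on an iterable of read sequences yields one count per unique sequence, which plays the role that per-position counts play for mapped reads.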

timydaley commented 6 years ago

Section 4 of the manual details how to process FASTQ files to obtain unique counts, which can then be used as input for preseq.
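
As a rough illustration of what "unique counts" from a FASTQ file can look like, the sketch below tallies exact-duplicate sequences and summarizes them as a histogram of (number of copies, number of distinct sequences with that many copies). The function names are hypothetical, the parser assumes plain four-line FASTQ records, and the tab-separated layout is an assumption; the manual remains the authority on the exact input format and options preseq expects.

```python
from collections import Counter


def fastq_sequences(handle):
    """Yield the sequence line of each record.

    Assumes unwrapped FASTQ: exactly four lines per record.
    """
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def counts_histogram(handle):
    """Return sorted (copies, n_distinct_sequences) pairs."""
    per_sequence = Counter(fastq_sequences(handle))
    histogram = Counter(per_sequence.values())
    return sorted(histogram.items())


def format_histogram(histogram):
    """Render the histogram as tab-separated lines (assumed layout)."""
    return "\n".join(f"{copies}\t{n}" for copies, n in histogram)
```

With an open file handle, `format_histogram(counts_histogram(handle))` produces text that can be written to a file and compared against the format described in the manual.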

timydaley commented 6 years ago

I should note that we don't usually trust using the full sequence, as a large proportion of sequenced reads are likely to contain errors. Each erroneous read looks like a new unique sequence, so the complexity is likely to be overestimated in the initial sample, and this will propagate to the extrapolation. Mapping the reads corrects this issue, though mapping is not suited to every application.
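
For a sense of scale, here is a back-of-the-envelope sketch of how many reads carry at least one error. It assumes a uniform per-base error rate and independent errors, which real sequencing data only approximates; the function name is hypothetical.

```python
def fraction_with_errors(per_base_error, read_length):
    """Probability that a read contains at least one erroneous base.

    Assumes errors are independent with the same rate at every
    position, so the error-free probability is (1 - e) ** L.
    """
    return 1.0 - (1.0 - per_base_error) ** read_length
```

Under these assumptions, a per-base error rate of 0.001 over 100-base reads gives about 9.5% of reads with at least one error, and most of those would be counted as distinct sequences under exact matching.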