smithlabcode / preseq

Software for predicting library complexity and genome coverage in high-throughput sequencing.
https://preseq.readthedocs.io
GNU General Public License v3.0
78 stars 16 forks source link

Question: Does two reads with opposite strand with same mapping location regarded as duplicates? #67

Open zztin opened 1 year ago

zztin commented 1 year ago

Hi Andrew,

I have a nanopore sequencing library where one molecule could result in reads on both forward or reverse strands. I would like to know if preseq regard these pairs as "distinct" observation, or as "repeated" observation?

The definition of "distinct" observation is not very clearly stated in the documentation. I was trying to find it in the implementation:

Related lines: https://github.com/smithlabcode/preseq/blob/c0fc953764029728d0c1e8fb7b33464b041794c5/src/load_data_for_complexity.cpp#L90 Related class: GenomicRegion.hpp

It seems like get_start() is getting the smaller coordinate of a bam read? If so, does it mean that both reverse and forward-mapped reads would be seen as the same molecule (same 'start') -- This is the desired behavior in my use case.

I would love to hear from you!

The command I used:

./preseq lc_extrap -B -o ./yield-estimates.txt <input-sorted-bam>

Software version: v2.0 downloaded precompiled binary

Best, Li-Ting

andrewdavidsmith commented 1 year ago

@zztin I think you are probably right about the meaning of that code. Frankly, it's a weak criterion, but it should work for generating big-picture complexity results. It would not be a great approach if the purpose were, e.g., to accurately identify SNVs. It might lose information that would help at some important locus. Also, I've only ever considered how this should work for short reads.

Unfortunately I don't think I can answer your question now and I will leave this issue open, but here's one thing that might help: the best way to use Preseq is by providing a counts histogram (or the counts themselves). That would allow you to count duplicates according to your own definition.

I'll try go address this in more detail soon.