Open zztin opened 1 year ago
@zztin I think you are probably right about the meaning of that code. Frankly, it's a weak criterion, but it should work for generating big-picture complexity results. It would not be a great approach if the purpose were, e.g., to accurately identify SNVs. It might lose information that would help at some important locus. Also, I've only ever considered how this should work for short reads.
Unfortunately I don't think I can answer your question now and I will leave this issue open, but here's one thing that might help: the best way to use Preseq is by providing a counts histogram (or the counts themselves). That would allow you to count duplicates according to your own definition.
I'll try go address this in more detail soon.
Hi Andrew,
I have a nanopore sequencing library where one molecule could result in reads on both forward or reverse strands. I would like to know if preseq regard these pairs as "distinct" observation, or as "repeated" observation?
The definition of "distinct" observation is not very clearly stated in the documentation. I was trying to find it in the implementation:
Related lines: https://github.com/smithlabcode/preseq/blob/c0fc953764029728d0c1e8fb7b33464b041794c5/src/load_data_for_complexity.cpp#L90 Related class: GenomicRegion.hpp
It seems like get_start() is getting the smaller coordinate of a bam read? If so, does it mean that both reverse and forward-mapped reads would be seen as the same molecule (same 'start') -- This is the desired behavior in my use case.
I would love to hear from you!
The command I used:
Software version: v2.0 downloaded precompiled binary
Best, Li-Ting