waveygang / wfmash

base-accurate DNA sequence alignments using WFA and mashmap3
MIT License

Don't use hypergeometric model for low-complexity segments. #215

Closed bkille closed 9 months ago

bkille commented 10 months ago

Don't use hypergeometric model for low-complexity segments, where a segment is "low-complexity" if the number of distinct kmers is less than 75% of the total number of kmers.
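The 75% rule can be illustrated with a small sketch. This is not wfmash's code: the function names are hypothetical, and it assumes complexity is measured over every overlapping kmer of the raw segment, without canonicalization or sketching, which may differ from what the mapper actually counts.

```python
def kmer_complexity(segment: str, k: int) -> float:
    """Return distinct kmers / total kmers for one segment.

    Assumption: all overlapping kmers are counted as plain strings.
    """
    total = len(segment) - k + 1
    if total <= 0:
        raise ValueError("segment must be at least k bases long")
    distinct = {segment[i:i + k] for i in range(total)}
    return len(distinct) / total


def is_low_complexity(segment: str, k: int, threshold: float = 0.75) -> bool:
    """A segment is low-complexity if fewer than 75% of its kmers are distinct."""
    return kmer_complexity(segment, k) < threshold
```

Under these assumptions a homopolymer run such as `"AAAAAAAA"` with `k=4` has one distinct kmer out of five and is flagged, while a segment whose kmers are mostly unique is not.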

For segments with low kmer complexity, we now examine all candidate windows from the L1 stage, even if their maximum predicted ANI cannot be higher than the threshold. This is because ANI predictions are noisier for low-complexity segments.
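A rough sketch of that change in candidate selection, under stated assumptions: `Candidate`, `max_predicted_ani`, and `windows_to_examine` are hypothetical names for illustration, not identifiers from wfmash, and the normal-case filter is assumed to keep candidates whose upper bound is at or above the threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical L1 candidate window."""
    start: int
    max_predicted_ani: float  # assumed upper bound on ANI from the L1 stage


def windows_to_examine(
    candidates: List[Candidate],
    ani_threshold: float,
    low_complexity: bool,
) -> List[Candidate]:
    """Select which L1 candidates are passed on for closer examination."""
    if low_complexity:
        # ANI predictions are noisy here, so no candidate is pruned.
        return list(candidates)
    # Otherwise prune candidates whose best possible ANI is below the threshold.
    return [c for c in candidates if c.max_predicted_ani >= ani_threshold]
```

The extra work comes only from the first branch, which is taken for the small fraction of segments flagged as low-complexity.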

Since only a small fraction of segments are low-complexity, the mapping stage now takes only about 10% more CPU time. Alternatively, low-complexity mappings can be skipped entirely with --kmer-complexity F.

ekg commented 9 months ago

What'd you decide or see here?

bkille commented 9 months ago

I'm closing this for now until I have a good example of it helping, and until I can show that it doesn't introduce additional overlapping alignments.