kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Poor likelihood score when terminal spanning k-mers have high abundance #349

standage closed this issue 5 years ago

standage commented 5 years ago

I've recently encountered a few calls that were assigned very poor likelihood scores even though they were true variants (short indels). Looking more closely at the abundances of the spanning k-mers, the reason was immediately clear. The abundances looked something like this:

The first k-mer is clearly not unique to the variant. It doesn't occur in the reference genome (or else it would have been discarded), but it has high abundance in all samples.

All spanning k-mers that are unique to the variant are typically kept for likelihood calculations, even those that don't satisfy the thresholds to be "interesting" k-mers. However, high-abundance terminal k-mers like the one above are extreme cases that may need to be handled more delicately.
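One possible way to handle these cases, sketched below, is to trim spanning k-mers from either end of the list while they are abundant in every control sample, before passing the rest to the likelihood calculation. This is only an illustration and not kevlar's actual code: the function name `trim_terminal_kmers`, the `ctrl_max` threshold, and the row layout `(case, control_1, control_2, ...)` are all assumptions, and the numbers in the usage example are made up.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: one row per spanning k-mer, in order along the variant,
# holding (case abundance, control_1 abundance, control_2 abundance, ...).
Abundances = Tuple[int, ...]


def trim_terminal_kmers(
    abunds: Sequence[Abundances], ctrl_max: int = 1
) -> List[Abundances]:
    """Drop k-mers at either end that are abundant in every control sample.

    `ctrl_max` is an assumed threshold: a k-mer is treated as suspect only if
    its abundance exceeds `ctrl_max` in all control samples.
    """

    def is_suspect(row: Abundances) -> bool:
        controls = row[1:]
        return len(controls) > 0 and all(c > ctrl_max for c in controls)

    start, end = 0, len(abunds)
    # Trim from the left end.
    while start < end and is_suspect(abunds[start]):
        start += 1
    # Trim from the right end.
    while end > start and is_suspect(abunds[end - 1]):
        end -= 1
    return list(abunds[start:end])


# Made-up abundances for illustration: the first k-mer is high in all samples.
example = [(31, 28, 30), (12, 0, 0), (11, 0, 1), (13, 0, 0)]
kept = trim_terminal_kmers(example, ctrl_max=1)
```

Trimming only from the ends keeps interior k-mers untouched, so a suspect k-mer in the middle of the variant would still count against the call. Whether that is the right trade-off is an open question.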