Poor likelihood score when terminal spanning k-mers have high abundance

I've recently encountered a few calls that were assigned a very poor likelihood score despite the fact that they were true variants (short indels). Looking more closely at the abundance of the spanning k-mers, the reason was immediately clear. The abundances looked something like this:

proband: 64,14,15,14,16,18,15,16,...
mother: 41,0,0,0,0,0,0,0,0,...
father: 38,0,0,1,0,0,1,0,...

The first k-mer is obviously not unique to the variant. It doesn't occur in the reference genome (or else it would have been discarded), but it has high abundance in all samples.

All spanning k-mers that are unique to the variant are typically kept for likelihood calculations, even those that don't satisfy the thresholds to be "interesting" k-mers. However, k-mers like the first one above, which are highly abundant in every sample including the parents, are an extreme case that may need to be handled more delicately.
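One possible way to handle this, sketched here only as an illustration: flag any spanning k-mer whose abundance in a control (parent) sample exceeds a small cutoff, and exclude it before computing the likelihood. The function name `flag_shared_kmers` and the `ctrl_max` parameter are hypothetical and not part of kevlar's API. The abundance lists contain only the values shown above; the elided remainder is left out.

```python
def flag_shared_kmers(controls, ctrl_max=1):
    """Return indices of spanning k-mers that look shared rather than novel.

    controls: list of per-sample abundance lists, one per control (parent).
    ctrl_max: hypothetical cutoff; a k-mer is flagged if any control
              abundance exceeds it.
    """
    num_kmers = len(controls[0])
    flagged = []
    for i in range(num_kmers):
        if any(ctrl[i] > ctrl_max for ctrl in controls):
            flagged.append(i)
    return flagged


# Only the abundances shown in this issue (the rest is elided).
proband = [64, 14, 15, 14, 16, 18, 15, 16]
mother = [41, 0, 0, 0, 0, 0, 0, 0]
father = [38, 0, 0, 1, 0, 0, 1, 0]

flagged = flag_shared_kmers([mother, father])

# Drop the flagged positions from every sample before scoring.
kept_proband = [a for i, a in enumerate(proband) if i not in flagged]
kept_mother = [a for i, a in enumerate(mother) if i not in flagged]
kept_father = [a for i, a in enumerate(father) if i not in flagged]
```

With a cutoff of 1, only the first k-mer is flagged here; the father's occasional abundance of 1 (plausibly sequencing error) is tolerated. Whether a fixed cutoff is the right criterion, or whether it should only apply to terminal k-mers, is an open design question.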
