iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

variantKmers script is not pulling out all the right kmers #55

Closed iqbal-lab closed 7 years ago

iqbal-lab commented 7 years ago

Placeholder for meeting with @sm0179 and @rffrancon on Monday (at which point I will modify this text). Sorina spotted that this script is somehow missing kmers: as k increases, fewer reads get mapped when using a .precalc file of kmers overlapping variants.

sm0179 commented 7 years ago

Issue https://github.com/iqbal-lab-org/gramtools/issues/44 refers to the same thing.

iqbal-lab commented 7 years ago

Ah yes, I see. The reason for this, @rffrancon, is that the exact mapper is currently very simple: it starts at the end of the read and keeps going. So if the end of the read does not overlap a variant, the kmer at the end of the read is not in the .precalc file/hash, and the quasimapper discards the read.
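To make the failure mode concrete, here is a minimal sketch in Python. It is not the gramtools code: the function names, the toy reference and the single variant position are all hypothetical, and it works on a plain linear sequence with no alternate alleles. It only illustrates why a read that covers a variant can still be dropped when its final kmer lies outside the variant.

```python
def kmers_overlapping_site(reference, site, k):
    """Collect every k-mer of `reference` that covers position `site` (0-based).

    Hypothetical stand-in for the contents of a .precalc file.
    """
    first = max(0, site - k + 1)
    last = min(site, len(reference) - k)
    return {reference[s:s + k] for s in range(first, last + 1)}


def is_discarded(read, precalc_kmers, k):
    """Mimic a mapper that looks only at the final k-mer of the read."""
    if len(read) < k:
        return True
    return read[-k:] not in precalc_kmers


# Toy data: a 12-base reference with an assumed variant at position 5.
reference = "AAACCCGGGTTT"
precalc = kmers_overlapping_site(reference, site=5, k=3)

# This read covers the variant, but its last 3-mer ("GGT") lies beyond it.
read_past_variant = "ACCCGGGT"
# This read ends inside the variant region, so its last 3-mer ("CCG") is known.
read_ending_on_variant = "AAACCCG"

dropped = is_discarded(read_past_variant, precalc, 3)
kept = not is_discarded(read_ending_on_variant, precalc, 3)
```

Both reads overlap the variant, yet only the one whose end falls on it survives the lookup.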

iqbal-lab commented 7 years ago

An alternative solution is for the quasimapper to check all the kmers in the read, and discard the read only if none of them are in the hash. But I think it makes more sense to add a few more kmers to the hash.
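The two options can be compared with a rough Python sketch. This is not the gramtools implementation: the helper names, the toy reference, the variant position and the `read_len` parameter are assumptions made for illustration, and the "few more kmers" rule used here (every kmer that could be the final kmer of a read of length `read_len` covering the variant) is one possible reading of the idea, on a linear sequence only.

```python
def any_kmer_in_hash(read, precalc_kmers, k):
    """Option 1: scan every k-mer of the read, keep it if any one is known."""
    return any(read[i:i + k] in precalc_kmers
               for i in range(len(read) - k + 1))


def extended_kmers(reference, site, k, read_len):
    """Option 2: widen the k-mer set so that the final k-mer of any read of
    length `read_len` covering `site` (0-based) is included.

    Assumed rule, for illustration only.
    """
    first = max(0, site - k + 1)
    last = min(site + read_len - k, len(reference) - k)
    return {reference[s:s + k] for s in range(first, last + 1)}


# Toy data: assumed variant at position 5 of a 12-base reference.
reference = "AAACCCGGGTTT"
narrow = {"CCC", "CCG", "CGG"}  # 3-mers covering position 5 only
wide = extended_kmers(reference, site=5, k=3, read_len=8)

read = "ACCCGGGT"  # covers the variant, ends past it

found_by_scan = any_kmer_in_hash(read, narrow, 3)  # option 1 keeps the read
final_kmer_in_wide = read[-3:] in wide             # option 2 keeps the read
```

Option 1 costs a lookup per kmer for every read, whereas option 2 pays once when the hash is built and leaves the per-read check at a single lookup.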

sm0179 commented 7 years ago

Yes, although that would mean slow quasimapping times for reads that end in kmers that aren't in the hash.

iqbal-lab commented 7 years ago

Yes, that's why I said it makes more sense to add a few more kmers to the hash.

ffranr commented 7 years ago

Currently testing the new implementation.

ffranr commented 7 years ago

I've written tests for this feature, and I think it works correctly.