refresh-bio / KMC

Fast and frugal disk based k-mer counter

bug? hard masking #106

Closed notestaff closed 5 years ago

notestaff commented 5 years ago

When masking a fasta file with a database of 17-mers, some sequences become all Ns with short islands of non-Ns. These islands are sometimes much shorter than 17 bases. How is that possible? If only 17-mers from the database are left unmasked, shouldn't any sequence of adjacent non-Ns be at least 17 bases long? @marekkokot

marekkokot commented 5 years ago

Well, let me explain with a small example. Let's suppose k=3 and we have a read ACCTACG. Now let's assume only ACC and ACG are incorrect (not in the database) and the remaining k-mers are valid. Then each symbol of ACC and ACG is converted to 'N', but as CCT, CTA and TAC are valid, there is no reason to convert the middle symbol 'T' into 'N'. So the result is NNNTNNN. Do you think it should work differently? How?
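The rule described in this comment can be reproduced with a short Python sketch. This is not KMC's actual code: the function names are hypothetical, reverse complements are ignored, and `mask_invalid_kmers` only follows the behaviour as explained above. For contrast, `keep_valid_kmers` implements the rule the original question seems to expect, where a base survives only if some valid k-mer covers it, so every unmasked island is at least k bases long.

```python
def mask_invalid_kmers(read: str, valid_kmers: set, k: int) -> str:
    """Mask every base covered by at least one k-mer that is NOT valid.

    Hypothetical helper mirroring the behaviour described in the comment.
    """
    masked = list(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] not in valid_kmers:
            masked[i:i + k] = "N" * k  # overwrite all k positions of this k-mer
    return "".join(masked)


def keep_valid_kmers(read: str, valid_kmers: set, k: int) -> str:
    """Keep only bases covered by at least one k-mer that IS valid.

    Hypothetical alternative rule; unmasked runs are never shorter than k.
    """
    masked = ["N"] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in valid_kmers:
            masked[i:i + k] = read[i:i + k]  # restore the original bases
    return "".join(masked)


valid = {"CCT", "CTA", "TAC"}  # ACC and ACG are treated as invalid
print(mask_invalid_kmers("ACCTACG", valid, 3))  # NNNTNNN
print(keep_valid_kmers("ACCTACG", valid, 3))    # NCCTACN
```

Under the first rule a base is masked as soon as any invalid k-mer touches it, which is why a single 'T' can survive on its own. Under the second rule the same read keeps the five-base island CCTAC.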

marekkokot commented 5 years ago

If you think it should work differently, please reopen this issue.