notestaff closed 5 years ago
Well, let me explain with a small example. Let's suppose k=3 and we have a read ACCTACG. Its 3-mers are ACC, CCT, CTA, TAC and ACG.
Now let's assume only ACC and ACG are incorrect and the remaining ones are valid.
Then each symbol of ACC and ACG is converted to 'N', but as CCT, CTA and TAC are valid, there is no reason to convert the middle symbol 'T' into 'N'.
So the result is NNNTNNN.
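The rule from the example above can be sketched in a few lines of Python. This is only an illustration of the behaviour as described here, not the tool's actual implementation; the function name `mask_read` and the use of a plain set as the k-mer database are assumptions made for the sketch.

```python
def mask_read(read, valid_kmers, k):
    """Replace with 'N' every base covered by at least one invalid k-mer.

    Hypothetical helper: `valid_kmers` stands in for the k-mer database.
    """
    masked = list(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] not in valid_kmers:
            # An invalid k-mer masks all k of its positions.
            masked[i:i + k] = "N" * k
    return "".join(masked)


# The example from this thread: ACC and ACG are invalid, the rest are valid.
valid = {"CCT", "CTA", "TAC"}
print(mask_read("ACCTACG", valid, 3))  # NNNTNNN
```

Under this rule a base survives when no invalid k-mer covers it, which is not the same as keeping only bases that belong to a valid k-mer. The 'T' here is covered only by CCT, CTA and TAC, so it stays, while its neighbours are masked by ACC and ACG. That is how an unmasked island can be shorter than k.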
Do you think it should work differently? How?
If it should work differently, please reopen this issue.
When masking a fasta file with a database of 17-mers, some sequences become all Ns with short islands of non-Ns. These islands are sometimes much shorter than 17 bases. How is that possible? If only 17-mers from the database are left unmasked, shouldn't any sequence of adjacent non-Ns be at least 17 bases long? @marekkokot