lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

cutN penalty to identify all Ns? #187

Open nikostr opened 2 years ago

nikostr commented 2 years ago

I'd like to use cutN to identify every region of Ns in my sequence. If I'm reading the code correctly, a region of Ns is terminated when the score becomes negative, where score = (number of Ns) - penalty * (number of non-Ns). A penalty of zero gives a single region running from the first N to the end of the sequence, and small penalties cause neighbouring regions of Ns to be merged, with the intervening non-N sequence discarded. To get exact regions of Ns, the penalty has to be larger than the number of contiguous Ns preceding the first non-N. Am I understanding this correctly?

Would it make sense to have a way of explicitly extracting all contiguous regions of Ns? This could perhaps be done by reserving special penalty values (e.g. 0 or 1000000000), or by adding a flag to support this behavior.
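To make the request concrete, here is a minimal Python sketch of the output I have in mind: every contiguous run of Ns reported as its own interval, with no scoring or merging involved. This is not seqtk code; the function name `n_runs`, the `min_len` parameter and the BED-like 0-based, half-open coordinates are my own assumptions for illustration.

```python
import re
from typing import Iterator, Tuple


def n_runs(seq: str, min_len: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) intervals of contiguous N/n runs.

    Hypothetical helper for illustration only; it is not part of seqtk.
    """
    for match in re.finditer(r"[Nn]+", seq):
        # Keep only runs that are at least min_len bases long.
        if match.end() - match.start() >= min_len:
            yield match.start(), match.end()


if __name__ == "__main__":
    # Toy sequence with three separate runs of Ns.
    seq = "ACGTNNNNACNNGTN"
    for start, end in n_runs(seq):
        # "seq1" is a placeholder sequence name.
        print(f"seq1\t{start}\t{end}")
```

With this approach adjacent runs are never merged, no matter how short the non-N stretch between them is, which is the behavior I currently try to approximate by choosing a very large penalty.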