onecodex / needletail

Fast FASTX parsing and k-mer methods in Rust
MIT License
174 stars, 20 forks

add `num_gaps` function for record #67

Closed. wjwei-handsome closed this 1 year ago

wjwei-handsome commented 1 year ago

Hi, for some statistics on FASTA/FASTQ files, counting the gaps in a sequence ('n' or 'N'; possibly '-' as well, but that is not considered here) can be necessary.

So I simply added such a function, along with a test case.

Best wishes.

wjwei-handsome commented 1 year ago

Hi, I have fixed the clippy check warnings.

audy commented 1 year ago

Hi @wjwei-handsome, thanks for your contribution. The role of the needletail library is limited to parsing fastx files and generating kmers. Interpretation of the sequences in those files is best left to the user (or another library) due to the diversity of encoding choices used in bioinformatics. For example, encoding gaps as - is just a convention, and N can mean "any base" or "Asparagine" (at least if you're dealing with IUPAC standards).

Also, I suggest using a regular expression or some other means to count characters in the sequence, rather than iterating over the sequence twice, as this can be inefficient, especially for large sequences.
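A minimal sketch of the single-pass counting suggested here, done on the user side with only the standard library (a plain byte scan rather than a regular expression). The name `count_symbols` is hypothetical and not part of needletail. The set of symbols to count is left to the caller because, as noted above, how gaps are encoded is a convention.

```rust
/// Count how many bytes of `seq` appear in `symbols`, walking `seq` once.
/// `count_symbols` is a hypothetical helper for illustration; it is not a needletail API.
fn count_symbols(seq: &[u8], symbols: &[u8]) -> usize {
    seq.iter()
        // `symbols` is expected to be tiny (e.g. b"Nn"), so a linear lookup per byte is cheap.
        .filter(|&&b| symbols.contains(&b))
        .count()
}
```

For example, `count_symbols(b"ACGTNNnn-", b"Nn")` counts only the `N` and `n` bytes, while passing `b"Nn-"` would also count the dash. The input can be any byte slice, such as the sequence bytes of a parsed record.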

wjwei-handsome commented 1 year ago

Thanks, I overlooked that. My fault!

Looking forward to the update :)