onecodex / needletail

Fast FASTX parsing and k-mer methods in Rust
MIT License
173 stars 18 forks source link

Add a new iterator to skip "bad" bases without canonicalization #40

Open apcamargo opened 4 years ago

apcamargo commented 4 years ago

Hi Roderick!

I'm developing a k-mer counting Python package for internal usage and I'm using needletail as a backend. While developing it, I noticed that Kmers and CanonicalKmers are inconsistent regarding non-ATCG characters. While Kmers count them, they are skipped by CanonicalKmers (understandably so).

Because of that, my function only uses CanonicalKmers even when counting non-canonical k-mers (I just reverse complement the sequence if canonical boolean is true), which causes additional computational burden.

I don't know if this decision was made by design, but maybe Kmers should include an argument that allows the user to choose whether non-ATCG characters should be ignored.

Thank you for all your work in needletail!

bovee commented 4 years ago

Hi Antônio! You're welcome and thanks for opening this issue.

I think you're right that there's a gap between CanonicalKmers (which fits some of our specific use-cases) and Kmers (which is a much more general method that can feed into many different bioinformatics use-cases). I would hesitate to change the API on Kmers and I think CanonicalKmers is also a bit tricky to change because of how the reverse complement is passed as an argument, but I think adding another iterator would be completely appropriate?

Perhaps a SkipKmers struct with a new(buffer: &'a [u8], strict: bool, nucleic: bool, k: u8) method and Iterator::Item is (usize, &'a [u8]) would be the right API? strict could act as a switch for whether to allow non-ACGT IUPAC bases (maybe it should be called iupac instead?) and nucleic would determine which alphabet to use (nucleic vs amino acid).

apcamargo commented 4 years ago

Hi @bovee

Yes, I think a SkipKmers with the options you described would pair nicely with CanonicalKmers without changing the current API.

I'm unsure whether nucleic is a good name for an option that may trigger amino acid k-mer counting. Maybe alphabet would be a better option, but then it wouldn't make sense for it to be a bool.

bovee commented 4 years ago

I like the idea of an alphabet parameter replacing the two above, but I agree that it would make sense to introduce some kind of Enum for that. I've avoided that previously because that kind of construct can devolve into kind of a mess (e.g. BioPython's: https://biopython.org/DIST/docs/api/Bio.Alphabet-module.html ), but if the only values were Nucleic, IupacNucleic, and AminoAcid (with maybe something around gapped or DNA/RNA specific alphabets if necessary) I think that'd be reasonable.