Open apcamargo opened 4 years ago
Hi Antônio! You're welcome and thanks for opening this issue.
I think you're right that there's a gap between CanonicalKmers
(which fits some of our specific use-cases) and Kmers
(which is a much more general method that can feed into many different bioinformatics use-cases). I would hesitate to change the API on Kmers
and I think CanonicalKmers
is also a bit tricky to change because of how the reverse complement is passed as an argument, but I think adding another iterator would be completely appropriate?
Perhaps a SkipKmers
struct with a new(buffer: &'a [u8], strict: bool, nucleic: bool, k: u8)
method and Iterator::Item
is (usize, &'a [u8])
would be the right API? strict
could act as a switch for whether to allow non-ACGT IUPAC bases (maybe it should be called iupac
instead?) and nucleic
would determine which alphabet to use (nucleic vs amino acid).
Hi @bovee
Yes, I think a SkipKmers
with the options you described would pair nicely with CanonicalKmers
without changing the current API.
I'm unsure whether nucleic
is a good name for an option that may trigger amino acid k-mer counting. Maybe alphabet
would be a better option, but then it wouldn't make sense for it to be a bool
.
I like the idea of an alphabet
parameter replacing the two above, but I agree that it would make sense to introduce some kind of Enum for that. I've avoided that previously because that kind of construct can devolve into kind of a mess (e.g. BioPython's: https://biopython.org/DIST/docs/api/Bio.Alphabet-module.html ), but if the only values were Nucleic
, IupacNucleic
, and AminoAcid
(with maybe something around gapped or DNA/RNA specific alphabets if necessary) I think that'd be reasonable.
Hi Roderick!
I'm developing a k-mer counting Python package for internal usage and I'm using needletail as a backend. While developing it, I noticed that
Kmers
andCanonicalKmers
are inconsistent regarding non-ATCG characters. WhileKmers
count them, they are skipped byCanonicalKmers
(understandably so).Because of that, my function only uses
CanonicalKmers
even when counting non-canonical k-mers (I just reverse complement the sequence ifcanonical
boolean is true), which causes additional computational burden.I don't know if this decision was made by design, but maybe
Kmers
should include an argument that allows the user to choose whether non-ATCG characters should be ignored.Thank you for all your work in needletail!