Open notestaff opened 7 years ago
An interesting idea. May I ask what the "use case" for this behaviour would be?
Re: use case, when assembling genomes we sometimes end up with SNPs, represented using ambiguity codes. It is then useful to be able to get the set of kmers in all possible strains of the virus.
First, it would be useful to document in the docs what exactly happens now, when a letter other than TCGA is encountered.
I can see how that would be useful. The main issue is the "blowout" of kmers you get (If it's a N
you will get 4*K extra kmers), plus how do you "count" them exactly. But for viruses it might be feasible.
My guess is that any k-mers with non-AGTC are simply ignored. Here's a 50 bp read, one with no N, one with N at position 26, and using k=25:
@seq
ATGCGCTAGATGCTCTTCTAGCGATTATGCAAAGTCGGATAGATGCTGAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
No. of k-mers below min. threshold : 26
No. of k-mers above max. threshold : 0
No. of unique k-mers : 26
No. of unique counted k-mers : 0
Total no. of k-mers : 26
Total no. of reads : 1
@seqN
ATGCGCTAGATGCTCTTCTAGCGATNATGCAAAGTCGGATAGATGCTGAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
No. of k-mers below min. threshold : 1
No. of k-mers above max. threshold : 0
No. of unique k-mers : 1
No. of unique counted k-mers : 0
Total no. of k-mers : 1
Total no. of reads : 1
Can just have a limit, for each kmer with ambiguities include up to X concretizations. And simply ignore kmers with more than a given number of Ns.
E.g. for protein-coding regionsI might add ambiguities for 3rd base of codons, or other less-conserved bases, to account for kmers I might expect to see.
For viruses it may be feasible just to enumerate the kmers yourself with a simple perl/python script and store them all in a dict/hash, and avoid kmc altogether?
With the 3-rd base option, you are essentually asking for a generic "spaced-seed k-mer" option which would be quite nice (pioneered by PatternHunter >15 years ago, and now used in discontinuous-megablast), but will slow the software down somewhat.
Symbols other than ACGT (or lower case) are indeed ignored. Besides the "blowout" of k-mers there are other difficulties, that are related to minimizer and super k-mer based approach used in KMC. Maybe we will figure something, but for now I don't have an idea how it can be added easli. Writing own script is some option, but it may be useful to store k-mers in KMC database format to be able to perform some operations with kmc_tools. One option is to implement completely different k-mer counting approach in KMC for small input. In such a case it may be possible to not distribute k-mers based on minimizers to bins, but just "virtually" create only one bin in memory and sort its k-mers. Then implementing this feature would be quite easy (I hope :) ). Unfortunately, I think this approach will not be added to KMC in the nearest future.
When a kmer in a read includes IUPAC ambiguity codes, add all possible concretizations of that kmer.