refresh-bio / KMC

Fast and frugal disk based k-mer counter
273 stars 72 forks source link

feature request: support IUPAC ambiguity codes #29

Open notestaff opened 7 years ago

notestaff commented 7 years ago

When a kmer in a read includes IUPAC ambiguity codes, add all possible concretizations of that kmer.

tseemann commented 6 years ago

An interesting idea. May I ask what the "use case" for this behaviour would be?

notestaff commented 6 years ago

Re: use case, when assembling genomes we sometimes end up with SNPs, represented using ambiguity codes. It is then useful to be able to get the set of kmers in all possible strains of the virus.

First, it would be useful to document in the docs what exactly happens now, when a letter other than TCGA is encountered.

tseemann commented 6 years ago

I can see how that would be useful. The main issue is the "blowout" of kmers you get (If it's a N you will get 4*K extra kmers), plus how do you "count" them exactly. But for viruses it might be feasible.

tseemann commented 6 years ago

My guess is that any k-mers with non-AGTC are simply ignored. Here's a 50 bp read, one with no N, one with N at position 26, and using k=25:

@seq
ATGCGCTAGATGCTCTTCTAGCGATTATGCAAAGTCGGATAGATGCTGAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

   No. of k-mers below min. threshold :           26
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :           26
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :           26
   Total no. of reads                 :            1
@seqN
ATGCGCTAGATGCTCTTCTAGCGATNATGCAAAGTCGGATAGATGCTGAT
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH

   No. of k-mers below min. threshold :            1
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            1
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :            1
   Total no. of reads                 :            1
notestaff commented 6 years ago

Can just have a limit, for each kmer with ambiguities include up to X concretizations. And simply ignore kmers with more than a given number of Ns.

notestaff commented 6 years ago

E.g. for protein-coding regionsI might add ambiguities for 3rd base of codons, or other less-conserved bases, to account for kmers I might expect to see.

tseemann commented 6 years ago

For viruses it may be feasible just to enumerate the kmers yourself with a simple perl/python script and store them all in a dict/hash, and avoid kmc altogether?

With the 3-rd base option, you are essentually asking for a generic "spaced-seed k-mer" option which would be quite nice (pioneered by PatternHunter >15 years ago, and now used in discontinuous-megablast), but will slow the software down somewhat.

marekkokot commented 6 years ago

Symbols other than ACGT (or lower case) are indeed ignored. Besides the "blowout" of k-mers there are other difficulties, that are related to minimizer and super k-mer based approach used in KMC. Maybe we will figure something, but for now I don't have an idea how it can be added easli. Writing own script is some option, but it may be useful to store k-mers in KMC database format to be able to perform some operations with kmc_tools. One option is to implement completely different k-mer counting approach in KMC for small input. In such a case it may be possible to not distribute k-mers based on minimizers to bins, but just "virtually" create only one bin in memory and sort its k-mers. Then implementing this feature would be quite easy (I hope :) ). Unfortunately, I think this approach will not be added to KMC in the nearest future.