Closed peterk87 closed 5 years ago
@peterk87, @glabbe, @dankein:
I have done a few tests and have a decent function to return all combinations of bases in an expansion. It still has a high time complexity but it no longer has the same space complexity.
I've tested it using the timeit python module for 33-mers up to n=13 (13 N bases in the k-mer). For k-mers with 3 - 6 degenerate N's in them, the code is about 17x faster.
Code is:
from itertools import product
d = {
'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}
def expand_degenerate_bases(seq):
"""return a list of all possible k-mers given a degenerate base"""
return list(map("".join, product(*map(d.get, seq))))
Here, all degenerate bases were 'N' and they were substituted into a string of 'A' bases so test ones's k-mer was: 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
and test two's k-mer was: 'AAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAA'
Thanks @DarianHole! That's a really nice and elegant solution. 👍
For highly degenerate kmers we may need to investigate alternate string search algorithms that are better suited for that use case. I think for now for the user's sake it may be a good idea to not allow running schemes that expand to over a few million sequences. What do you guys think?
Nice work @DarianHole, and I agree with @peterk87 that we should not allow schemes expanding well over a million sequences. The tool is meant to be used with highly clonal organisms: for those that are not highly clonal, a MLST-based approach should be used instead.
@peterk87 @glabbe I've added the expansion function to my local biohansel installation and it works for the test dataset I created. I did have to nest another for loop so I'll look for another way to do it. In the mean time, here is the small addition I made to the Aho-Corasick init:
def expand_degenerate_bases(seq):
"""List all possible kmers for a scheme given a degenerate base
Args:
Scheme_kmers from SNV scheme fasta file
Returns:
List of all possible kmers given a degenerate base or not
"""
d = {
'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}
return list(map("".join, product(*map(d.get, seq))))
def init_automaton(scheme_fasta):
"""Initialize Aho-Corasick Automaton with kmers from SNV scheme fasta
Args:
scheme_fasta: SNV scheme fasta file path
Returns:
Aho-Corasick Automaton with kmers loaded
"""
A = Automaton()
for header, sequence in parse_fasta(scheme_fasta):
kmer_list = expand_degenerate_bases(sequence)
for seq in kmer_list:
A.add_word(seq, (header, seq, False))
A.add_word(revcomp(seq), (header, seq, True))
A.make_automaton()
return A
I also think that 5 or 6 should be around the maximum but it really depends on the degenerate base used. n^6 = 4096 possible k-mers while using the degenerate N base. If say you only use R (A,G) you can have 12 degenerate R bases (2^12 = 4096) and have the same ~time investment~ complexity.
Do we want a check for the number of degenerate bases and cut off if its too high or should we just put a disclaimer saying that creating schemes with too many degenerate bases is going to work, but take forever for each run.
@DarianHole @peterk87 I like the idea of a disclaimer, and we could also include example runtimes for higher number of degenerate bases (or a graph with number of "N" in the scheme vs runtime using 1 core), so users will know what to expect?
@DarianHole That's exactly how I would have done it with the nested for loop. The code is very clear :+1: . My only suggestion would be to make the d
nucleotide expansion dict a global constant (maybe put it into bio_hansel/const.py
?) so it's only defined once rather than every time the expand_degenerate_bases
is called. It'd also help with being able to use the dict elsewhere.
A quick check on an user-specified custom scheme could be added just after command-line args are parsed to ensure that not too many degenerates would be created (e.g. total kmers < 100K or 1,000K; we could make this threshold a command-line parameter to allow users run really large kmer sets even if we'd advise against it). I'm imagining that the check would only ever be used on user-specified schemes rather than the ones built into biohansel. Does that sound right @glabbe ?
@glabbe I think a disclaimer is a great idea. We could even add a log warning "The scheme you specified includes that contain degenerate bases. Although your scheme contains X kmers, it will be expanded to X+Y kmer targets. This will effect run-time and memory usage." or something like that. I think it's a great idea to have some graphs in the docs showing how the number of total search kmer sequences affects run-time and memory usage.
@peterk87 yep, sounds good to me. I like the command-line parameter idea to manually increase a default max number of total k-mers. Our built-in schemes will have very few (0-5) degenerate bases I expect, based on feedback from @dankein. And they would be the degenerate bases representing just 2 possible nucleotides, not "N".
@peterk87 @glabbe I have a WIP update with the expansion and a check for the number of kmers being created.
I just need an idea on the following to finalize the check:
If a k-mer/tile/probe sequence contains one or more ambiguous bases, expand that sequence to all possible AGTC-only sequences using conventional IUPAC codes (https://www.bioinformatics.org/sms/iupac.html).
e.g.
DARN
would expand toExpanded tile sequences would be added to Aho-Corasick automaton in original and reverse complement order under the same tile info.
Some basic, inefficient Python code for doing degenerate base expansion:
where
expand_degenerate('DARN')
produces above list of possible nucleotide sequences.