phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0
25 stars 7 forks source link

Degenerate "probe" sequence expansion #60

Closed peterk87 closed 5 years ago

peterk87 commented 5 years ago

If a k-mer/tile/probe sequence contains one or more ambiguous bases, expand that sequence to all possible AGTC-only sequences using conventional IUPAC codes (https://www.bioinformatics.org/sms/iupac.html).

e.g. DARN would expand to

['AAAA',
 'AAAC',
 'AAAG',
 'AAAT',
 'AAGA',
 'AAGC',
 'AAGG',
 'AAGT',
 'GAAA',
 'GAAC',
 'GAAG',
 'GAAT',
 'GAGA',
 'GAGC',
 'GAGG',
 'GAGT',
 'TAAA',
 'TAAC',
 'TAAG',
 'TAAT',
 'TAGA',
 'TAGC',
 'TAGG',
 'TAGT']

Expanded tile sequences would be added to Aho-Corasick automaton in original and reverse complement order under the same tile info.

Some basic, inefficient Python code for doing degenerate base expansion:

biggy = {'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}

def _expand(s):
    return [biggy[x] if x in biggy else [x] for x in s]

def _trav(nt_lists, idx=0, acc=None, out=None):
    if acc is None:
        acc = []
    if out is None:
        out = []
    if idx >= len(nt_lists):
        out.append(acc)
        return

    for x in nt_lists[idx]:
        _trav(nt_lists, idx+1, acc+[x], out)
    return out

def expand_degenerate(seq):
    return [''.join(x) for x in _trav(_expand(seq))]

where expand_degenerate('DARN') produces above list of possible nucleotide sequences.

DarianHole commented 5 years ago

@peterk87, @glabbe, @dankein:

I have done a few tests and have a decent function to return all combinations of bases in an expansion. It still has a high time complexity but it no longer has the same space complexity.

I've tested it using the timeit python module for 33-mers up to n=13 (13 N bases in the k-mer). For k-mers with 3 - 6 degenerate N's in them, the code is about 17x faster.

Code is:

from itertools import product

d = {
'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}

def expand_degenerate_bases(seq):
   """return a list of all possible k-mers given a degenerate base"""

   return list(map("".join, product(*map(d.get, seq))))

degenerate_base_times

Here, all degenerate bases were 'N' and they were substituted into a string of 'A' bases so test ones's k-mer was: 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'

and test two's k-mer was: 'AAAAAAAAAAAANAAAAAAAAAAAAAAAAAAAA'

peterk87 commented 5 years ago

Thanks @DarianHole! That's a really nice and elegant solution. 👍

For highly degenerate kmers we may need to investigate alternate string search algorithms that are better suited for that use case. I think for now for the user's sake it may be a good idea to not allow running schemes that expand to over a few million sequences. What do you guys think?

glabbe commented 5 years ago

Nice work @DarianHole, and I agree with @peterk87 that we should not allow schemes expanding well over a million sequences. The tool is meant to be used with highly clonal organisms: for those that are not highly clonal, a MLST-based approach should be used instead.

DarianHole commented 5 years ago

@peterk87 @glabbe I've added the expansion function to my local biohansel installation and it works for the test dataset I created. I did have to nest another for loop so I'll look for another way to do it. In the mean time, here is the small addition I made to the Aho-Corasick init:

def expand_degenerate_bases(seq):
    """List all possible kmers for a scheme given a degenerate base

    Args:
         Scheme_kmers from SNV scheme fasta file

    Returns:
         List of all possible kmers given a degenerate base or not
    """
    d = {
'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}

    return list(map("".join, product(*map(d.get, seq))))

def init_automaton(scheme_fasta):
    """Initialize Aho-Corasick Automaton with kmers from SNV scheme fasta

    Args:
        scheme_fasta: SNV scheme fasta file path

    Returns:
         Aho-Corasick Automaton with kmers loaded
    """
    A = Automaton()
    for header, sequence in parse_fasta(scheme_fasta):
        kmer_list = expand_degenerate_bases(sequence)
        for seq in kmer_list:
            A.add_word(seq, (header, seq, False))
            A.add_word(revcomp(seq), (header, seq, True))
    A.make_automaton()
    return A

I also think that 5 or 6 should be around the maximum but it really depends on the degenerate base used. n^6 = 4096 possible k-mers while using the degenerate N base. If say you only use R (A,G) you can have 12 degenerate R bases (2^12 = 4096) and have the same ~time investment~ complexity.

Do we want a check for the number of degenerate bases and cut off if its too high or should we just put a disclaimer saying that creating schemes with too many degenerate bases is going to work, but take forever for each run.

glabbe commented 5 years ago

@DarianHole @peterk87 I like the idea of a disclaimer, and we could also include example runtimes for higher number of degenerate bases (or a graph with number of "N" in the scheme vs runtime using 1 core), so users will know what to expect?

peterk87 commented 5 years ago

@DarianHole That's exactly how I would have done it with the nested for loop. The code is very clear :+1: . My only suggestion would be to make the d nucleotide expansion dict a global constant (maybe put it into bio_hansel/const.py?) so it's only defined once rather than every time the expand_degenerate_bases is called. It'd also help with being able to use the dict elsewhere.

A quick check on an user-specified custom scheme could be added just after command-line args are parsed to ensure that not too many degenerates would be created (e.g. total kmers < 100K or 1,000K; we could make this threshold a command-line parameter to allow users run really large kmer sets even if we'd advise against it). I'm imagining that the check would only ever be used on user-specified schemes rather than the ones built into biohansel. Does that sound right @glabbe ?

@glabbe I think a disclaimer is a great idea. We could even add a log warning "The scheme you specified includes that contain degenerate bases. Although your scheme contains X kmers, it will be expanded to X+Y kmer targets. This will effect run-time and memory usage." or something like that. I think it's a great idea to have some graphs in the docs showing how the number of total search kmer sequences affects run-time and memory usage.

glabbe commented 5 years ago

@peterk87 yep, sounds good to me. I like the command-line parameter idea to manually increase a default max number of total k-mers. Our built-in schemes will have very few (0-5) degenerate bases I expect, based on feedback from @dankein. And they would be the degenerate bases representing just 2 possible nucleotides, not "N".

DarianHole commented 5 years ago

@peterk87 @glabbe I have a WIP update with the expansion and a check for the number of kmers being created.

I just need an idea on the following to finalize the check: