For each sequence of length L, we can create 9L + 4 sequences (in a brute-force approach). If we keep only the unique sequences and eliminate those identical to the original sequence, we obtain a number of micro-variants m such that:
6L + 5 <= m <= 7L + 4
The number of possible micro-variants is variable. It is at its lowest for homopolymers (sequences made entirely of one nucleotide repeated L times) and at its highest for sequences free of homopolymers. For example, a homopolymer-free sequence has 3L substitutions, 4(L + 1) - L insertions, and L deletions: 7L + 4 in total. If the sequence is entirely made of the same nucleotide (a long homopolymer), the number of unique deletions drops to 1 while the number of unique insertions stays at 3L + 4, which gives 6L + 5. These numbers were verified by simulation and observed on real sequences.
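These bounds are easy to verify with a short brute-force sketch (a minimal illustration in python, not the code used in swarm; the helper name micro_variants is just a placeholder):

```python
def micro_variants(seq, alphabet="ACGT"):
    """Return the set of unique sequences that differ from seq by
    exactly one substitution, insertion or deletion."""
    variants = set()
    for i, base in enumerate(seq):
        # substitutions: 3 per position (never re-creating the original base)
        for b in alphabet:
            if b != base:
                variants.add(seq[:i] + b + seq[i + 1:])
        # deletions: L candidates, duplicates collapse inside homopolymers
        variants.add(seq[:i] + seq[i + 1:])
    # insertions: 4(L + 1) candidates, the set removes duplicates
    for i in range(len(seq) + 1):
        for b in alphabet:
            variants.add(seq[:i] + b + seq[i:])
    return variants

print(len(micro_variants("AAAAA")))  # homopolymer, L = 5: 35 = 6L + 5
print(len(micro_variants("ACGTA")))  # homopolymer-free, L = 5: 39 = 7L + 4
```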
A test on a larger dataset (2.3 million unique sequences, 129 bp on average) shows that the new swarm algorithm (python implementation) is more than 4 times faster than the current swarm algorithm (C++ implementation). It took 5 h vs. 23 h to finish (on one core)!
As the python implementation seems to behave linearly, that speed difference will increase for larger datasets. The largest 18S V9 dataset I have access to contains 32 million unique sequences. The projected duration is less than 4 days for the python implementation, whereas swarm C++ took more than 40 days to complete (on 16 cores). I also have access to a very large 16S dataset (a mix of 100 bp and 150 bp long sequences) containing more than 130 million unique sequences. As far as I know, it is not possible to cluster it with existing clustering algorithms. A C++ implementation of our new algorithm may be able to do it in a couple of days.
This idea is now implemented in SWARM v1.2.8 and is available with the option `-a`. Currently it does not use threads.
Parallelized it with threads today. Will push tomorrow after some more testing. It now does 9.5 million sequences (avg. 130 bp long) in 20 minutes on 4 cores. Extrapolation indicates that the set of 130 million sequences should take about 5 h. Not bad. :)
A more efficient parallelized version is now available in version 1.2.9. It clusters the 9.5 million sequences into 1,646,172 swarms in less than 8 minutes using 10 threads on a 16-core Linux machine. On my 4-core Mac it takes less than 11 minutes using 6 threads.
The new strategy is clearly a huge improvement when d = 1. The big question is whether it can also be exploited in some way when d > 1.
Extending directly to d = 2 would multiply computation time by approx. 7L (where L is the median sequence length), so it's probably prohibitive. We need to invent something new!
Yes, we need some other way to handle d > 1. Let's try out some ideas. Closing this issue now.
When d = 1, how many micro-variants can a sequence have?
At most 7L + 4, where L is the length of the seed.
A micro-variant is a sequence differing from the seed sequence by only one substitution, insertion or deletion. The number of possible micro-variants is surprisingly low, and it is reasonably fast to generate all possible sequences. If we store our fasta entries as a mapping of sequences to amplicon names and abundances, we can turn swarming into a problem of exact string matching.
It works as follows: parse the entire dataset and store it in a mapping structure. For a given seed sequence, produce all its unique micro-variants. Check if these micro-variants are in the mapping structure. If they are, add them to the swarm as sub-seeds and mark them as swarmed.
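To make the idea concrete, here is a rough python sketch of that loop (my own illustration, not the actual swarm.py; the dictionary layout and the micro_variants helper from the snippet above are assumptions):

```python
from collections import deque

def swarm_d1(amplicons):
    """amplicons: dict mapping sequence -> (amplicon name, abundance).
    Returns a list of swarms, each a list of amplicon names."""
    # take seeds in decreasing order of abundance
    ordered = sorted(amplicons, key=lambda s: amplicons[s][1], reverse=True)
    swarmed = set()
    swarms = []
    for seed in ordered:
        if seed in swarmed:
            continue
        swarm = [amplicons[seed][0]]
        swarmed.add(seed)
        queue = deque([seed])
        # breadth-first growth of the swarm via exact-match look-ups
        while queue:
            subseed = queue.popleft()
            # at most 7L + 4 look-ups per (sub-)seed
            for variant in micro_variants(subseed):
                if variant in amplicons and variant not in swarmed:
                    swarm.append(amplicons[variant][0])
                    swarmed.add(variant)
                    queue.append(variant)
        swarms.append(swarm)
    return swarms
```

With a hash-based dict, each look-up takes expected constant time; a tree-based map would give the O(n log n) behaviour discussed below.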
The complexity of the computation now depends on how fast the mapping structure can be queried, and how fast we can create the micro-variant sets. The micro-variant sets have to be created for all amplicons (9L + 4 operations per amplicon), and at most 7L + 4 look-ups are necessary per amplicon. Depending on the complexity of the look-ups, the global complexity could be O(n log n).
A pure python implementation is only 10 times slower than swarm (C++) on a dataset containing 77,693 unique sequences of average length 129 bp: 25 s for `swarm`, 240 s for `swarm.py`. I will run tests on much larger datasets to see if the new implementation outperforms the C++ implementation when n increases.