For each sequence of length L, we can create 9L + 4 sequences (in a brute-force approach). If we keep only the unique sequences and eliminate those identical to the original sequence, we obtain a number of micro-variants m such that:
6L + 5 <= m <= 7L + 4
The number of possible micro-variants is variable. It is at its lowest for homopolymers (sequences made entirely of one nucleotide repeated L times) and at its highest for sequences free of homopolymers. For example, a homopolymer-free sequence has 3L substitutions, 4(L + 1) - L insertions, and L deletions: 7L + 4 in total. If the sequence is entirely made of the same nucleotide (a long homopolymer), the number of unique deletions drops to 1 while the number of unique insertions stays at 3L + 4, which gives 6L + 5. These numbers were verified by simulation and observed on real sequences.
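These bounds are easy to verify with a short brute-force sketch (a minimal illustration in python, not the code used in swarm; the helper name micro_variants is just a placeholder):

```python
def micro_variants(seq, alphabet="ACGT"):
    """Return the set of unique sequences that differ from seq by
    exactly one substitution, insertion or deletion."""
    variants = set()
    for i, base in enumerate(seq):
        # substitutions: 3 per position (never re-creating the original base)
        for b in alphabet:
            if b != base:
                variants.add(seq[:i] + b + seq[i + 1:])
        # deletions: L candidates, duplicates collapse inside homopolymers
        variants.add(seq[:i] + seq[i + 1:])
    # insertions: 4(L + 1) candidates, the set removes duplicates
    for i in range(len(seq) + 1):
        for b in alphabet:
            variants.add(seq[:i] + b + seq[i:])
    return variants

print(len(micro_variants("AAAAA")))  # homopolymer, L = 5: 35 = 6L + 5
print(len(micro_variants("ACGTA")))  # homopolymer-free, L = 5: 39 = 7L + 4
```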
A test on a larger dataset (2.3 million unique sequences, 129 bp on average) shows that the new swarm algorithm (python implementation) is more than 4 times faster than the current swarm algorithm (C++ implementation). It took 5 h vs. 23 h to finish (on one core)!
As the python implementation seems to behave linearly, that speed difference will increase for larger datasets. The largest 18S V9 dataset I have access to contains 32 million unique sequences. The projected duration is less than 4 days for the python implementation, whereas swarm C++ took more than 40 days to complete (on 16 cores). I also have access to a very large 16S dataset (a mix of 100 bp and 150 bp long sequences) containing more than 130 million unique sequences. As far as I know, it is not possible to cluster it with existing clustering algorithms. A C++ implementation of our new algorithm may be able to do it in a couple of days.
This idea is now implemented in SWARM v1.2.8 and is available with the option `-a`. Currently it does not use threads.
Parallelized it with threads today. Will push tomorrow after some more testing. It now does 9.5 million sequences (avg. 130 bp long) in 20 minutes on 4 cores. Extrapolation indicates that the set of 130 million sequences should take about 5 h. Not bad. :)
A more efficient parallelized version is now available in version 1.2.9. It clusters the 9.5 million sequences into 1,646,172 swarms in less than 8 minutes using 10 threads on a 16-core Linux machine. On my 4-core Mac it takes less than 11 minutes using 6 threads.
The new strategy is clearly a huge improvement when d = 1. The big question is whether it can also be exploited in some way when d > 1.
Extending directly to d = 2 would multiply computation time by approx. 7L (where L is the median sequence length), so it's probably prohibitive. We need to invent something new!
Yes, we need some other way to handle d > 1. Let's try out some ideas. Closing this issue now.
When d = 1, how many micro-variants can a sequence have?
At most 7L + 4, where L is the length of the seed.
A micro-variant is a sequence differing from the seed sequence by only one substitution, insertion or deletion. The number of possible micro-variants is surprisingly low, and it is reasonably fast to generate all possible sequences. If we store our fasta entries as a mapping of sequences to amplicon names and abundances, we can turn swarming into a problem of exact string matching.
It works as follows: parse the entire dataset and store it in a mapping structure. For a given seed sequence, produce all its unique micro-variants. Check if these micro-variants are in the mapping structure. If they are, add them to the swarm as sub-seeds and mark them as swarmed.
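To make the idea concrete, here is a rough python sketch of that loop (my own illustration, not the actual swarm.py; the dictionary layout and the micro_variants helper from the snippet above are assumptions):

```python
from collections import deque

def swarm_d1(amplicons):
    """amplicons: dict mapping sequence -> (amplicon name, abundance).
    Returns a list of swarms, each a list of amplicon names."""
    # take seeds in decreasing order of abundance
    ordered = sorted(amplicons, key=lambda s: amplicons[s][1], reverse=True)
    swarmed = set()
    swarms = []
    for seed in ordered:
        if seed in swarmed:
            continue
        swarm = [amplicons[seed][0]]
        swarmed.add(seed)
        queue = deque([seed])
        # breadth-first growth of the swarm via exact-match look-ups
        while queue:
            subseed = queue.popleft()
            # at most 7L + 4 look-ups per (sub-)seed
            for variant in micro_variants(subseed):
                if variant in amplicons and variant not in swarmed:
                    swarm.append(amplicons[variant][0])
                    swarmed.add(variant)
                    queue.append(variant)
        swarms.append(swarm)
    return swarms
```

With a hash-based dict, each look-up takes expected constant time; a tree-based map would give the O(n log n) behaviour discussed below.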
The complexity of the computation now depends on how fast the mapping structure can be queried, and how fast we can create the micro-variant sets. The micro-variant sets have to be created for all amplicons (9L + 4 operations per amplicon), and at most 7L + 4 look-ups are necessary per amplicon. Depending on the complexity of the look-ups, the global complexity could be O(n log n).
A pure python implementation is only 10 times slower than swarm (C++) on a dataset containing 77,693 unique sequences of average length 129 bp: 25 s for `swarm`, 240 s for `swarm.py`. I will run tests on much larger datasets to see if the new implementation outperforms the C++ implementation when n increases.