torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Speed? #87

Closed davidvilanova closed 7 years ago

davidvilanova commented 8 years ago

Hello, I have a dataset with 230 million amplicons (average length 160 nt). I'm using the d=3 option with 10 threads. Do you have an estimate of the time required? What about the time for d=1?

Thanks,

frederic-mahe commented 8 years ago

Hi David,

do you have 230 million unique sequences in your dataset, or do you have x million sequences representing 230 million amplicons?

With d > 1, I don't have precise benchmarks, but it will probably take a while (several weeks?). We found a way to make d > 1 ten times faster, but we still have to implement that solution. And even then, swarm, or any other clustering solution for that matter, will probably take several days to cluster a many-million set of sequences.

Meanwhile, I don't know what kind of data you are working with, but in my own experience with 18S/16S, COI, ITS1/ITS2, and microsat markers, the options -d 1 -f (d = 1, with the fastidious option) give very satisfying results. And it is amazingly fast!
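(For the curious: the reason d = 1 is so much faster is that swarm can enumerate every sequence of an amplicon's "microvariants", i.e. all sequences one substitution, insertion, or deletion away, and look them up in a hash table of the dataset, instead of comparing all pairs. A minimal Python sketch of that enumeration; the function name `microvariants` is mine, not swarm's:)

```python
def microvariants(seq, alphabet="ACGT"):
    """Return all sequences exactly one edit (substitution,
    insertion, or deletion) away from `seq`."""
    variants = set()
    for i in range(len(seq)):
        # deletion of position i
        variants.add(seq[:i] + seq[i + 1:])
        # substitution at position i
        for c in alphabet:
            if c != seq[i]:
                variants.add(seq[:i] + c + seq[i + 1:])
    for i in range(len(seq) + 1):
        # insertion before position i (and after the last position)
        for c in alphabet:
            variants.add(seq[:i] + c + seq[i:])
    variants.discard(seq)  # an insertion/deletion pair can recreate seq
    return variants

print(len(microvariants("AC")))  # 18 distinct one-edit neighbours
```

For a sequence of length L over a 4-letter alphabet this is only on the order of 7L variants, so each amplicon costs O(L^2) hash lookups rather than a pairwise comparison against the whole dataset. This is why d = 1 scales to hundreds of millions of sequences while d > 1 does not.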

Here are two benchmarks comparing swarm (d = 1, with and without the fastidious option) and vsearch (97%) for wall-clock time and memory consumption. Please note the difference in vertical scale.

Best, tara_v9_264_samples.subsampling_swarm.pdf tara_v9_264_samples.subsampling_vsearch.pdf

davidvilanova commented 8 years ago

230 million sequences after vsearch dereplication. It's a HiSeq 2x90 bp 16S experiment. It was not properly designed, so I get too many amplicons, 4 million per sample even after data filtering... Thinking of subsetting the dataset. Thanks for the info.
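(Aside: full-length dereplication, which is what `vsearch --derep_fulllength` does, simply collapses identical reads and records their abundances. A minimal Python sketch of the idea, not vsearch's actual implementation:)

```python
from collections import Counter

def dereplicate(reads):
    """Collapse identical sequences into (sequence, abundance) pairs,
    sorted by decreasing abundance (ties broken alphabetically)."""
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA"]
print(dereplicate(reads))  # [('ACGT', 3), ('TTGA', 2)]
```

Swarm expects abundance-annotated input (e.g. `;size=N` headers from `vsearch --sizeout`), so dereplication is the standard step before clustering.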

frederic-mahe commented 8 years ago

The largest dataset I've dealt with contains 315 million unique sequences. It took 7-9 hours to cluster it with swarm -d 1 -f. So, swarm should be able to deal with your dataset without subsetting.

Up to you.

PS: if the fastidious option requires too much memory, don't forget you can use the --ceiling option.
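(For illustration, a command line putting these options together; file names are placeholders, and `--ceiling` takes a memory cap in megabytes for the fastidious phase:)

```shell
# cluster dereplicated amplicons at d = 1 with the fastidious option,
# capping fastidious memory use at ~8 GB; -z reads usearch/vsearch-style
# ";size=N" abundance annotations
swarm -d 1 -f --ceiling 8192 -t 10 -z \
      -o amplicons.swarms -s amplicons.stats \
      derep.fasta
```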