torognes / swarm

A robust and fast clustering method for amplicon-based studies

Clustering of a huge amount of reads #156

Closed · mariabernard closed this issue 3 years ago

mariabernard commented 4 years ago

Dear all,

I am trying to cluster a huge number of sequences: 25 million unique sequences (corresponding to 6 different MiSeq runs). My protocol is to perform the clustering in two steps: swarm with d = 1, then swarm with d = 3, roughly as in the sketch below.
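For reference, a minimal sketch of that two-step protocol, assuming the d = 3 round is run on the representative sequences produced by the d = 1 round (the file names are hypothetical; the flags are standard swarm options):

```python
# Two-step swarm clustering: d = 1, then re-cluster representatives at d = 3.
# File names are hypothetical; swarm must be on the PATH.
import subprocess

# Step 1: cluster at d = 1 and write cluster representatives.
subprocess.run(
    ["swarm", "-d", "1", "-t", "8",
     "-o", "step1.swarms", "-w", "step1_representatives.fasta",
     "unique_sequences.fasta"],
    check=True,
)

# Step 2: re-cluster the d = 1 representatives at d = 3.
subprocess.run(
    ["swarm", "-d", "3", "-t", "8",
     "-o", "step2.swarms", "-w", "step2_representatives.fasta",
     "step1_representatives.fasta"],
    check=True,
)
```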

My problem is that, with this number of sequences, it seems to take a nearly infinite amount of time. I tried 3 different versions of swarm.

Clustering with d = 1 took:

It resulted in 17 million clusters. The second clustering step has been running for 16 days.

Do you have any advice on how long it could take?

Do you think a future evolution of swarm could accept an existing clustering result as input?

Regards

Maria

frederic-mahe commented 4 years ago

Hi @mariabernard thanks for trying swarm.

We've tested swarm with d = 1 on datasets of up to 350 million unique sequences, so there is room for growth on your side if you need it.

Computation time when clustering with d > 1 is proportional to the square of the number of sequences. That's the main problem with clustering: each sequence needs to be compared with most of the others, which is quadratic complexity. We've managed to solve that problem for d = 1 (swarm with d = 1 is fast, as you can see), but not for d > 1, and I don't think we will make any progress on this anytime soon.
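To put that in perspective, a rough back-of-the-envelope calculation (the numbers are hypothetical, not a benchmark):

```python
# Back-of-the-envelope scaling for a quadratic algorithm: going from
# n1 to n2 sequences multiplies the runtime by (n2 / n1) ** 2.

def quadratic_scaling(n1: int, n2: int) -> float:
    """Factor by which runtime grows when the input grows from n1 to n2."""
    return (n2 / n1) ** 2

# Hypothetical numbers: if clustering 1 million sequences with d > 1
# took 1 hour, then 25 million sequences would take about 625 hours
# (~26 days), all else being equal.
print(quadratic_scaling(1_000_000, 25_000_000))  # 625.0
```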

Now, it depends on what you are trying to do, but you might not need that second round with d = 3. In my own research, I don't perform a second round of clustering. I use swarm with d = 1 and the fastidious option, which gives results close to those of d = 2, but much faster and without losing clustering resolution.
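A minimal sketch of that invocation (file names are hypothetical; the flags are standard swarm options):

```python
# Run swarm with d = 1 and the fastidious option.
import subprocess

subprocess.run(
    [
        "swarm",
        "-d", "1",                       # one difference
        "-f",                            # fastidious (only valid with d = 1)
        "-t", "4",                       # number of threads
        "-z",                            # if abundances use the ";size=" style
        "-o", "amplicons.swarms",        # cluster membership
        "-s", "amplicons.stats",         # per-cluster statistics
        "-w", "representatives.fasta",   # one representative per cluster
        "amplicons.fasta",
    ],
    check=True,
)
```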

I never filter out reads before clustering. I prefer to apply filters after clustering, on representative sequences, when arbitrary filters are less likely to eliminate important things. After clustering, I eliminate rare clusters that occur in only one sample (singletons and some doubletons, as in most popular pipelines). I also eliminate low-quality sequences, chimeras, possibly sequences that are too short or too long, sequences without a taxonomic assignment, etc. Usually that is enough to drastically reduce the number of clusters (often by 99%).
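As an illustration, a sketch of the first of those filters, dropping rare clusters confined to a single sample. The cluster-by-sample table layout and the file names are assumptions for the example, not something swarm produces directly:

```python
# Drop rare clusters that occur in only one sample, given an abundance
# table with clusters as rows and samples as columns (assumed layout).
import pandas as pd

otu = pd.read_csv("cluster_table.tsv", sep="\t", index_col=0)

present_in = (otu > 0).sum(axis=1)   # number of samples each cluster occurs in
total = otu.sum(axis=1)              # total abundance of each cluster

# Discard singletons and doubletons confined to a single sample.
keep = (present_in >= 2) | (total > 2)
otu[keep].to_csv("cluster_table_filtered.tsv", sep="\t")
```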

Recently, I've started to use lulu to merge clusters that are very similar and always co-occur. That's similar to a second round of clustering, but guided by the actual distribution of clusters in your samples. It will further reduce the number of clusters you need to work with.
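For illustration only, a much-simplified sketch of the co-occurrence idea behind lulu (not the actual R package, which also requires a pairwise match list and a sequence-similarity threshold):

```python
# Fold a rare "child" cluster into a more abundant "parent" cluster that
# occurs in every sample where the child occurs (simplified lulu-like idea).
import pandas as pd

otu = pd.read_csv("cluster_table_filtered.tsv", sep="\t", index_col=0)
totals = otu.sum(axis=1).sort_values(ascending=False)

merged = otu.copy()
for child in totals.index[::-1]:             # rarest clusters first
    for parent in totals.index:              # most abundant candidates first
        if parent == child or parent not in merged.index:
            continue
        if totals[parent] <= totals[child]:
            break                            # no more abundant candidates left
        child_present = merged.loc[child] > 0
        parent_present = merged.loc[parent] > 0
        # Co-occurrence test: parent present wherever the child is present.
        if bool((child_present <= parent_present).all()):
            merged.loc[parent] += merged.loc[child]  # merge abundances
            merged = merged.drop(index=child)
            break
```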

I hope this will help you.