Closed: mariabernard closed this issue 3 years ago
Hi @mariabernard thanks for trying swarm.
We've tested swarm d = 1 with datasets of up to 350 million unique sequences, so there is room for growth on your side if you need to.
Computation time when clustering with d > 1 grows with the square of the number of sequences. That's the main problem with clustering: you need to compare each sequence with most of the others, which is quadratic complexity. We've managed to solve that problem for d = 1 (swarm with d = 1 is fast, as you can see), but not for d > 1, and I don't think we will make any progress on this anytime soon.
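To illustrate why d = 1 can escape the quadratic all-pairs comparison, here is a minimal sketch (not swarm's actual implementation): instead of comparing every pair of sequences, each sequence's 1-edit "microvariants" are generated and looked up in a hash table, which scales roughly linearly with the number of sequences.

```python
# Sketch: finding all neighbors at edit distance 1 via microvariant
# generation + hashing, instead of all-pairs alignment.
# Illustrative only; swarm's real algorithm is more refined.

ALPHABET = "ACGT"

def microvariants(seq):
    """Return every sequence at edit distance exactly 1 from seq."""
    variants = set()
    for i in range(len(seq)):
        # substitutions at position i
        for c in ALPHABET:
            if c != seq[i]:
                variants.add(seq[:i] + c + seq[i + 1:])
        # deletion of position i
        variants.add(seq[:i] + seq[i + 1:])
    # insertions at every position, including the end
    for i in range(len(seq) + 1):
        for c in ALPHABET:
            variants.add(seq[:i] + c + seq[i:])
    variants.discard(seq)
    return variants

def neighbors_d1(sequences):
    """Map each sequence to the other input sequences within d = 1."""
    index = set(sequences)
    return {s: sorted(microvariants(s) & index) for s in sequences}

# Example: ACGT and ACGA differ by one substitution; ACG by one deletion.
neighbors = neighbors_d1(["ACGT", "ACGA", "ACG", "TTTT"])
```

Each sequence of length L has on the order of a few times L microvariants, so the total work is proportional to the number of sequences rather than its square; this trick does not generalize to d > 1, where the variant space explodes.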
Now, it depends on what you are trying to do, but you might not need that second round with d = 3. In my own research, I don't perform a second round of clustering. I use swarm with d = 1 and the fastidious option, which gives results close to those of d = 2, but much faster and without losing clustering resolution.
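The d = 1 + fastidious run described above might look something like this (file names and thread count are placeholders; check `swarm --help` for the options available in your version):

```shell
# Cluster at d = 1 with the fastidious option (-f), which grafts small
# clusters onto larger neighboring clusters. -z expects ";size=" abundance
# annotations; -w writes one representative sequence per cluster.
swarm -d 1 -f -t 4 -z \
      -o clusters.txt -s statistics.txt -w representatives.fasta \
      input.fasta
```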
I never filter out reads before clustering. I prefer to apply filters after clustering, on representative sequences, when arbitrary filters are less likely to eliminate important things. After clustering, I eliminate rare clusters that occur in only one sample (singletons and some doubletons, as in most popular pipelines). I eliminate low-quality sequences, chimeras, possibly sequences that are too short or too long, sequences without a taxonomic assignment, etc. Usually that is enough to drastically reduce the number of clusters (often by 99%).
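The rare-cluster filter above can be sketched as follows, assuming a per-sample abundance table for each cluster (cluster id mapped to sample read counts); the table layout and thresholds here are illustrative, not from a specific pipeline:

```python
# Post-clustering filter sketch: drop rare clusters confined to a
# single sample (singletons and doubletons), keep everything else.

def filter_rare_clusters(cluster_table, min_samples=2, min_reads=3):
    """Keep clusters present in at least min_samples samples,
    or whose total abundance reaches min_reads."""
    kept = {}
    for cluster_id, counts in cluster_table.items():
        n_samples = sum(1 for n in counts.values() if n > 0)
        total_reads = sum(counts.values())
        if n_samples >= min_samples or total_reads >= min_reads:
            kept[cluster_id] = counts
    return kept

# Example table: cluster id -> {sample: read count}
table = {
    "c1": {"s1": 1},            # singleton in one sample -> dropped
    "c2": {"s1": 2},            # doubleton in one sample -> dropped
    "c3": {"s1": 1, "s2": 1},   # present in two samples  -> kept
    "c4": {"s1": 5},            # abundant in one sample  -> kept
}
kept = filter_rare_clusters(table)
```

Because this runs on cluster representatives rather than raw reads, even a drastic reduction in cluster count is cheap compared with re-clustering.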
Recently, I've started to use lulu to merge clusters that are very similar and co-occur all the time. That's similar to a second round of clustering, but guided by the actual distribution of clusters across your samples. It will further reduce the number of clusters you need to work with.
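A toy version of co-occurrence-guided merging, in the spirit of lulu, might look like this. Note the real lulu package (an R tool) also checks sequence similarity and abundance ratios; this sketch only checks co-occurrence, and all names are hypothetical:

```python
# Sketch: merge a low-abundance cluster into a more abundant one when
# the small cluster never occurs in a sample without the larger one.

def cooccurrence_merge(cluster_table):
    """Return {child_id: parent_id} for clusters absorbed by a parent."""
    totals = {cid: sum(c.values()) for cid, c in cluster_table.items()}
    merges = {}
    for child, child_counts in cluster_table.items():
        child_samples = {s for s, n in child_counts.items() if n > 0}
        # Try candidate parents from most to least abundant.
        for parent in sorted(cluster_table, key=totals.get, reverse=True):
            if parent == child or totals[parent] <= totals[child]:
                continue
            parent_samples = {
                s for s, n in cluster_table[parent].items() if n > 0
            }
            if child_samples <= parent_samples:
                merges[child] = parent
                break
    return merges

# Example: "small" only ever appears where "big" does, so it is merged.
table = {
    "big":   {"s1": 50, "s2": 40},
    "small": {"s1": 3},
    "other": {"s3": 10},
}
merges = cooccurrence_merge(table)
```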
I hope this will help you.
Dear all,
I am trying to cluster a huge number of sequences: 25 million unique sequences (corresponding to 6 different MiSeq runs). My protocol is to perform the clustering in two steps: swarm with d = 1, then swarm with d = 3.
My problem is that on this number of sequences it seems to take an infinite amount of time. I tried 3 different versions of swarm.
Clustering with d = 1 took:
It resulted in 17 million clusters. The second clustering step has been running for 16 days.
Do you have any advice on how long it could take?
Do you think a future evolution of swarm could accept an existing clustering result as input?
Regards
Maria