torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Empty internal files after running swarm with -d 0 #148

Closed dcm9123 closed 4 years ago

dcm9123 commented 4 years ago

Hi! I've been looking at the output from Swarm while tuning different parameters. I've tried -d 1, -d 2, and -d 3. However, when I run it with -d 0, I get empty internal files, so I cannot plot the network with graph_plot.py. Do you know why this could be happening?

Thanks in advance!

frederic-mahe commented 4 years ago

Hi @dcastaneda5, this is expected. If your input file is properly dereplicated, swarm does not link amplicons when run with -d 0. There are no clusters and, consequently, no internal structure, so the -i output files are empty.
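To illustrate why a dereplicated input leaves nothing to link at -d 0, here is a minimal Python sketch. It is a brute-force illustration only, not how swarm computes links internally; the helper names `edit_distance` and `links` and the three example sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def links(seqs, d):
    """All pairs of entries whose sequences differ by at most d edits."""
    return [(x, y) for x, y in combinations(seqs, 2)
            if edit_distance(x, y) <= d]


# Hypothetical dereplicated input: every sequence is unique.
derep = ["ACGT", "ACGA", "TCGA"]

print(links(derep, 0))  # no pair is identical, so no links
print(links(derep, 1))  # pairs that differ by exactly one edit
```

With unique sequences, every pair differs by at least one edit, so the d = 0 list is empty, which matches the empty -i files.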

dcm9123 commented 4 years ago

So, shouldn't I be getting clusters with the abundance values across amplicons that are exactly the same across each other? and linked to other clusters by 1 difference? If not, is there any way I could do this? Thanks!

frederic-mahe commented 4 years ago

> getting clusters with the abundance values across amplicons that are exactly the same across each other?

Hi @dcastaneda5, sorry for the late reply. I am not sure I understand your goal here. You seem to be trying to use swarm with a partially dereplicated dataset. Are you trying to work with individual fasta files (one per sample, for example) without dereplicating them globally?

dcm9123 commented 4 years ago

Hi Frederic,

I just realized that Swarm was not designed to work with non-dereplicated reads, correct? I ask because I was trying to find the OTU clusters that had no differences between them and were linked to other clusters by only one difference, hence the -d 0. Basically the dereplication process removes equal reads and then clusters abundance reads with -d to at least 1, right?

Thanks for your insight.

frederic-mahe commented 4 years ago

> Basically the dereplication process removes equal reads and then clusters abundance reads with -d to at least 1, right?

The dereplication process merges reads that are strictly identical. Swarm can do that with -d 0 -w new_representatives.fasta but the recommended way to perform that dereplication step is with vsearch (https://github.com/torognes/swarm#dereplication-mandatory). Once you have a dereplicated fasta file, you can try higher d values to link sequences with up to d differences.
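As a rough illustration of what "merging strictly identical reads" means, here is a minimal Python sketch. It is not the vsearch or swarm implementation; the function names, the example reads, and the `seqN;size=N` header layout are assumptions chosen for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Merge strictly identical reads and count how often each occurs.

    Returns (sequence, abundance) pairs, most abundant first;
    ties are broken alphabetically to keep the output deterministic.
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_fasta(derep):
    """Format dereplicated entries as fasta text (hypothetical header layout)."""
    lines = []
    for n, (seq, size) in enumerate(derep, start=1):
        lines.append(f">seq{n};size={size}")
        lines.append(seq)
    return "\n".join(lines)


# Hypothetical raw reads with duplicates.
reads = ["ACGT", "ACGA", "ACGT", "TCGA", "ACGA", "ACGT"]

derep = dereplicate(reads)
print(to_fasta(derep))
```

After this step each sequence appears once with its total abundance, which is the kind of input on which d values of 1 or more can then link near-identical sequences.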

If that is OK with you, I am going to close this issue. Feel free to open another issue if you encounter another problem.