torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Visualize swarms for d>1 #97

Closed steph43 closed 7 years ago

steph43 commented 7 years ago

I'm using visualizations of a few target taxa to explore the differences between d=1 and d=2. graph_plot.py seems to be working fine for d=2, but I am unclear as to how the graphics are defined.

From Swarm V2 (2015) "Edges in these networks only represent the parameter d used; the length of the edges carries no information. The nodes in the networks represent amplicons."

It seems that the term amplicons is used to describe unique sequences. If this is the case, it seems that either each node in the network must represent a binning of multiple amplicons (each either d=2 or d=1 from their nearest neighbor node), or the edges in the network must represent both distances of 2 and distances of 1.

I haven't been able to parse out what's happening from the script (I'm not very familiar with Python!) but I'm hoping you can explain how it works. Thanks!

frederic-mahe commented 7 years ago

Hi @steph43,

Yes, amplicon means unique sequence. Another sequence with at least one difference will be represented by another node in the graphical representation. When using d = 2, edges can represent distances of either 1 or 2, without any visual distinction.
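To make the "edges can represent distances of either 1 or 2" point concrete, here is a minimal sketch. It is not taken from graph_plot.py or from swarm itself: the function names `edit_distance` and `edges_within`, the all-pairs comparison, and the toy sequences are assumptions made purely for illustration. It treats each unique sequence as one node and lists every pair of nodes separated by 1 to d differences.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def edges_within(amplicons, d):
    """Return (a, b, distance) for every pair differing by 1 to d steps.

    Hypothetical helper: one node per unique sequence, one edge per
    pair whose distance does not exceed d.
    """
    edges = []
    for a, b in combinations(amplicons, 2):
        distance = edit_distance(a, b)
        if 1 <= distance <= d:
            edges.append((a, b, distance))
    return edges


# Toy data (made up): four unique sequences, hence four nodes.
amplicons = ["ACGT", "ACGA", "ACTA", "TTTT"]

for d in (1, 2):
    print(f"d = {d}")
    for a, b, distance in edges_within(amplicons, d):
        print(f"  {a} -- {b} ({distance} difference(s))")
```

With these toy sequences the node count is the same for both values of d, while d = 2 adds a direct edge between two amplicons that are two differences apart. A plot that draws every edge the same way would not show which edges are which.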

Let's assume you have only one isolated OTU, with the same number of amplicons whatever the d value used for clustering. All graphical representations (d = 1, d = 2, d = 3, etc) will have the same number of nodes, but the number of segmented paths will decrease as d increases (I hope I did not make things less clear).

frederic-mahe commented 7 years ago

I am going to close this issue. Feel free to re-open it if my answer does not completely cover the initial question.

dcm9123 commented 5 years ago

Hi! I have a question regarding this matter. When I run swarm in fastidious mode, my -d will automatically be 1 (so it will look for at least one SNP between reads, correct?). However, when I visualize the clusters using graph_plot.py and set -d 3, does that mean that the clusters visualized (could I call them haplotypes?) will only be the ones with at least 3 SNPs or indels? Also, when my central OTU has a bunch of amplicons marked (say 500) and I find a node away from it with no number, does that mean that node has only 1 amplicon different from the others? In that case, would I have 2 haplotypes/clusters, or just one that is away from the central OTU?

Thanks!

frederic-mahe commented 5 years ago

Swarm's default is to link amplicons with a single difference (one insertion, deletion, or substitution); that's -d 1. With the fastidious option, swarm will also allow a double difference (two insertions, deletions, or substitutions) to link low-abundance amplicons to the closest cluster, assuming that intermediate amplicons were not observed for stochastic reasons.

If you produce visualization plots for clusters obtained with the -d 1 fastidious option, then amplicons connected during the fastidious phase will be "floating" around (no edge). This is because the Python script does not take into account edges representing more than one step, i.e. more than d (amplicons linked during the fastidious phase have 2 differences with the amplicon they are connected to).

If you produce visualization plots for clusters obtained with -d 2 or more, the edges in the graph will represent at most d differences (could be 1, 2, ... up to d).
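The "floating" behaviour can be sketched as follows. This is an illustration only, not the actual graph_plot.py logic: the function `floating_nodes`, the node labels, and the difference counts are all hypothetical. The idea is that an edge is drawn only when two amplicons differ by at most d, so an amplicon attached during the fastidious phase (2 differences) ends up with no edge when the plot uses d = 1.

```python
def floating_nodes(nodes, pairwise_differences, d):
    """Return the nodes that end up with no edge when only pairs
    with at most d differences are drawn (hypothetical helper)."""
    connected = set()
    for (a, b), differences in pairwise_differences.items():
        if differences <= d:
            connected.update((a, b))
    return [node for node in nodes if node not in connected]


# Made-up cluster: "grafted" stands for a low-abundance amplicon that
# was attached during the fastidious phase, 2 differences from the seed.
nodes = ["seed", "variant_1", "grafted"]
pairwise_differences = {
    ("seed", "variant_1"): 1,
    ("seed", "grafted"): 2,
    ("variant_1", "grafted"): 3,
}

print(floating_nodes(nodes, pairwise_differences, d=1))  # ['grafted']
print(floating_nodes(nodes, pairwise_differences, d=2))  # []
```

Under these assumptions, plotting with d = 1 leaves the grafted amplicon without an edge, while a cluster built and plotted with d = 2 would show it connected.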

dcm9123 commented 5 years ago

Great! Thanks a lot! That clarifies it! (: