torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Residual variability in swarm clustering results #80

Closed frederic-mahe closed 7 years ago

frederic-mahe commented 8 years ago

@fescudie reported a (small) variability in swarm clustering results when the input order is modified. That variability arises when a rare amplicon can be captured by two more abundant amplicons. Because the clustering process is sequential (and binary), the rare amplicon can be assigned to only one of the more abundant amplicons, which creates a slight asymmetry. The toy example below summarizes the situation:

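# Assumes the swarm binary is on the PATH; adjust SWARM otherwise.
SWARM=$(command -v swarm)
CLUSTERS_A=$(mktemp)
CLUSTERS_B=$(mktemp)
# Same three amplicons; only the order of a_2 and b_2 differs.
printf ">a_2\nAA\n>b_2\nTT\n>c_1\nAT\n" | \
    "${SWARM}" -o "${CLUSTERS_A}" 2> /dev/null > /dev/null
printf ">b_2\nTT\n>a_2\nAA\n>c_1\nAT\n" | \
    "${SWARM}" -o "${CLUSTERS_B}" 2> /dev/null > /dev/null
# cmp -s prints nothing: an exit status of 0 means identical results.
cmp -s "${CLUSTERS_A}" "${CLUSTERS_B}" && echo "identical" || echo "different"
rm "${CLUSTERS_A}" "${CLUSTERS_B}"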

In that example, swarm produces two different clustering results:

AA--AT  TT
or
AA  AT--TT

If we strengthen swarm's internal sorting by also sorting on headers, swarm results will always be the same, whatever the input order of the FASTA entries (provided that the FASTA headers are all unique). That's the only way I can think of to eliminate that residual variability, but the choice is completely arbitrary.
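
The header-based tie-break can be sketched outside of swarm. This is only an illustration, not swarm's actual code: sort_entries is a hypothetical helper, and it assumes two-line FASTA entries, headers of the form ">name_abundance", and ascending header order as the (arbitrary) secondary key.

```shell
# Hypothetical helper, not part of swarm: order FASTA entries by
# decreasing abundance, then by header to break ties.
# Assumes two-line entries and headers shaped like ">name_abundance".
sort_entries() {
    # Join each header with its sequence, prepend the abundance,
    # sort on (abundance descending, header ascending), then restore
    # the two-line FASTA layout.
    paste - - | \
        awk -F'\t' '{n = split($1, parts, "_"); print parts[n] "\t" $1 "\t" $2}' | \
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2 | \
        cut -f 2,3 | \
        tr '\t' '\n'
}

# The two input orders from the toy example.
ORDER_A=$(printf ">a_2\nAA\n>b_2\nTT\n>c_1\nAT\n" | sort_entries)
ORDER_B=$(printf ">b_2\nTT\n>a_2\nAA\n>c_1\nAT\n" | sort_entries)

# Both inputs now reach the clustering step in the same order.
[ "${ORDER_A}" = "${ORDER_B}" ] && echo "same order"
```

With unique headers the sort key is total, so any permutation of the input yields the same processing order. Which of a_2 or b_2 ends up capturing c_1 is still decided by the arbitrary header order, not by anything biological.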

frederic-mahe commented 7 years ago

@fescudie: implemented in swarm 2.1.13.