Residual variability in swarm clustering results

@fescudie reported a small variability in swarm clustering results when the input order is modified. That variability arises when a rare amplicon can be captured by two more abundant amplicons. Because the clustering process is sequential (and binary), the rare amplicon can be assigned to only one of the more abundant amplicons, whichever is processed first, which creates a slight asymmetry. This can be summarized by the toy example below:

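# SWARM is assumed to hold the path to the swarm binary
CLUSTERS_A=$(mktemp)
CLUSTERS_B=$(mktemp)
# same three amplicons; only the order of a and b differs
printf ">a_2\nAA\n>b_2\nTT\n>c_1\nAT\n" | \
    "${SWARM}" -o "${CLUSTERS_A}" > /dev/null 2>&1
printf ">b_2\nTT\n>a_2\nAA\n>c_1\nAT\n" | \
    "${SWARM}" -o "${CLUSTERS_B}" > /dev/null 2>&1
# cmp -s is silent; its exit status is non-zero when the files differ
cmp -s "${CLUSTERS_A}" "${CLUSTERS_B}" || echo "clustering results differ"
rm "${CLUSTERS_A}" "${CLUSTERS_B}"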

In that example, swarm produces one of two clustering results, depending on the input order:

AA--AT  TT
or
AA  AT--TT

If we strengthen swarm's internal sorting by sorting on headers too (that is, using headers to break ties between equally abundant amplicons), we can make sure swarm results are always the same, whatever the input order of fasta entries (provided that the fasta headers are all unique). That's the only way I can think of to eliminate that residual variability, but the choice of headers as a tie-breaker is completely arbitrary.
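
To make the proposed ordering concrete, here is a minimal sketch done outside of swarm, as a pre-sorting step on the input. It is not swarm's internal implementation: `sort_fasta` is a hypothetical helper, and it assumes `>label_abundance` headers and single-line sequences, as in the toy example above. Entries are sorted by decreasing abundance, and ties are broken by comparing headers.

```shell
# Hypothetical helper (not part of swarm): emit fasta entries sorted by
# decreasing abundance, with ties broken by header.
# Assumes ">label_abundance" headers and one sequence line per entry.
sort_fasta() {
    # join each header with its sequence, prepend the abundance as a sort key
    paste - - | \
        awk -F'\t' '{n = split($1, h, "_"); print h[n] "\t" $1 "\t" $2}' | \
        LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2 | \
        awk -F'\t' '{print $2; print $3}'
}

# the two input orders from the toy example
ORDER_A=$(printf ">a_2\nAA\n>b_2\nTT\n>c_1\nAT\n" | sort_fasta)
ORDER_B=$(printf ">b_2\nTT\n>a_2\nAA\n>c_1\nAT\n" | sort_fasta)

# both inputs now yield the same entry order
[ "${ORDER_A}" = "${ORDER_B}" ] && echo "same order"
```

With that ordering, `a_2` always comes before `b_2`, so both inputs reach the clustering step in the same order. The arbitrariness mentioned above is visible here: nothing biological justifies `a_2` winning over `b_2`, only the header comparison.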
