torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

Wrong cluster id in the uc output #108

Closed yoann-dufresne closed 6 years ago

yoann-dufresne commented 6 years ago

I'm was experiencing a weird bug in a uc file parser. The number of sequences in the clusters was not the same that the number of sequence entered in swarm and you'll see soon why.

Here is a subsample of the uc file that I got:
C 227575 15 ISU_721155;size=1;
S 227575 340 ISU_721155;size=1;
H 376617 340 99.7 + 0 0 340M ISU_744801;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_721945;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_722021;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_722248;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_722447;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_722729;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_722885;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_864488;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_721326;size=1; ISU_721155;size=1;
H 376617 340 99.4 + 0 0 340M ISU_723148;size=1; ISU_721155;size=1;
H 376617 340 98.8 + 0 0 340M ISU_723254;size=1; ISU_721155;size=1;
H 376617 340 99.1 + 0 0 340M ISU_887926;size=2; ISU_721155;size=1;
H 376617 340 99.1 + 0 0 340M ISU_722451;size=1; ISU_721155;size=1;
H 376617 340 99.1 + 0 0 340M ISU_723476;size=1; ISU_721155;size=1;

As you can see, all the sequences have the same "centroid" label "ISU_721155;size=1;". But all the H lines are considered as part of the 376617 cluster when the centroid himself is in the cluster 227575. So, the centroid of the cluster isn't in the same cluster !
When I looked further on a few examples, the concerned centroids seem to always be alone in their cluster.

So, for now I'll use the centroid name instead of the cluster number as unique identifier.

Yoann

torognes commented 6 years ago

Thank you for reporting this issue. This looks weird. Could you please indicate the version of Swarm you were running as well as all command line options? If it is possible to provide the 14 sequences in the example, that would also valuable.

yoann-dufresne commented 6 years ago

Version: Swarm v2.1.13

Command: swarm input.fasta -o /dev/null -u swarm.uc -z -w centroids.fasta -t 8 -f

Sequences

frederic-mahe commented 6 years ago

Hi, I've been trying to make a minimal example that triggers the bug. We already have unit tests to check that the OTU number is correct, but I had no success modifying them to reproduce Yoann's bug.

So far, I've noticed that if the fasta entry ISU_887926;size=2; is removed from Yoann's toy sample, then the OTU number is correct for all hits.

frederic-mahe commented 6 years ago

Hi,

I've minimized Yoann's example:

printf ">a_1\nTGGA\n>b_2\nTTTT\n>c_1\nTTGA\n>d_1\nCTGA\n" | \
    swarm -f -u swarm.uc &> /dev/null
cat swarm.uc

The UC output is indeed incorrect (OTU number is 1 instead of 0):

C   0   4   *   *   *   *   *   a_1 *
S   0   4   *   *   *   *   *   a_1 *
H   1   4   75.0    +   0   0   4M  c_1 a_1
H   1   4   50.0    +   0   0   4M  d_1 a_1
H   1   4   25.0    +   0   0   4M  b_2 a_1

The internal structure is as such:

a   c   1   1   1
c   d   1   1   2
c   b   2   1   2

a-c-d
  |
  b

If the abundance of b is set to 1 instead of 2, then the output is correct. For some reason, removing the sequence d breaks the cluster.

torognes commented 6 years ago

Swarm shows the wrong cluster number on the H lines of UC files when using fastidious mode. The number shown is the original cluster number, before the fastidious grafting.

The problem was easy to correct and I have committed a fix.

torognes commented 6 years ago

The problem seems to have been there since the fastidious option was introduced.

torognes commented 6 years ago

For some reason, removing the sequence d breaks the cluster.

This is because the heavy swarm (with just a and c) is reduced in weight from 3 to 2 and is not considered a heavy swarm anymore. No grafting occurs in that case.

The incorrect cluster numbers are those of the light swarms that are grafted onto heavy swarms.

frederic-mahe commented 6 years ago

OK. As I don't use it myself, the UC output was not completely covered by unit tests. I've pushed a new test to swarm-tests to make sure that bug will never come back. Many thanks to @yoann-dufresne for his report!

yoann-dufresne commented 6 years ago

You're welcome ! I'll do it again if necessary :)

torognes commented 6 years ago

Fixed in version 2.2.0 just released.