torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

uc file question #172

Closed dougwyu closed 1 year ago

dougwyu commented 2 years ago

i have a question about the size= information.

I ran swarm with -d 1 and -f options, and i get query & centroid information in the *.uc file as follows:

F-11-8--64_65_53;size=7071 F-11-2--19007_11562_11980;size=12628875

F-11-8--64_65_53 is parsed as SAMPLE--PCR1_PCR2_PCR3, where PCR1|2|3 is the number of reads of that haplotype in each PCR. In the input fasta, size = PCR1+PCR2+PCR3. So size= should still be 64+65+53 = 182, not 7071.

but 7071 ≠ 12628875 either, so i'm trying to figure out where 7071 comes from.

Is 7071 the cluster size after initial clustering but before fastidious clustering?

frederic-mahe commented 2 years ago

hello @dougwyu

swarm accepts either _[1-9][0-9]*$ or ;size=[1-9][0-9]*;?$ abundance annotations (using regular expression notation), not both. This is controlled by the option -z.

When using the default abundance annotation (_[1-9][0-9]*$), if there are several pattern occurrences (e.g., s_64_65_53), then only the last one (_53) is used. Swarm does not sum the successive patterns.
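To illustrate the two annotation styles described above, here is a minimal Python sketch that applies the quoted regular expressions to a few headers. The function name `parse_abundance` and its `usearch_style` parameter are hypothetical, chosen for this illustration only; the sketch shows what the patterns match, not how swarm is implemented.

```python
import re

# The two patterns quoted in the comment above. How swarm applies them
# internally is not shown in this thread; this only illustrates the regexes.
UNDERSCORE_STYLE = re.compile(r"_([1-9][0-9]*)$")
SIZE_STYLE = re.compile(r";size=([1-9][0-9]*);?$")


def parse_abundance(header, usearch_style=False):
    """Return the abundance matched at the end of a header, or None.

    usearch_style=True selects the ;size= pattern (hypothetical stand-in
    for running with -z); otherwise the underscore pattern is used.
    """
    pattern = SIZE_STYLE if usearch_style else UNDERSCORE_STYLE
    match = pattern.search(header)
    return int(match.group(1)) if match else None


# Only the last underscore-delimited number is anchored to the end of the header.
print(parse_abundance("s_64_65_53"))  # 53
# With the ;size= style, everything before the annotation is ignored.
print(parse_abundance("F-11-8--64_65_53;size=182", usearch_style=True))  # 182
# Read strictly, the pattern does not allow a space after "size=".
print(parse_abundance("F-11-8--64_65_53;size= 182", usearch_style=True))  # None
```

The last call only reflects a strict reading of the regular expression; it says nothing about what swarm itself does with such a header.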

Also, note that the uc output gives you the number of hits in your cluster (line C, column 3), not the sum of all abundances in your cluster (which would be 9 + 1 = 10 in this toy example):

printf ">s1_9\nAA\n>s2_1\nAT\n" | \
    swarm -d 1 -f -u - -o /dev/null -l /dev/null 
C   0   2   *   *   *   *   *   s1_9    *
S   0   2   *   *   *   *   *   s1_9    *
H   0   2   50.0    +   0   0   2M  s2_1    s1_9
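The difference between the number of hits and the summed abundance can be checked with a short Python sketch over the three records above. It assumes the uc fields are tab-separated and that abundances use the default underscore annotation; the function name `cluster_totals` is hypothetical.

```python
import re
from collections import defaultdict

# The three uc records from the toy example, written with tab separators
# (an assumption about the file layout made for this sketch).
UC_TEXT = "\n".join([
    "C\t0\t2\t*\t*\t*\t*\t*\ts1_9\t*",
    "S\t0\t2\t*\t*\t*\t*\t*\ts1_9\t*",
    "H\t0\t2\t50.0\t+\t0\t0\t2M\ts2_1\ts1_9",
])

# Default abundance annotation: the last _<number> at the end of the label.
ABUNDANCE = re.compile(r"_([1-9][0-9]*)$")


def cluster_totals(uc_text):
    """Count members and sum abundances per cluster from S and H records."""
    members = defaultdict(int)
    abundance = defaultdict(int)
    for line in uc_text.splitlines():
        fields = line.split("\t")
        if fields[0] not in ("S", "H"):
            continue  # C records summarise the cluster; skip them here
        cluster = int(fields[1])
        match = ABUNDANCE.search(fields[8])  # column 9 holds the sequence label
        members[cluster] += 1
        abundance[cluster] += int(match.group(1)) if match else 0
    return dict(members), dict(abundance)


members, abundance = cluster_totals(UC_TEXT)
print(members)    # {0: 2}
print(abundance)  # {0: 10}
```

The member count (2) matches column 3 of the C record, while the summed abundance (10) has to be recomputed from the labels.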

Regarding the size=7071 and size=12628875 values in your example, I would need to see your original input to know why swarm is outputting these particular values.

dougwyu commented 2 years ago

sorry, i was a bit unclear.

in my fasta headers, i am assuming that swarm is ignoring everything before the semicolon and only takes size= 182 as the sequence's abundance (given that i'm using -z)

the input sequence header looked like this:

F-11-8--64_65_53;size= 182

but i found that after running swarm -d, the uc output looks like this:

F-11-8--64_65_53;size=7071 F-11-2--19007_11562_11980;size=12628875

somehow,

F-11-8--64_65_53;size= 182

was changed to

F-11-8--64_65_53;size= 7071

thus, i thought that size=7071 is the post clustering size for that OTU but before fastidious joining to this OTU:

F-11-2--19007_11562_11980;size=12628875

torognes commented 2 years ago

This looks odd. I do not think swarm will change the headers of the original sequences in the uc output files. The original header and abundance should be kept.

Which version of swarm are you running?

Could you show the entire command line?

dougwyu commented 2 years ago

swarm --threads 21 -d 1 -f -z \
    Filter_min1PCRs_min1copies_S1_forusearch.fas \
    -u Filter_min1PCRs_min1copies_S1_forusearch_swarm.uc \
    --statistics-file Filter_min1PCRs_min1copies_S1_forusearch_swarmstats.txt \
    --seeds Filter_min1PCRs_min1copies_S1_forusearch_swarm.fas

dougwyu commented 2 years ago

swarm-3.1.0-linux-x86_64

frederic-mahe commented 2 years ago

> but i found that after running swarm -d, the uc output looks like this:
>
> F-11-8--64_65_53;size=7071 F-11-2--19007_11562_11980;size=12628875

@dougwyu your swarm command line seems ok, but your UC output is strange (it should be an array, like in my comment above).

We have a set of tests for the UC output, and I've tried different things to replicate your issue, including trying different styles of line separators (Windows, Linux or macOS), but without any success so far.

Could you please try to make a small file, maybe with the two input sequences F-11-8--64_65_53 and F-11-2--19007_11562_11980? The idea is to make a minimal, shareable example we could use to understand what's happening.

frederic-mahe commented 1 year ago

@dougwyu please feel free to re-open this issue if need be.