Closed dougwyu closed 1 year ago
hello @dougwyu
swarm accepts either _[1-9][0-9]*$ or ;size=[1-9][0-9]*;?$ abundance annotations (using regular expression notation), not both. This is controlled by the option -z.
When using the default abundance annotation (_[1-9][0-9]*$), if there are several pattern occurrences (e.g., s_64_65_53), then only the last one is used (_53). Swarm does not sum the successive patterns.
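To make the "last occurrence only" behaviour concrete, here is a minimal sketch that applies the two regular expressions quoted above with sed. The function names abundance_default and abundance_usearch are hypothetical helpers, not part of swarm, and this only illustrates the patterns; it is not swarm's actual parser.

```shell
# Hypothetical helpers (not part of swarm): extract the abundance from a
# header using the two patterns quoted above.

# Default style: the digits after the last underscore at the end of the header.
abundance_default() {
    printf '%s\n' "$1" | sed -nE 's/.*_([1-9][0-9]*)$/\1/p'
}

# usearch style (-z): the digits after ";size=", with an optional trailing ";".
abundance_usearch() {
    printf '%s\n' "$1" | sed -nE 's/.*;size=([1-9][0-9]*);?$/\1/p'
}

abundance_default 's_64_65_53'                  # prints 53, not 64+65+53
abundance_usearch 'F-11-8--64_65_53;size=182'   # prints 182
```

Under these patterns, a header written as size= 182 (with a space after the equals sign) does not match the ;size= form, so abundance_usearch prints nothing for it. Whether that space exists in the real input file or only in the post is something this sketch cannot tell.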
Also, note that the uc output gives you the number of hits in your cluster (line C, column 3), not the sum of all abundances in your cluster (which would be 9 + 1 = 10 in this toy-example):
printf ">s1_9\nAA\n>s2_1\nAT\n" | \
swarm -d 1 -f -u - -o /dev/null -l /dev/null
C 0 2 * * * * * s1_9 *
S 0 2 * * * * * s1_9 *
H 0 2 50.0 + 0 0 2M s2_1 s1_9
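If the total abundance per cluster is what you are after, it can be recomputed from the uc output. The sketch below assumes the default _N abundance annotation and labels without whitespace; sum_uc_abundances is a hypothetical helper name, and the input is the toy output shown above.

```shell
# Hypothetical helper: sum the abundances of all members (S and H records)
# of each cluster in a uc file read from stdin. Assumes "_N" annotations.
sum_uc_abundances() {
    awk '
        $1 == "S" || $1 == "H" {
            n = $9                 # column 9: query label, e.g. s2_1
            sub(/.*_/, "", n)      # keep only what follows the last underscore
            total[$2] += n         # column 2: cluster number
        }
        END { for (c in total) printf "%s\t%d\n", c, total[c] }
    '
}

# Toy uc output from the example above: cluster 0 sums to 9 + 1 = 10.
printf 'C\t0\t2\t*\t*\t*\t*\t*\ts1_9\t*\nS\t0\t2\t*\t*\t*\t*\t*\ts1_9\t*\nH\t0\t2\t50.0\t+\t0\t0\t2M\ts2_1\ts1_9\n' |
    sum_uc_abundances
```

The C record is skipped on purpose: its third column counts cluster members, while the per-sequence abundances live in the labels of the S and H records.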
Regarding the size=7071 and size=12628875 values in your example, I would need to see your original input to know why swarm is outputting these particular values.
sorry, i was a bit unclear.
in my fasta headers, i am assuming that swarm is ignoring everything before the semicolon and only takes size= 182 as the sequence's abundance (given that i'm using -z)
the input sequence header looked like this:
F-11-8--64_65_53;size= 182
but i found that after running swarm -d, the uc output looks like this:
F-11-8--64_65_53;size=7071 F-11-2--19007_11562_11980;size=12628875
somehow, F-11-8--64_65_53;size= 182 was changed to F-11-8--64_65_53;size= 7071
thus, i thought that size=7071 is the post clustering size for that OTU but before fastidious joining to this OTU:
F-11-2--19007_11562_11980;size=12628875
This looks odd. I do not think swarm will change the headers of the original sequences in the uc output files. The original header and abundance should be kept.
Which version of swarm are you running?
Could you show the entire command line?
swarm --threads 21 -d 1 -f -z \
    Filter_min1PCRs_min1copies_S1_forusearch.fas \
    -u Filter_min1PCRs_min1copies_S1_forusearch_swarm.uc \
    --statistics-file Filter_min1PCRs_min1copies_S1_forusearch_swarmstats.txt \
    --seeds Filter_min1PCRs_min1copies_S1_forusearch_swarm.fas
swarm-3.1.0-linux-x86_64
but i found that after running swarm -d, the uc output looks like this:
F-11-8--64_65_53;size=7071 F-11-2--19007_11562_11980;size=12628875
@dougwyu your swarm command line seems ok, but your UC output is strange (it should be an array, like in my comment above).
We have a set of tests for the UC output, and I've tried different things to replicate your issue, including trying different styles of line separators (Windows, Linux or macOS), but without any success so far.
Could you please try to make a small file, maybe with the two input sequences F-11-8--64_65_53 and F-11-2--19007_11562_11980? The idea is to make a minimal, shareable example we could use to understand what's happening.
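One possible way to build such a file is sketched below. It assumes each sequence sits on a single line (no wrapped FASTA) and that every header has the form >LABEL;size=N; extract_records is a hypothetical helper name, and minimal.fas is just an example output name.

```shell
# Hypothetical helper: print the FASTA records whose headers start with the
# given labels. Assumes single-line sequences and ">LABEL;..." headers.
extract_records() {
    # $1: input FASTA file; remaining arguments: labels without the ">"
    input=$1
    shift
    for label in "$@"; do
        # -F: fixed string, -A 1: also print the sequence line after the header
        grep -A 1 -F -- ">${label};" "$input" | grep -v '^--$'
    done
}

# Example call (file name taken from the command line in this thread):
# extract_records Filter_min1PCRs_min1copies_S1_forusearch.fas \
#     F-11-8--64_65_53 F-11-2--19007_11562_11980 > minimal.fas
```

Running the same swarm command on the resulting file should then show whether the odd size= values can be reproduced with only these two records.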
@dougwyu please feel free to re-open this issue if need be.
i have a question about the size= information.
I ran swarm with -d 1 and -f options, and i get query & centroid information in the *.uc file as follows:
F-11-8--64_65_53;size=7071 F-11-2--19007_11562_11980;size=12628875
F-11-8--64_65_53 is parsed as SAMPLE--PCR1_PCR2_PCR3, where PCR1, PCR2 and PCR3 are the numbers of reads of that haplotype in each PCR. In the input fasta, size = PCR1 + PCR2 + PCR3. So size= should still be 64+65+53 = 182, not 7071
but 7071 ≠ 12628875 either, so i'm trying to figure out where 7071 comes from.
Is 7071 the cluster size after initial clustering but before fastidious clustering?
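For reference, the expected size= value under the naming scheme described above can be recomputed from the header itself. This is a minimal sketch assuming headers of the form SAMPLE--PCR1_PCR2_PCR3;size=N; expected_size is a hypothetical helper name.

```shell
# Hypothetical helper: sum the PCR read counts encoded in a header of the
# form SAMPLE--PCR1_PCR2_PCR3;size=N (the ";size=N" part is ignored).
expected_size() {
    printf '%s\n' "$1" | awk -F';' '{
        sub(/^.*--/, "", $1)        # drop everything up to the last "--"
        n = split($1, counts, "_")  # counts[1..n] hold the per-PCR read counts
        s = 0
        for (i = 1; i <= n; i++) s += counts[i]
        print s
    }'
}

expected_size 'F-11-8--64_65_53;size=182'    # prints 182
```

Comparing this value with the size= annotation for every header in the input fasta is a quick way to check whether the input already carries unexpected values before swarm is run.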