torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

Seeds not in same order as clusters in output files #175

Closed andnischneider closed 1 year ago

andnischneider commented 1 year ago

Hi,

I am currently running swarm v 3.1.3 and have previously (swarm v2.xx) used the sequences from the seeds output file (--seeds) to assign taxonomy to my clusters (--output-file), which was made easy by them being in the same order, i.e. the first sequence in the seeds file would be the first line in the cluster result file. Now it seems like the order is not the same between the two files anymore, or am I missing something? I am attaching my input file (asvs_clean.fa), the resulting output files (results.txt; seeds.fa), and this was my command:

swarm -d 3 --output-file swarm/results.txt --seeds swarm/seeds.fa swarm/asvs_clean.fa -s swarm/stats.txt Looking at the third sequence in the seeds fasta file, it corresponds to cluster 7 (line 7) in the results file, and not the third cluster in the results file (like I expected from previous swarm behaviour). Is this intended? swarm313.zip

Cheers Andreas

frederic-mahe commented 1 year ago

Hi @andnischneider

yes, the seed output behaves differently in swarm 3.

Clusters are written to output files (specified with -i, -o, -s, -u and -w) in the order they are generated, with the exception of the seeds (-w). Seeds are sorted by decreasing abundance, and then by alphabetical order of the sequence labels.

That change was introduced in 2019 (see issue #133). Sorting seeds by decreasing abundance values is important for some swarm-based pipelines, so reverting back to the old behavior is unlikely.

If you need clusters and seeds to be in the exact same order, may I suggest to sort the cluster file according to the seed order? Here is a relatively fast bash script:

SEEDS="seeds.fa"
RESULTS="results.txt"
SORTED_RESULTS="results_sorted.txt"

sed \
    --silent \
    '/^>/ { s/^>/^/ ; s/[1-9][0-9]*$/[1-9][0-9]*/ ; p }' \
    "${SEEDS}" | \
    while read PATTERN ; do
        grep \
            --max-count=1 \
            --word-regexp "${PATTERN}" \
            "${RESULTS}"
    done > "${SORTED_RESULTS}"
andnischneider commented 1 year ago

Hi @frederic-mahe ,

Alright, thank you for clarifying that this is indeed a wanted behavior and not a bug. And thanks for the script, I will either use that or find a similar approach in R.