torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Option to output cluster member ids/sequence id? #179

Closed: koh-joshua closed this issue 8 months ago

koh-joshua commented 8 months ago

Is there an option to simply output the members (sequence ids) of each cluster to a file? At the moment, it looks like the only way to find out which sequences belong to a cluster is to output all the fasta sequences within that cluster.

For example, if seq1_100 and seq2_400 are both members of cluster 1, how can I retrieve their ids (seq1_100 and seq2_400) from the cluster? Once I have the ids, it is easy to map each one back to its fasta sequence.

I believe this is similar to what VSEARCH generates as one of the outputs.

frederic-mahe commented 8 months ago

hello @koh-joshua

It seems that swarm's default output corresponds to what you need:

printf ">seq1_100\nAA\n>seq2_400\nAC\n>seq3_10\nGG\n" | \
    swarm 2> /dev/null 
seq2_400 seq1_100
seq3_10

If you don't want clusters to be printed on your terminal, you can redirect the output or name an output file with this command-line option:

-o, --output-file

output clustering results to filename. Results consist of a list of clusters, one cluster per line. A cluster is a list of amplicon headers separated by spaces. That output format can be modified by the option --mothur (-r). Default is to write to standard output.
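
For downstream use, the format described above (one cluster per line, headers separated by spaces) can be read back with a few lines of Python. This is only a minimal sketch and not part of swarm: the function names `parse_swarm_clusters` and `member_to_cluster` are hypothetical, and it assumes the default output format (not `--mothur`) and headers that contain no spaces.

```python
from typing import Dict, Iterable, List


def parse_swarm_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Return one list of amplicon headers per non-empty line.

    Assumes swarm's default output: one cluster per line,
    headers separated by spaces (hypothetical helper, not swarm API).
    """
    clusters = []
    for line in lines:
        headers = line.split()
        if headers:  # skip blank lines
            clusters.append(headers)
    return clusters


def member_to_cluster(clusters: List[List[str]]) -> Dict[str, int]:
    """Map each header to the 1-based index of the cluster it belongs to."""
    return {
        header: index
        for index, members in enumerate(clusters, start=1)
        for header in members
    }


# The two lines below mirror the example output shown earlier in this thread.
example = ["seq2_400 seq1_100\n", "seq3_10\n"]
clusters = parse_swarm_clusters(example)
lookup = member_to_cluster(clusters)
print(clusters)
print(lookup["seq1_100"])
```

To read a real result file, an open file handle can be passed instead of the list, since iterating over a file yields its lines. The resulting lookup table gives the cluster number for any sequence id, which can then be matched against the original fasta headers.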

koh-joshua commented 8 months ago

Thank you so much! Thank you for VSEARCH and SWARM!

frederic-mahe commented 8 months ago

Thanks @koh-joshua

This was already covered indirectly by our test suite, but I've added three specific tests for completeness (https://github.com/frederic-mahe/swarm-tests/commit/f0b5b734abb2fa240e2fa17a286e7b1c69643339)