torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Remove abundance from identifiers in output file #91

Closed a1an77 closed 7 years ago

a1an77 commented 7 years ago

Hi, since the _abundance part of the input identifier is swarm-specific, there should be an option to remove it from the output and get the original identifiers back.

IMO removing them should even be the default behaviour, with the option used to keep them for backward compatibility.

frederic-mahe commented 7 years ago

Thanks for your input @a1an77, we may consider the change you are suggesting in a future version. Please note that including _ABUNDANCE or ;size=ABUNDANCE in output is not swarm-specific; other clustering tools do the same.

Meanwhile, you can get what you want with a simple command:

sed -r 's/;size=[0-9]+;//g' in.swarm > out.swarm
# or in place (keep -i and -r separate: GNU sed reads -ir as -i with backup suffix "r")
sed -i -r 's/;size=[0-9]+;//g' in.swarm

You can even pipe swarm's output and avoid a temporary file:

swarm | sed -r 's/;size=[0-9]+;//g' > out.swarm
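
The commands above only match the ;size=ABUNDANCE; style. For output that uses the _ABUNDANCE style mentioned in the original request, a variant along these lines might work. This is a minimal sketch, not part of swarm: `strip_abundance` is a hypothetical helper name, and it assumes identifiers are separated by single spaces and that the abundance is the last part of each identifier.

```shell
# Hypothetical helper (not part of swarm): strip abundance annotations from
# space-separated identifiers read on stdin.
# Assumption: each identifier ends in "_N" or ";size=N" (optional trailing ";").
# Only the suffix directly before a space or end of line is removed, so
# underscores inside an identifier (e.g. "x_1_7;size=3;") are left alone.
strip_abundance() {
    sed -E 's/(_[0-9]+|;size=[0-9]+;?)( |$)/\2/g'
}

# Example with one line per style
printf 'seqA_10 seqB_2\nseqC;size=5; seqD;size=1;\n' | strip_abundance
# prints:
# seqA seqB
# seqC seqD
```

Note that an original identifier which itself ends in an underscore followed by digits, with no abundance appended, would lose that suffix too, so check a few lines of output before relying on it.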

frederic-mahe commented 7 years ago

Abundance annotations are not swarm-specific. Other operations, such as de novo chimera detection with uchime, also require abundance annotations. We would like to keep swarm as lean as possible, so we are not going to add a specific option for removing abundance annotations. Hopefully, the simple shell commands presented above will cover most use cases. It is also possible to strip abundance annotations with vsearch and the --xsize option.

Added to the frequently asked questions.