torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0

Abundance annotation not recognised #90

Closed a1an77 closed 7 years ago

a1an77 commented 7 years ago

Hi, when using the following header I get an error stating "Abundance annotations not found...".

>AB022186.1.610 Archaea...Methanosphaera;Methanosphaera sp. R6_400

There seems to be an issue in recognising it, even though it ends with an abundance annotation of 400.

Swarm 2.1.9 [Nov 22 2016 23:35:56]

a1an77 commented 7 years ago

Replacing spaces with _ seems to solve the problem, so it is just a matter of handling spaces in the string splitter.

torognes commented 7 years ago

Swarm will first split the header line at the first space. The initial part is considered the sequence identifier, while the rest is considered an optional description. The abundance has to be at the end of the identifier, before the space. By default there has to be an underscore ("_") before the number indicating the abundance. If you specify the "-z" option to Swarm, it will recognise abundance information in the usearch style, that is, with ";size=123;" in a part (preferably at the end) of the identifier (where 123 is the abundance).
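As a rough illustration of the rules described above (not swarm's actual code, which is not shown in this thread), the header parsing could be sketched in Python as follows. The function name `parse_header`, the `usearch_style` parameter (standing in for the "-z" option), and both regular expressions are assumptions made for this sketch.

```python
import re

# Assumed patterns, derived only from the description in this thread.
UNDERSCORE_ABUNDANCE = re.compile(r"_([0-9]+)$")      # default style: ..._INT at the end
USEARCH_ABUNDANCE = re.compile(r";size=([0-9]+);?")   # usearch style: ;size=INT;


def parse_header(header_line, usearch_style=False):
    """Return (identifier, abundance) for one fasta header line.

    Raises ValueError when the identifier carries no abundance annotation.
    """
    text = header_line.rstrip("\n")
    if text.startswith(">"):
        text = text[1:]
    # Split at the first space: the first part is the identifier,
    # the rest is an optional description that is ignored here.
    identifier = text.split(" ", 1)[0]
    pattern = USEARCH_ABUNDANCE if usearch_style else UNDERSCORE_ABUNDANCE
    match = pattern.search(identifier)
    if match is None:
        raise ValueError(
            f"abundance annotation not found in identifier {identifier!r}"
        )
    return identifier, int(match.group(1))
```

With the header from this issue, the identifier seen by such a parser is "AB022186.1.610", which has no trailing "_INT", so the lookup fails even though the full line ends with "_400".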

The use of a space to separate the important identifier from the rest of the description is fairly common. We could add an option to not split the header at the first space, similar to the "-notrunclabel" in usearch and vsearch.

a1an77 commented 7 years ago

The issue seems to be more complicated when we also consider the output: it looks like each cluster is written as a single line in which identifiers are separated by spaces. In that case, an identifier containing a space would make it impossible to retrieve a correct list of identifiers from the output.
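To make the concern concrete, here is a minimal Python sketch, assuming the output format is as described in this comment (one cluster per line, identifiers separated by spaces). The function name `parse_cluster_line` and the identifiers are made up for the example.

```python
def parse_cluster_line(line):
    """Split one line of cluster output into a list of identifiers."""
    return line.split()


# Three identifiers without spaces: the split recovers exactly three items.
clean_line = "seqA_10 seqB_3 seqC_1"
clean_ids = parse_cluster_line(clean_line)

# If the second identifier were "seq B_3" (with a space), the same line
# could no longer be split back into the three original identifiers.
ambiguous_line = "seqA_10 seq B_3 seqC_1"
ambiguous_ids = parse_cluster_line(ambiguous_line)

print(len(clean_ids), len(ambiguous_ids))
```

The second line yields four items instead of three, and nothing in the output tells the reader which pieces belong together.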

a1an77 commented 7 years ago

I also missed the definition in the manual (*). Indeed, spaces have to be dealt with before sending sequences to swarm; I guess a mapping is then needed as a pre-processing step.

(*) The amplicon identifier is defined as the string comprised between the ">" symbol and the first space or the end of the line, whichever comes first.
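Such a pre-processing step could look like the following Python sketch. It is only one possible approach: the function names `sanitize_headers` and `restore_identifiers` are hypothetical, and the choice of "_" as the replacement character follows the workaround mentioned earlier in this thread. It assumes the original header really ends with an abundance annotation.

```python
def sanitize_headers(lines, replacement="_"):
    """Replace spaces in fasta header lines and record a reverse mapping.

    Returns (sanitized_lines, mapping), where mapping goes from each
    sanitized identifier back to the original header text.
    """
    sanitized = []
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:]
            new_id = original.replace(" ", replacement)
            # Two different headers must not collapse to the same identifier.
            if new_id in mapping and mapping[new_id] != original:
                raise ValueError(f"identifier collision: {new_id!r}")
            mapping[new_id] = original
            sanitized.append(">" + new_id)
        else:
            # Sequence lines are passed through unchanged.
            sanitized.append(line)
    return sanitized, mapping


def restore_identifiers(cluster_line, mapping):
    """Map the identifiers of one cluster line back to the original headers."""
    return [mapping.get(identifier, identifier) for identifier in cluster_line.split()]
```

The sanitized lines would be written to a new fasta file and passed to swarm, and the mapping would be applied to the cluster output afterwards.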

a1an77 commented 7 years ago

Maybe a warning that headers contain spaces, which could cause problems, would be enough to catch the attention of a new user who has not read the manual thoroughly, in case her "identifiers" already end with "_N" by chance, with N not being an abundance value. Otherwise this can just be closed as a non-issue IMO.

frederic-mahe commented 7 years ago

Hi, we already have the following error message:

Error: Abundance annotations not found for 1 sequences, starting on line 1.
Fasta headers must end with abundance annotations (_INT or ;size=INT).
The -z option must be used if the abundance annotation is in the latter format.
Abundance annotations can be produced by dereplicating the sequences.

It is already quite clear, but maybe a way to improve it would be to show to the user how swarm sees the first faulty header. For example:

Error: Abundance annotations not found for 1 sequences, starting on line 1.
>AB022186.1.610
Fasta headers must end with abundance annotations (_INT or ;size=INT). A header
is defined as the string comprised between the ">" symbol and the first space
or the end of the line, whichever comes first. The -z option must be used if the
abundance annotation is in the ;size=INT format. Abundance annotations can be
produced by dereplicating the sequences.

The definition of a header and how it is parsed can also be added to make a very comprehensive error message.

colinbrislawn commented 7 years ago

"show to the user how swarm sees the first faulty header"

👏 That provides the kind of 'just in time teaching' that empowers users to fix this problem.

torognes commented 7 years ago

The error message has been improved as suggested.