feature request: assume abundance = 1 when missing from sequence names

nhoffman commented 9 years ago

Hi -

I'd like to suggest that assuming abundance = 1 when abundance annotation is missing from sequence names would make it much easier to use swarm itself to perform strict dereplication. The readme suggests that "Swarm can perform a dereplication for you (with options -d 0 -w)" - but this isn't exactly true (at least in the way I'd expect), since swarm -w dereplicated.fasta -d 0 amplicons.fasta exits with the error "Abundance annotation not found..." in the absence of abundance information (at least with version 2.1.1). Without this hard requirement for abundance annotation, swarm could be used in two passes without any external tools.

Thanks! Noah

torognes commented 9 years ago

Thanks for the suggestion. I think it sounds like a good feature to add. Will consider including it in the next release.

deprekate commented 9 years ago

I just came in to create this exact request. I spent the last while trying to figure out the "Error: Abundance annotation not found" error. Then figured out I can make it stop if I add the _1 to the end of each read header.

And yes the ./swarm amplicons.fasta in the "Quick Start" section, will only work if your fasta file is formatted in the correct format, which is misleading.

colinbrislawn commented 9 years ago

@frederic-mahe What is the relationship between swarm and vsearch, moving forward? Why duplicate functionality?

I understand that both vsearch and swarm offer competing / complementary / different clustering algorithms. This reflects fundamentally different OTU construction approaches of @frederic-mahe and Robert Edgar (who is not on github).

However, swarm and vsearch also offer identical functionality (dereplication), which is causing this complication.

Have you considered making swarm match the spec. of usearch / vsearch for size annotations? Have you considered making vsearch a dependency of swarm? (You tell people to dereplicate using vsearch and remove / depreciate dereplication with swarm? After all, vsearch now offers fast and flexible dereplication options.)

torognes commented 9 years ago

@colinbrislawn, just a quick note: if you use the -z option in Swarm, it will accept abundance values in usearch format (e.g. fasta headers like this: ">abc;size=123").

colinbrislawn commented 9 years ago

I have used the -z option with Swarm because I always use usearch size annotations.

I guess my suggestion is to make Swarm use usearch size annotations on all inputs and outputs, removing confusion between vsearch and Swarm.

frederic-mahe commented 9 years ago

Hi @nhoffman, hi @deprekate,

you are right, the documentation is misleading. I've uploaded a new version of the README and I've made a few adjustments to the manpage too. I hope it is clearer now. You probably figure it out, but you can very easily add fake abundance annotations to your fasta headers using the ubiquitous sed command:

sed '/^>/ s/$/_1/' input.fasta > output.fasta

(replace _1 with ;size=1 if you are using the -z option)

I am not sure swarm should assume abundances of 1 when abundance annotations are missing. Abundance annotations are really important for swarm, and I'd rather spend hours trying to clarify the documentation than let users run swarm unknowingly and get weird results.

Writing good documentation is very difficult. Strangely, it is even more difficult if you created the tool (and English is not your mother tongue). That's why your feedback is deeply appreciated.

Thanks,

frederic-mahe commented 9 years ago

Hi @colinbrislawn,

Have you considered making swarm match the spec. of usearch / vsearch for size annotations?

Swarm predates vsearch. The abundance annotation format we first used in swarm is the same than DNAclust (_INTEGER). At the time, it was the best open source solution available for clustering, and it seemed logical to adopt the same format. Meanwhile, usearch became dominant and we added the possibility to use usearch annotations (;size=INTEGER).

If you suggest we made the >abc;size=123 annotation the default one, then yes it might indeed remove some confusion. A solution to avoid a change in behavior might be to automatically detect the ;size=INTEGER pattern in swarm v3. The -z option would become redundant and could be marked as deprecated. Latter, it could be eliminated completely.

What is the relationship between swarm and vsearch, moving forward? Why duplicate functionality? [...] However, swarm and vsearch also offer identical functionality (dereplication), which is causing this complication.

The -d 0 option was not really intended to be used as the main way to dereplicate sets of sequences. It was intended as a logical and natural extension of the d parameter: what happens if I allow zero differences among my sequences? Well, then it would only group strictly identical sequences. Similarly to what you would expect if you'd run vsearch with a clustering threshold of 100%.

I removed the statement about dereplication with -d 0 from the README (I kept it in the man page for more advanced users).

Have you considered making vsearch a dependency of swarm? (You tell people to dereplicate using vsearch and remove / depreciate dereplication with swarm? After all, vsearch now offers fast and flexible dereplication options.)

Yes, we can show how to use vsearch to quickly prepare data for swarm. I will do that when the new version of vsearch will be published (soon™). Command line examples will be rewritten to use vsearch. I will also make a wiki page presenting the pipeline I use for my own analyses: from FASTQ files to OTU tables, using pear, cutadapt, vsearch, swarm and some python scripts.

Thanks,

nhoffman commented 9 years ago

Hi @frederic-mahe - thanks for the follow up. I agree that it's trivial to modify the sequence names, and the clarification in the docs will help other new users. Using vsearch for dereplication solves the problem in any case.

One parting comment on the format of abundance annotation (not at all a feature request, just my perspective): it appears that the community is fine with modifying sequence ids to provide annotation, and I don't expect that this will change. However, modifying sequence names at all has always made me uneasy, as the use of such annotations relies on assumptions about name formats and makes data provenance a bit more challenging. Also, on occasion I find that one tool or another objects to names that are too long, or those that include semicolons or some other non-alphanumeric character, or that wants to modify names further for its own purposes.

I prefer the approach taken by pplacer, which is to provide abundance annotation in a separate file in csv format (http://matsen.github.io/pplacer/generated_rst/guppy_redup.html#guppy-redup). This has the advantage of minimizing the need for text manipulation, preserves sequence names, and provides a mechanism for annotating the results of dereplication following sample pooling (I'm actually a bit stumped about how to dereplicate pooled specimens otherwise - a question that I'll address in a separate issue).

Best regards, Noah

deprekate commented 9 years ago

I would also agree the problems of changing the sequence ID, which can cause lots of problems with pipeline workflows, when a constant ID is needed. Not to mention provenance. Also the fact that paired end data require /1 /2 to be at the end rather than _X can potentially mess up things downstream.

Before I figured out the sed 's/ /_1 /' trick, I was thinking I needed an abundance file, which I was imagining creating with:

grep "^>" INFILE.fna | cut -d" " -f1 | sed 's/$/\t1/'  > ABUNDANCES.tsv

But if you were to incorporated the ability of an abundance file into SWARM , then you need to add command line options and requiring the correct format (tabs/commas/equal signs/etc), so a separate file might not be the best.

As an alternative might I suggest a single command line option to allow for missing abundance information, and then to set those missing abundance to INTEGER?

  -a, --append-abundance INTEGER           use this value for any missing abundance values (1)

This would allow for the behavior both mhoffman and I were expecting, without affecting any current functionality (it would also only take a couple lines of code). And this way users only get weird results if they explicitly force them.

(also there is a typo in the Quick Start README.md: "were sequence identifiers are unique" should be "where sequence identifiers are unique")

frederic-mahe commented 9 years ago

Hi @deprekate, thanks for your suggestion.

-a, --append-abundance INTEGER           use this value for any missing abundance values (1)

I think it is an interesting solution (good for advanced users, and no hidden behavior for newbies). What do you think @torognes? The value would have to be a positive integer (maybe limited to a long int).

Here is an approach I often use and for which that new option would be useful. I often pool environmental sequences with reference sequences (trimmed with the same primers). By definition, reference sequences don't have abundance values, so I have to add fake values. When doing so, one can aim for two slightly different things:

check if some clusters of environmental sequences contain reference sequences (you just need your reference sequences to have a fake abundance of 1),
force some of the environmental sequences to clusterize around the reference sequences (you need the reference sequences to have a fake abundance value higher than any environmental sequences).

In the second approach, reference sequences always act as cluster seeds, whereas in the first approach reference sequences can act as seeds only in very small clusters.

PS: thanks for the typo. I pushed a correction.

colinbrislawn commented 9 years ago

@frederic-mahe

force some of the environmental sequences to clusterize around the reference sequences (you need the reference sequences to have a fake abundance value higher than any environmental sequences).

Interesting! Using Swarm for 'open-reference' OTU clustering. The qiime devs will love this!

frederic-mahe commented 9 years ago

Hi @torognes, maybe we could make this error message more informative for users?

Error: Abundance annotation not found for 5 sequences

Maybe we could add the following sentences?

Error: Abundance annotation not found for 5 sequences
Fasta headers must end with abundance annotations (_INT or ;size=INT)
Abundance annotations can be produced by dereplicating the sequences

Please feel free to improve wording.

colinbrislawn commented 9 years ago

@frederic-mahe

I think your wording is excellent!

Error: Abundance annotation not found for 5 sequences Fasta headers must end with abundance annotations (_INT or ;size=INT) Abundance annotations can be produced by dereplicating the sequences

Alternatively, you could have swarm throw this error the first time it encounter a read without size annotations.

Error: Abundance annotation not found for sequence at line 1130 Fasta headers must end with abundance annotations (_INT or ;size=INT) Abundance annotations can be produced by dereplicating the sequences

This would catch the use case where a user concatenates two fasta files, only one of which is annotated. It stops swarm from continuing to read a large, unannotated file.

torognes commented 9 years ago

This issue should be fixed in the swarm version 2.1.2. I have added the -a/--append-abundances optio as suggested. Automatic detection of abundance format (_INT or ;size=INT) is not performed. If an abundance value is missing an improved error message is now shown indicating the number of sequences missing abundance information as well as the line number of the first instance.

deprekate commented 9 years ago

Great. Works well with the -a option. One question though, if I use the -a 1 flag, but a sequence id happens to have an underscore and then digits before the whitespace, will the underscore and digits take precedence over the -a 1?

as an example in one of my test files the sequences look like:

>SRR059992.1_HWI-6X_10148_FC61V2D_8_1_1153_1100
ACTGACTGACTGACTG

If by chance the colons were all underscores, would the above sequence be seen as abundance 1100, even if the -a 1 flag was used?

torognes commented 9 years ago

Yes, if there is an underscore and some digits at the end of the header, those digits will be used. However, in this case, if you specified the -z option (for usearch style abundance annotation), no annotation would be found and the value specified with the -a option should be used.

nhoffman commented 9 years ago

Thanks for being so responsive to this request! I like the solution that you settled on.

torognes commented 9 years ago

Thanks

nhoffman commented 9 years ago

A minor suggestion: -a/--append-abundance should fail with an error message if not provided an argument. The current behavior (for users not paying attention like myself!) is to silently consume the other arguments (eg, swarm -a -w somefile.fasta uses somefile.fasta as the input file).

torognes commented 9 years ago

Thanks for the suggestion! Good idea. Checks of all numeric option arguments have now been added. A fatal error will occur if a numeric argument is missing.

torognes / swarm

feature request: assume abundance = 1 when missing from sequence names #59