torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks source link

incomplete creation of microvariants (short sequences) in swarm 3.0 #125

Closed frederic-mahe closed 5 years ago

frederic-mahe commented 5 years ago

I am on a roll :-). This is covered by the test 53 in fixed_bugs.sh:

function microvariants() {
    local SEQ="${1}"
    local -i LENGTH=${#SEQ}
    for ((i=0 ; i<=LENGTH ; i++)) ; do
        ## insertions
        for n in A C G T ; do 
            echo ${SEQ:0:i}${n}${SEQ:i:LENGTH}
        done
        if (( i > 0 )) ; then 
            ## deletions
            echo ${SEQ:0:i-1}${SEQ:i:LENGTH}
            ## substitutions
            for n in A C G T ; do
                echo ${SEQ:0:i-1}${n}${SEQ:i:LENGTH}
            done
        fi
    done    
}

## produce a fasta set with a seed and all its first-level microvariants (with a few duplicates)
microvariants "ACGT" | \
    awk '{print ">s"NR"_1\n"$1}' | \
    swarm

swarm 2.2 reports only one OTU, as expected, while swarm 3.0 reports three OTUs and three representative sequences:

>s10_38
ACGT
>s30_1
ACGT
>s40_1
ACGT
torognes commented 5 years ago

Swarm 3 does not work correctly with non-dereplicated input for d=1, yet.

It complicates the algorithm if it needs to take the potential presence of exact duplicates into account, and may slow it down.

Would it be acceptable to just stop after detecting duplicates and ask the user to dereplicate first? Or should it handle duplicates properly?

Would any serious user run it without dereplicating first?

frederic-mahe commented 5 years ago

Hum, I see. I would need some time to think, but it might be OK to ask users for strictly dereplicated input files.

We would need a very clear error message though, maybe with a command-line example on how to dereplicate with vsearch.

Also, we may have to remove references to d = 0, which was a possibility with swarm 2.

torognes commented 5 years ago

Well, dereplication could still be done with Swarm 3 with d=0.

torognes commented 5 years ago

If there are duplicates, would it be okay to join them together under the id of one of them and summing their abundances?

For example if you had the following sequences:

>a_1
acgt
>b_5
acgt
>c_1
gtga

Would it then be okay to treat them as if they were like this:

>a_6
acgt
>c_1
gtga

Essentially, it would be like using the result of an internal dereplication as the input.

It may be possible to keep track of the names and abundances of the original sequences and report them as well in the various output formats.

frederic-mahe commented 5 years ago

yes, swarm 3 already does that when using -d 0:

printf ">a_1\nacgt\n>b_5\nacgt\n>c_1\ngtgat\n" | \
    swarm -d 0 -o /dev/null -l /dev/null -w -

The -o output gives the list of sequences in each cluster. Notice that the most abundant entry is the cluster representative, as expected.

torognes commented 5 years ago

Yes, my example should have been:

>b_6
acgt
>c_1
gtga
frederic-mahe commented 5 years ago

Do you think the -d 0 algorithm could be optimized? like you did for vsearch's dereplication to reduce its memory footprint?

frederic-mahe commented 5 years ago

The synopsis for -d 0 could be:

swarm -d 0 [-rz] [-a int] [-i filename] [-l filename] [-o filename] \
    [-s filename] [-u filename] [-w filename] [filename]

(exclude options -f, -n, and -t, in comparison with -d 1)

torognes commented 5 years ago

Yes, a similar dereplication algorithm as in vsearch could be included.

frederic-mahe commented 5 years ago

I've written unit tests for the -d 0 synopsis (https://github.com/frederic-mahe/swarm-tests/commit/717f1ca3def6cc5173258d20b6331433ea84aac2), along with valgrind tests (https://github.com/frederic-mahe/swarm-tests/commit/45ed000da3f315594626a6d6c73561ae5dc3e217), and I've updated swarm.1.

Now the only missing piece is an error message in swarm 3.0 indicating that some sequences are duplicated when d != 0. What do you think of this message?

Error: some fasta entries have identical sequences.
Swarm expects dereplicated fasta files.
Such files can be produced with swarm or vsearch:
    swarm -d 0 -w dereplicated.fasta input.fasta > /dev/null
    or
    vsearch --derep_fulllength input.fas --sizein --sizeout --output dereplicated.fas
When clustering the resulting dereplicated fasta file with swarm,
beware of the abundance annotation format ("_" or ";size=") .

The error should be reported as soon as possible. Can it be done during the fasta parsing process?

colinbrislawn commented 5 years ago

That's a great error message. Short and sweet, with examples of how to fix it. 👍

I'm not sure I understand this line... does swarm still support both formats?

beware of the abundance annotation format ("_" or ";size=") .

frederic-mahe commented 5 years ago

Thanks for yourfeedback @colinbrislawn

What I was trying to convey is the idea that vsearch and swarm use different abundance annotation formats (_ or ;size=) by default. For example, if dereplication is done with vsearch, then swarm clustering must use the -z option. Alternatively, vsearch cannot parse swarm's abundance annotation format (_).

I welcome any suggestion to make that error message clearer.

frederic-mahe commented 5 years ago

Maybe the abundance annotation format can be covered by another error message. Let's remove the last two lines then:

Error: some fasta entries have identical sequences.
Swarm expects dereplicated fasta files.
Such files can be produced with swarm or vsearch:
    swarm -d 0 -w dereplicated.fasta input.fasta > /dev/null
    or
    vsearch --derep_fulllength input.fas --sizein --sizeout --output dereplicated.fas
torognes commented 5 years ago

Good, but perhaps we should write ".fasta" and not ".fas", and abbreviate something else to fit within 80 characters:

Error: some fasta entries have identical sequences.
Swarm expects dereplicated fasta files.
Such files can be produced with swarm or vsearch:
    swarm -d 0 -w derep.fasta input.fasta > /dev/null
    or
    vsearch --derep_fulllength input.fasta --sizein --sizeout --output derep.fasta
frederic-mahe commented 5 years ago

or fold the code and reduce indentation?

Error: some fasta entries have identical sequences.
Swarm expects dereplicated fasta files.
Such files can be produced with swarm or vsearch:
  swarm -d 0 -w dereplicated.fasta input.fasta > /dev/null
  or
  vsearch --derep_fulllength input.fasta \
            --sizein --sizeout --output dereplicated.fasta
torognes commented 5 years ago

Commit 2b9e97e804bbe8958f4f435c2fef1357c9499cd3 in the zobrist branch (Swarm 3) now gives a fatal error message if the input is not properly dereplicated.

torognes commented 5 years ago

Zobrist branch renamed to swarm3.

frederic-mahe commented 5 years ago

Good. I now realize that there is a fatal error only when -d 1.

printf ">s1_2\nA\n>s2_1\nA\n" | swarm

For higher d values, there is only a warning:

printf ">s1_2\nA\n>s2_1\nA\n" | swarm -d 2

What do you think? Should we enforce dereplication for higher d values too?

colinbrislawn commented 5 years ago

I think forcing dereplication is a good idea, as it prevents ambiguity about what will happen with the sequences. And the consistent pipeline makes the error messages easier to write.


Heck, you could make things really clear by requiring vsearch style size annotations and throwing an error if none are found. While verbose, the usearch / vsearch size annotations are impossible to misinterpret, and depreciating the _size annotations would make pipelines fasta files totally unambiguous about sizes.

For example, qiime 1 used this in the fasta headers:

sampleName_readNumberFromThatSample

But this is really clear:

anything_anythingelse; size=145;


EDIT: #131 would have been solved by only accepting size=x; annotations.

torognes commented 5 years ago

I think dereplication should be enforced for d>1 as well, so I'll change the code accordingly. It seems inconsistent to make it different for d=1 and d>1. The results could be bad in both cases.

torognes commented 5 years ago

Commit bb860d3f12b41b0a06e62cf2c331d811bddeb355 enforces strictly dereplicated sequences for both d=1 and d>1.

frederic-mahe commented 5 years ago

Perfect. This is now covered by unit tests.

frederic-mahe commented 5 years ago

@colinbrislawn When we started swarm, other clustering tools (well, at least DNAclust) used _INT for abundance values, and ;size=INT was not as dominating as it is now. We could mark _INT as deprecated in future versions of swarm, and switch to ;size=INT exclusively for swarm v4. But I think making dereplication mandatory in swarm v3 might be enough change for now.

colinbrislawn commented 5 years ago

Thanks for telling me about the history of this. I understand now!