torognes / vsearch

Versatile open-source tool for microbiome analysis

Status taxonomy assignment? #73

Closed mdehollander closed 8 years ago

mdehollander commented 9 years ago

Hi,

I read that you are working on a way to assign taxonomy to reads with vsearch: https://github.com/torognes/swarm/issues/45 and https://github.com/torognes/vsearch/issues/34. What is the status? Would it also compute an LCA (last common ancestor) or something like that?

It would also be very nice if it worked with the SILVA database.

frederic-mahe commented 9 years ago

Hi MaTIZ,

I've been experimenting with it and it works quite well. However, the way I do it is rather specific. I am working on 18S rRNA (so eukaryotes, not bacteria) and I use the PR2 database. The advantage of PR2 is that it offers reference sequences clipped with the primers I am working with. I then use vsearch to compute global pairwise alignments with a simple definition of identity (--iddef 1). Here is the command line I use:

"${VSEARCH}" --usearch_global "${QUERIES}" \
    --threads "${THREADS}" \
    --dbmask none \
    --qmask none \
    --rowlen 0 \
    --notrunclabels \
    --userfields query+id1+target \
    --maxaccepts 0 \
    --maxrejects "${MAXREJECTS}" \
    --top_hits_only \
    --output_no_hits \
    --db "${SUBJECTS}" \
    --id "${IDENTITY}" \
    --iddef 1 \
    --userout "${ASSIGNMENTS}" > "${NULL}" 2> "${NULL}"

It outputs a table of best hits, which can be parsed to compute the last common ancestor (I wrote a Python script for that). If you want to do the same with the SILVA database, you might want to use a different identity definition that gives less weight to terminal gaps (see --iddef in the documentation).
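To illustrate why the identity definition matters for full-length references such as SILVA, here is a minimal Python sketch. It assumes, from my reading of the vsearch manual, that --iddef 1 is (matching columns) / (alignment length) and that --iddef 2 is the same ratio with terminal gaps excluded. The function name `identities` and the toy alignment are hypothetical, and this is not vsearch's own implementation.

```python
def identities(query_row: str, target_row: str) -> dict:
    """Identity of two rows of a global pairwise alignment ('-' marks a gap)."""
    if len(query_row) != len(target_row) or not query_row:
        raise ValueError("aligned rows must be non-empty and of equal length")
    columns = list(zip(query_row, target_row))

    def count_matches(cols):
        # A column matches when both rows carry the same residue (not a gap).
        return sum(1 for q, t in cols if q == t and q != "-")

    # Trim leading and trailing columns that contain a gap (terminal gaps).
    start, end = 0, len(columns)
    while start < end and "-" in columns[start]:
        start += 1
    while end > start and "-" in columns[end - 1]:
        end -= 1
    internal = columns[start:end]

    return {
        # Assumed --iddef 1: terminal gaps count as part of the alignment.
        "iddef1": count_matches(columns) / len(columns),
        # Assumed --iddef 2: terminal gaps are ignored.
        "iddef2": count_matches(internal) / len(internal) if internal else 0.0,
    }


# A short amplicon aligned against a longer reference (toy data).
scores = identities("--ACGTACGT--", "TTACGTACGTAA")
print(scores)
```

With references clipped to the primers, as in PR2, the two definitions should give similar values. With longer references, the terminal gaps lower the --iddef 1 score even when the amplicon itself matches perfectly.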

colinbrislawn commented 9 years ago

Are you searching your database using your swarm centroids or all sequences in each swarm/OTU?

> It outputs a table of best hits, which can be parsed to compute the last common ancestor.

Something like finding the most specific taxonomic level that all hits share? There are different ways of finding the LCA, and the hits would change greatly depending on your search parameters...

frederic-mahe commented 9 years ago

I used to do the taxonomic assignment for all my amplicons and use that information as a way to check the quality of the OTUs produced by swarm. I now completely trust swarm results, and I only assign OTU representatives. I use a simple approach to compute the last common ancestor: strict consensus. If an amplicon is equally distant to references assigned to genus/speciesA and genus/speciesB, it will be assigned to "genus". That strict approach puts all the pressure on the quality of the reference dataset.
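The strict consensus described above can be sketched in a few lines of Python. This is not the script mentioned earlier in the thread, only an illustration under stated assumptions: the table is tab-separated with the columns query, identity, target (as in `--userfields query+id1+target`), every row for a query is an equally good best hit, a query without any hit has `*` as its target, and each reference header is an accession followed by a space and a `|`-separated taxonomic path. The header layout and the names `strict_consensus` and `assign` are hypothetical.

```python
import io
from collections import defaultdict


def strict_consensus(paths):
    """Return the longest prefix shared by all taxonomic paths."""
    consensus = []
    for ranks in zip(*paths):  # walk the paths rank by rank
        if len(set(ranks)) != 1:  # first disagreement ends the consensus
            break
        consensus.append(ranks[0])
    return consensus


def assign(table):
    """Map each query to the strict consensus of its best hits."""
    hits = defaultdict(list)
    for line in table:
        line = line.rstrip("\n")
        if not line:
            continue
        query, _identity, target = line.split("\t")
        if target == "*":  # assumed marker for a query without any hit
            hits[query]  # register the query with an empty hit list
            continue
        # Hypothetical header layout: "accession rank1|rank2|..."
        _accession, _, taxonomy = target.partition(" ")
        hits[query].append(taxonomy.split("|"))
    return {query: strict_consensus(paths) for query, paths in hits.items()}


# Toy table: amplicon1 is equally close to two species of the same genus.
example = (
    "amplicon1\t98.5\tREF1 Eukaryota|Alveolata|GenusA|GenusA_speciesA\n"
    "amplicon1\t98.5\tREF2 Eukaryota|Alveolata|GenusA|GenusA_speciesB\n"
    "amplicon2\t0.0\t*\n"
)
result = assign(io.StringIO(example))
print(result)
```

In this toy table, amplicon1 is resolved to the genus level only, and amplicon2 stays unassigned. A single mislabeled reference among the best hits would shorten the consensus, which is why the strict approach depends so heavily on the quality of the reference dataset.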

frederic-mahe commented 8 years ago

Torbjørn and I decided that taxonomy assignment with vsearch is not a priority. vsearch can already perform fast comparisons of environmental sequences against reference sequences. A script can then parse the results and compute the LCA in a flexible way (reference databases can be in multiple formats). Besides similarity-based assignment, there are other approaches, such as naive Bayesian classifiers and phylogenetic placement, for which fast open-source solutions are available.

colinbrislawn commented 8 years ago

:+1:

> Then, a script can parse the results and compute LCA in a flexible way (reference databases can be in multiple formats).

For reference, here are some programs which can perform this.