torognes / vsearch

Versatile open-source tool for microbiome analysis

Sintax taxonomy classifier #210

Open davidealbanese opened 7 years ago

davidealbanese commented 7 years ago

Dear @torognes, are you planning to add the sintax classifier http://biorxiv.org/content/early/2016/09/09/074161?

Thank you, Davide

torognes commented 7 years ago

Thanks for the suggestion. We might add some kind of taxonomic classifier to VSEARCH in the future, but there are no firm plans at the moment.

lanzen commented 6 years ago

Agree that this would be very useful! We have recently developed a SINTAX formatted version of the SilvaMod database based on Silva and part of CREST (https://github.com/lanzen/CREST/tree/master/LCAClassifier). Unfortunately, it cannot be used without a 64-bit license of usearch since it is too large.

GeoMicroSoares commented 6 years ago

I'll second this! Think of the opportunities now that Nanopore sequencing is booming.

chiras commented 6 years ago

@torognes any updates on this? It would be great to have a hierarchical classification procedure similar to utax/sintax.

torognes commented 6 years ago

I still agree that this would be very useful to include and one of the top features to prioritise, but I do not know when I can find time to implement it.

GeoMicroSoares commented 6 years ago

We'll be waiting, or hopefully someone can contribute useful code in the meantime! :) Thanks @torognes

torognes commented 6 years ago

Due to popular demand, I have implemented the sintax command for taxonomic classification.

The sintax command has been added with --sintax_cutoff and --tabbedout options.

It implements the Sintax algorithm as described in Robert Edgar's preprint:

Robert Edgar (2016) SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences BioRxiv, 074161 doi: https://doi.org/10.1101/074161

Further details: https://www.drive5.com/usearch/manual/cmd_sintax.html
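For readers unfamiliar with the method, here is a minimal Python sketch of the bootstrap idea as I understand it from the preprint. It is not the vsearch code. The word length of 8, the 32 sampled words and the 100 iterations are the values I recall from the preprint and should be treated as assumptions. Resolving ties by database order and reporting per-rank agreement with the most frequent lineage are simplifications made for this sketch.

```python
import random
from collections import Counter

K = 8             # word (k-mer) length, assumed from the preprint
SUBSET = 32       # words sampled per bootstrap iteration (assumed)
ITERATIONS = 100  # number of bootstrap iterations (assumed)

def kmers(seq, k=K):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sintax_sketch(query, references, rng=None):
    """references: list of (lineage_tuple, sequence).
    Returns [(name, support), ...], one entry per rank of the predicted lineage."""
    rng = rng or random.Random(0)
    query_words = sorted(kmers(query))
    if not query_words or not references:
        return []
    ref_words = [(lineage, kmers(seq)) for lineage, seq in references]
    top_hits = []
    for _ in range(ITERATIONS):
        # Sample words from the query with replacement.
        sample = [rng.choice(query_words) for _ in range(SUBSET)]
        # Top hit: reference sharing most sampled words (first one wins ties).
        best_lineage, _ = max(ref_words, key=lambda r: sum(w in r[1] for w in sample))
        top_hits.append(best_lineage)
    prediction = Counter(top_hits).most_common(1)[0][0]
    result = []
    for rank in range(len(prediction)):
        agree = sum(t[:rank + 1] == prediction[:rank + 1] for t in top_hits)
        result.append((prediction[rank], agree / ITERATIONS))
    return result
```

The support values are then compared against a cutoff (the --sintax_cutoff option) to decide how deep the reported taxonomy goes.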

Multithreading is supported. Databases in UDB files are supported. Strand option may be specified.
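A small Python sketch of how the command could be assembled and its output read. The file names are placeholders. The parser assumes the tabbedout layout follows the usearch sintax description linked above (query label in the first column, then predictions such as `d:Bacteria(1.00),g:Bacillus(0.85)` in the second), so please check it against real output before relying on it.

```python
import re

def sintax_command(queries, db, out, cutoff=0.8, threads=4, strand="both"):
    """Build the argument list for a vsearch sintax run (placeholder file names)."""
    return ["vsearch", "--sintax", queries, "--db", db,
            "--tabbedout", out, "--sintax_cutoff", str(cutoff),
            "--strand", strand, "--threads", str(threads)]

# Matches entries such as "g:Bacillus(0.85)"; names containing commas or
# parentheses are not handled in this sketch.
RANK_RE = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")

def parse_sintax_line(line):
    """Return (query_label, [(rank, name, support), ...]) for one output line."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    ranks = []
    if len(fields) > 1:
        ranks = [(r, name, float(s)) for r, name, s in RANK_RE.findall(fields[1])]
    return label, ranks
```

The list returned by `sintax_command` can be passed to `subprocess.run` if vsearch is installed.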

This is a new feature that has been only very briefly tested. Feedback is therefore highly welcomed!

davidealbanese commented 6 years ago

Great news, thanks! I will test it soon...

Andreas-Bio commented 6 years ago

There are some issues with the (original) sintax command that prevent me (and potentially others) from using it:

- Self-testing is very inflexible: the whole database can only be self-tested against itself using LOOCV, rather than running LOOCV on selected sequences only.
- The algorithm outputs only the first hit, irrespective of the hits after it, so the hit may be ambiguous (identical top hits) without this being apparent. A hit list like BLAST's would be much more transparent.
- The sintax algorithm is vulnerable to an inconsistent number of sequences per species. If species A has 15 sequences and species B has 1 sequence that is identical to those of species A, then species A will appear in the output but almost never species B when the database is queried with species B.
- The sintax algorithm is inherently length-sensitive. This is unwelcome when the database contains many partial sequences, because a shorter sequence that is identical to (a subset of) a longer sequence will be discriminated against in the results.
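To put a number on the third point, here is a tiny Python sketch. It assumes that ties between identical reference sequences are broken uniformly at random, which is my assumption and not necessarily what usearch or vsearch does. If ties are instead resolved by database order, the outcome depends on which sequence comes first.

```python
from fractions import Fraction

def tie_support(identical_counts):
    """Expected share of top hits per species when all listed reference
    sequences are identical to the query and ties are broken uniformly."""
    total = sum(identical_counts.values())
    return {species: Fraction(n, total) for species, n in identical_counts.items()}

support = tie_support({"species A": 15, "species B": 1})
for species, share in support.items():
    print(f"{species}: {float(share):.1%}")
```

Under that assumption species B would be the top hit in only 1 of 16 cases, even when the query is species B itself.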

If you are re-developing the sintax algorithm maybe some of these issues could be resolved very easily.

chiras commented 6 years ago

Thank you very much, absolutely appreciated!

lanzen commented 6 years ago

Many thanks, Torbjørn,

This will definitely be an important resource for me, especially since the free version of SINTAX cannot even handle a database as large as the latest NR version of SILVA.

Kind regards, Anders

torognes commented 6 years ago

Thanks for your feedback.

So far I have just tried to implement the SINTAX algorithm as described in the preprint. I understand that there is some disagreement about the quality of the algorithm and some issues have been raised. I will look more into these and see if it is possible to improve it or to implement a different algorithm. Please tell me if you have any specific ideas for improvement.

Andreas-Bio commented 6 years ago

Thanks for your reply. I think it is important to have an output like BLAST (https://www.drive5.com/usearch/manual/blast6out.html), with a list of hits. This enables the user to make an informed decision, and at the same time makes LOOCV much easier. Additionally, it will show ambiguous hits (identical k-mers between query and hit) immediately, rather than leaving the user to guess, as is the situation at the moment. It would also make it much easier to spot contaminating sequences (if the whole list is family x but one hit is family y).

Something like:

1) label_query
2) label_hit
3) length_query
4) length_hit
5) percent_similarity_kmers_query_and_hit
6) bootstrap value
7) number of k-mers that are not identical
...?

Most of these numbers must be available internally, and if not, it should be possible to extract them with one line of code.
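A rough Python sketch of what such a hit list could contain, computed outside vsearch from plain k-mer sets. The column names follow the list above. The bootstrap value is left out because it depends on the classifier's internals, and k = 8 is an assumption.

```python
def kmers(seq, k=8):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hit_list(query_label, query_seq, references, k=8):
    """references: dict mapping hit label -> sequence.
    Returns one row per reference, best k-mer match first."""
    query_words = kmers(query_seq, k)
    rows = []
    for label, seq in references.items():
        shared = len(query_words & kmers(seq, k))
        rows.append({
            "label_query": query_label,
            "label_hit": label,
            "length_query": len(query_seq),
            "length_hit": len(seq),
            "percent_shared_kmers": 100.0 * shared / len(query_words) if query_words else 0.0,
            "kmers_not_shared": len(query_words) - shared,
        })
    rows.sort(key=lambda row: row["percent_shared_kmers"], reverse=True)
    return rows
```

With a list like this, several hits at 100% with different taxonomies would show an ambiguous assignment at a glance.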

Leo-alves commented 5 years ago

Hey there. Just to drop my five cents... I have been working on improving taxonomic classification using vsearch (not the sintax algorithm itself) and I am stuck on the same issues that andzandz11 mentioned. Deep-level classifications, in my case to species, are often simply impossible to define, partly because of the 16S variable regions we sequence. I am working with a highly curated database in which I am relatively confident there is no high level of misannotation. Still, for a lot of sequence variants I came across dubious taxonomies, such as the following:

A sequence variant (SV) was previously assigned as Bacillus anthracis by vsearch with --maxaccepts 1 (disclaimer: at that point I ran vsearch as implemented in the Qiime2 workflow). I then reran it against the same db with vsearch outside qiime:

$ vsearch --usearch_global sequence_variant.fa --db db --id 0.99 --blast6out out --maxaccepts 50000

From that I got 811 hits with >=99% identity. Then I ranked the taxonomies by their percentage of those 811 hits.

Taxonomy | percentage
D_5Bacillus;D_6Bacillus_cereus | 36.7
D_5Bacillus;D_6Bacillus_sp. | 35.4
D_5Bacillus;D_6Bacillus_thuringiensis | 12.1
D_5Bacillus;D_6Bacillus_anthracis | 5.9
D_5Bacillus;D_6Bacillus_mycoides | 3.6
D_5Bacillus;D_6Bacillus_subtilis | 1.1
D_5Bacillus;D_6Bacillus_pseudomycoides | 1.0
D_5Streptococcus;D_6Streptococcus_pneumoniae | 0.9
D_5Bacillus;D_6Bacillus_toyonensis | 0.7
D_5unclassified_Bacillaceae;D_6unclassified_Bacillaceae | 0.6
D_5Enterobacter;D_6Enterobacter_cloacae | 0.5
D_5Brevibacillus;D_6Brevibacillus_brevis | 0.2
D_5Bacillus;D_6Bacillus_samanii | 0.2
D_5Bacillus;D_6Bacillus_gaemokensis | 0.2
D_5Staphylococcus;D_6Staphylococcus_sp. | 0.1
D_5Staphylococcus;D_6Staphylococcus_aureus | 0.1
D_5Paenibacillus;D_6Paenibacillus_sp. | 0.1
D_5Bacillus;D_6Bacillus_pumilus | 0.1
D_5Bacillus;D_6Bacillus_marcorestinctum | 0.1
D_5__Bacillus;D_6__Bacillus_amyloliquefaciens | 0.1
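The ranking step can be sketched in Python roughly as follows. It assumes the standard 12-column, tab-separated blast6out layout (query, target, percent identity, ...) and a hypothetical `taxonomy_of` dictionary that maps each database sequence label to its taxonomy string. How that mapping is built depends on the database and is not shown.

```python
from collections import Counter

def rank_taxonomies(blast6_lines, taxonomy_of, min_id=99.0):
    """Count hits per taxonomy and return [(taxonomy, percent_of_hits), ...],
    most frequent first. Hits below min_id percent identity are ignored."""
    counts = Counter()
    for line in blast6_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        target, identity = fields[1], float(fields[2])
        if identity >= min_id:
            counts[taxonomy_of.get(target, "unassigned")] += 1
    total = sum(counts.values())
    return [(tax, 100.0 * n / total) for tax, n in counts.most_common()]
```

For example, `rank_taxonomies(open("out"), taxonomy_of)` would read the file written by the command above.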

It turns out that what was called B. anthracis looks more like B. cereus. However, one should consider that a few species are so similar to B. cereus that we have the B. cereus group. Summing the percentages of all B. cereus group members, I got 53% B. cereus group.

Then there is the number of sequences for each species in the database. If a taxonomy is enriched (exactly as he mentioned), the output tends to drift towards that assignment when ranking. That is partly happening in this example, because I have 431 B. cereus versus 73 B. anthracis seqs in the db. I say partly because 36.7% of 881 = 323 and 5.9% = 52, but 323 is 75% of 431 while 52 is 71% of 73. I actually see a trend here, where the corresponding percentages for B. cereus, B. thuringiensis, B. anthracis and B. mycoides are 75-73-71-68%. Well, in the end I would consider this sequence as “B. cereus group” and not B. thuringiensis.

I also noticed that the top hits in the blast6out output are reported in order of their positions in the database. That makes sense, as vsearch finds an alignment and outputs it. However, for a sequence similar to more than one species (but not ultra-similar like the B. cereus group), that means the first one present in the db will likely be assigned. Because of that, a classification system that considers what is related to the top hit fails, at least in my case.

So, what I am doing after ranking is detecting the groups of species I know are closely related and putting them together, like the B. cereus group mentioned above. But if a member of the group (excluding B. cereus) is highly ranked, e.g. 60%+, I then accept the taxonomy.

Another serious problem is that many samples sequenced from an isolated bacterium whose species is known to us still return dubious taxonomies, like 50-50% taxA-taxB, with taxA being the correct one, and for that I just don’t know what to do.
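The arithmetic and the grouping rule above can be written down as a short Python sketch. It uses the total of 881 hits that appears in the calculation (the hit count is given as 811 earlier in this comment). The function names and the fallback to the top-ranked taxonomy are hypothetical choices for illustration, not part of any tool.

```python
def recovery(percent_of_hits, total_hits, db_count):
    """Convert a share of hits into a hit count and the fraction of that
    species' database sequences that were hit."""
    hits = round(percent_of_hits / 100.0 * total_hits)
    return hits, hits / db_count

def assign(percentages, group, group_name, core_species, accept=60.0):
    """Accept a non-core group member outright if it reaches `accept` percent;
    otherwise report the group when the top-ranked species belongs to it."""
    for species, pct in percentages.items():
        if species in group and species != core_species and pct >= accept:
            return species
    top = max(percentages, key=percentages.get)
    return group_name if top in group else top

cereus_hits, cereus_frac = recovery(36.7, 881, 431)       # 323 hits, about 75%
anthracis_hits, anthracis_frac = recovery(5.9, 881, 73)   # 52 hits, about 71%
```

Normalising by database count in this way shows that both species are recovered at a similar rate, which is why the raw ranking mostly reflects how many sequences each species has in the database.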

torognes commented 4 months ago

I have made several improvements to the sintax command in vsearch 2.28.1, just released. Please see issue #535 or the release notes for details.