ohnosequences / mg7

Configurable and scalable 16S metagenomics data analysis
https://goo.gl/y3rZFD
GNU Affero General Public License v3.0

Assignment to a set of taxa #68

Closed laughedelic closed 8 years ago

laughedelic commented 8 years ago

This depends on changing the reference databases (not picking the first mapping, but keeping them all)

laughedelic commented 8 years ago

This is done and should work both with single- and multiple-valued DBs 👌

laughedelic commented 8 years ago

One question here: now that one BLAST hit may be mapped to several taxa, how do we choose the BBH among these multiple assignments? The first one? @eparejatobes @marina-manrique @rtobes

rtobes commented 8 years ago

The most specific one (rank-based). The problem is that strains, which are the maximum level of specificity, are classified in the taxonomic tree as no rank. In that case you can tell that it is a strain if its ancestor is a taxon with species rank. So you need to base the specificity on:

laughedelic commented 8 years ago

@rtobes and what if none of these taxa has any of these ranks? At the moment there is no predefined order of ranks in Bio4j/MG7.

rtobes commented 8 years ago

It is very simple. You can find taxa of any rank in the set of assignments for an RNA sequence. The order of the ranks is this:

1. superkingdom
2. kingdom
3. superphylum
4. phylum
5. subphylum
6. class
7. subclass
8. order
9. suborder
10. family
11. subfamily
12. tribe
13. subtribe
14. genus
15. subgenus
16. species group
17. species subgroup
18. species
19. subspecies

and no rank can be at any level.
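As an illustration only, the rank order above could be turned into a lookup table and used to pick the most specific ranked taxon from a set of assignments. This is a minimal Python sketch, not the MG7 implementation: the `(name, rank)` tuple representation and the `most_specific` helper are hypothetical, and no rank taxa are simply skipped here because placing them requires the taxonomic tree.

```python
# Rank order as listed in this comment (1 = least specific, 19 = most specific).
RANKS = [
    "superkingdom", "kingdom", "superphylum", "phylum", "subphylum",
    "class", "subclass", "order", "suborder", "family", "subfamily",
    "tribe", "subtribe", "genus", "subgenus", "species group",
    "species subgroup", "species", "subspecies",
]
RANK_ORDER = {rank: i for i, rank in enumerate(RANKS, start=1)}


def most_specific(taxa):
    """Return the (name, rank) pair with the highest rank order.

    `taxa` is a list of (name, rank) tuples; this shape is an assumption
    made for the example. Taxa whose rank is not in RANK_ORDER (such as
    "no rank") are ignored. Returns None if no taxon has a known rank.
    """
    ranked = [t for t in taxa if t[1] in RANK_ORDER]
    if not ranked:
        return None
    return max(ranked, key=lambda t: RANK_ORDER[t[1]])


assignments = [
    ("Bacteria", "superkingdom"),
    ("Escherichia", "genus"),
    ("Escherichia coli", "species"),
]
best = most_specific(assignments)
```

If several taxa share the highest rank, `max` returns the first one it encounters, so a tie still needs a separate rule.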

eparejatobes commented 8 years ago

Why not just take the LCA?

laughedelic commented 8 years ago

@rtobes thanks, now it's clear. But I agree with @eparejatobes about the LCA, because there may be several specific taxa in the list, so which one would we choose?

rtobes commented 8 years ago

Because I don't trust that all of the taxonomic assignments of each reference sequence are sufficiently specific. If we take the LCA, we will probably reduce the specificity of the assignments for the reference sequences. I agree with taking the LCA, but only after an internal refinement of the assignments based on consistency.

We could run a test to find out whether the rank order value of the assignments for query sequences drops a lot when using the LCA for reference sequences with multiple assignments. The average of the rank order values over the set of assignments of each reference sequence could serve as a global value proportional to the specificity of the assignment before calculating the LCA. For no rank taxa we would have to assign order_value = [rank of the closest ancestor] + 1. Comparing this global value for each reference sequence before and after the LCA would give us an idea of how much specificity is lost by using the LCA for reference sequences with multiple assignments.
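The comparison proposed here can be sketched roughly as follows. This is a toy Python example under stated assumptions, not MG7 or Bio4j code: the `TREE` dictionary, the taxon names in it, and the helpers `order_value`, `lineage`, `lca` and `specificity_loss` are all hypothetical. It also assumes that "closest ancestor" means the closest ancestor with a known rank, and that a taxon with no ranked ancestor gets the value 0.

```python
RANKS = [
    "superkingdom", "kingdom", "superphylum", "phylum", "subphylum",
    "class", "subclass", "order", "suborder", "family", "subfamily",
    "tribe", "subtribe", "genus", "subgenus", "species group",
    "species subgroup", "species", "subspecies",
]
RANK_ORDER = {rank: i for i, rank in enumerate(RANKS, start=1)}

# Toy taxonomy (hypothetical): taxon -> (parent, rank).
TREE = {
    "root": (None, "no rank"),
    "Bacteria": ("root", "superkingdom"),
    "Escherichia": ("Bacteria", "genus"),
    "E. coli": ("Escherichia", "species"),
    "E. coli K-12": ("E. coli", "no rank"),
    "E. albertii": ("Escherichia", "species"),
}


def order_value(node, tree):
    """Rank order of a taxon; for no rank, closest ranked ancestor + 1."""
    parent, rank = tree[node]
    if rank in RANK_ORDER:
        return RANK_ORDER[rank]
    while parent is not None:
        node = parent
        parent, rank = tree[node]
        if rank in RANK_ORDER:
            return RANK_ORDER[rank] + 1
    return 0  # assumption: no ranked ancestor at all (e.g. the root)


def lineage(node, tree):
    """Path from the taxon up to the root, starting with the taxon itself."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node][0]
    return path


def lca(nodes, tree):
    """Lowest common ancestor of a non-empty list of taxa."""
    others = [set(lineage(n, tree)) for n in nodes[1:]]
    for candidate in lineage(nodes[0], tree):
        if all(candidate in s for s in others):
            return candidate
    return None


def specificity_loss(nodes, tree):
    """Average order value of the assignments minus that of their LCA."""
    before = sum(order_value(n, tree) for n in nodes) / len(nodes)
    after = order_value(lca(nodes, tree), tree)
    return before - after


loss = specificity_loss(["E. coli K-12", "E. albertii"], TREE)
```

In this toy tree the two assignments have order values 19 (a strain under a species) and 18, their LCA is the genus with value 14, so the loss is 4.5 rank steps. Running the same computation over real reference sequences with multiple assignments would give the before and after figures described above.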

laughedelic commented 8 years ago

So I guess I'm going to implement this ranking.

laughedelic commented 8 years ago

This is done. I'm going to merge it soon, but it needs testing, as the assignment code was significantly refactored. We could test it on some of the BLAST results that we already have.