peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

Store deeper taxonomy? #597

Open peterjc opened 4 months ago

peterjc commented 4 months ago

If we want to store a deeper taxonomy, do we store a flexible tree (e.g. importing some or all of the NCBI taxonomy), or do we store a restricted hierarchy? If restricted, which levels?
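To make the two options concrete, here is a minimal sketch, not a proposal for the actual schema. The `Node`, `TREE`, `lineage` and `to_fixed` names are hypothetical, the node IDs are arbitrary (not NCBI taxids), and the ranks shown are illustrative and may not match the current NCBI dump.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """One entry in a flexible tree: any rank label, any depth."""
    name: str
    rank: str  # free text, e.g. "clade", "phylum", "genus"
    parent: Optional[int]  # None marks the root of this toy tree


# Option 1: flexible tree stored as parent pointers (illustrative data only).
TREE = {
    1: Node("Sar", "clade", None),
    2: Node("Stramenopiles", "clade", 1),
    3: Node("Oomycota", "phylum", 2),
    4: Node("Peronosporales", "order", 3),
    5: Node("Peronosporaceae", "family", 4),
    6: Node("Phytophthora", "genus", 5),
}


def lineage(node_id):
    """Walk parent pointers up to the root; return the path root-first."""
    path = []
    while node_id is not None:
        node = TREE[node_id]
        path.append(node)
        node_id = node.parent
    return path[::-1]


# Option 2: restricted hierarchy, one slot per fixed rank.
FIXED_RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")


def to_fixed(node_id):
    """Project a tree lineage onto the fixed ranks (missing ranks become None).

    Ranks outside FIXED_RANKS, such as the two "clade" entries, are dropped.
    """
    by_rank = {node.rank: node.name for node in lineage(node_id)}
    return tuple(by_rank.get(rank) for rank in FIXED_RANKS)


print(to_fixed(6))
```

The projection is lossy in both directions: the clade entries have nowhere to go, and the kingdom and class slots stay empty for this lineage.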

USEARCH, as per https://drive5.com/usearch/manual/tax_annot.html, stores up to 8 levels: domain, kingdom, phylum, class, order, family, genus and species.
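As a rough illustration of what a restricted hierarchy in this style looks like in practice, the sketch below parses a `tax=` field with single-letter rank codes from a sequence label. The function name `parse_usearch_tax` is hypothetical, the example label is illustrative, and quoted names containing commas are deliberately not handled.

```python
# Single-letter rank codes assumed for a USEARCH-style tax= annotation.
USEARCH_LEVELS = {
    "d": "domain",
    "k": "kingdom",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_usearch_tax(label):
    """Return {rank: name} from a label with a ';tax=...;' field, else {}."""
    for field in label.split(";"):
        if not field.startswith("tax="):
            continue
        ranks = {}
        for entry in field[len("tax="):].split(","):
            if not entry:
                continue  # tolerate an empty annotation
            code, _, name = entry.partition(":")
            ranks[USEARCH_LEVELS[code]] = name
        return ranks
    return {}


label = "AB008314;tax=d:Bacteria,p:Firmicutes,c:Bacilli,g:Streptococcus;"
print(parse_usearch_tax(label))
```

Levels can be omitted freely, so each record carries anywhere from zero to eight named ranks.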

Based on https://pythonhosted.org/OBITools/attributes.html and https://pythonhosted.org/OBITools/scripts/ecodbtaxstat.html, OBITools focused on the NCBI taxid, but recorded multiple fields in its output, including order, family, genus and species.

DADA2/RDP looks flexible, e.g. https://benjjneb.github.io/dada2/assign.html has a training set where the levels are Kingdom, Phylum, Class, Order, Family, Genus - while according to https://benjjneb.github.io/dada2/training.html PR2 used "Kingdom", "Supergroup", "Division", "Class", "Order", "Family", "Genus", "Species".

Example from NCBI taxonomy (note two high level clade entries, and no class level):

Some references treat Oomycetes (aka Oomycota) as a class-level taxon. Their placement has been somewhat fluid in recent decades.

Another example, the potato psyllid, Bactericera cockerelli

There is no domain level in the NCBI taxonomy, but it has all the rest of the USEARCH ranks and plenty more.

Storing a fixed set of levels will inevitably be an imperfect, one-size-fits-all best effort, but it makes certain operations much easier. We need some strong use cases to decide (e.g. do we want to attempt LCA resolution, or do we want to offer classification rules with cascading matches for different ranks?)
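As an example of an operation that gets easier with fixed levels, here is a minimal sketch of lowest common ancestor (LCA) resolution over fixed-rank tuples. The `RANKS` tuple and the `lca` function are hypothetical, the lineages are illustrative, and treating a missing rank as the stopping point is a simplifying assumption.

```python
# Hypothetical fixed ranks, ordered from shallowest to deepest.
RANKS = ("order", "family", "genus", "species")


def lca(lineages):
    """Return the deepest (rank, name) shared by all lineages, or None.

    Each lineage is a tuple aligned with RANKS; None marks a missing rank.
    """
    shared = None
    for index, rank in enumerate(RANKS):
        names = {lineage[index] for lineage in lineages}
        if len(names) != 1 or None in names:
            break  # disagreement, missing rank, or no lineages at all
        shared = (rank, names.pop())
    return shared


hits = [
    ("Peronosporales", "Peronosporaceae", "Phytophthora", "Phytophthora infestans"),
    ("Peronosporales", "Peronosporaceae", "Phytophthora", "Phytophthora ramorum"),
]
print(lca(hits))  # ('genus', 'Phytophthora')
```

With aligned tuples the LCA is a single pass over the columns. With a flexible tree, the same query needs a walk up parent pointers from every match.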