Store deeper taxonomy?

If we want to store a deeper taxonomy, do we store a flexible tree (e.g. importing some or all of the NCBI taxonomy), or do we store a restricted hierarchy? If restricted, which levels?

USEARCH, as per https://drive5.com/usearch/manual/tax_annot.html, stores up to 8 levels:

k Kingdom
d Domain
p Phylum
c Class
o Order
f Family
g Genus
s Species
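
As a rough illustration of what a restricted hierarchy looks like in practice, here is a minimal sketch of parsing a USEARCH-style `tax=` annotation into named ranks. The function name `parse_usearch_tax` and the example label are assumptions for illustration only; quoting and other edge cases described in the USEARCH manual are not handled.

```python
# Rank letters as listed above for USEARCH taxonomy annotations.
USEARCH_RANKS = {
    "k": "kingdom", "d": "domain", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus", "s": "species",
}


def parse_usearch_tax(label):
    """Return {rank name: taxon name} from a FASTA label with a tax= field.

    Hypothetical helper, not part of USEARCH or thapbi-pict.
    """
    for field in label.split(";"):
        if field.startswith("tax="):
            break
    else:
        return {}  # no taxonomy annotation present
    ranks = {}
    for entry in field[4:].split(","):
        letter, _, name = entry.partition(":")
        if letter in USEARCH_RANKS and name:
            ranks[USEARCH_RANKS[letter]] = name
    return ranks


# Illustrative label in the style of the USEARCH manual (no species given).
example = (
    "AB008314;tax=d:Bacteria,p:Firmicutes,c:Bacilli,"
    "o:Lactobacillales,f:Streptococcaceae,g:Streptococcus;"
)
print(parse_usearch_tax(example))
```

Note that any level may be absent, so even a fixed set of ranks has to tolerate gaps.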

Based on https://pythonhosted.org/OBITools/attributes.html and https://pythonhosted.org/OBITools/scripts/ecodbtaxstat.html, OBITools focuses on the NCBI taxid, but records multiple fields including order, family, genus and species in its output.

DADA2/RDP looks flexible, e.g. https://benjjneb.github.io/dada2/assign.html has a training set where the levels are Kingdom, Phylum, Class, Order, Family, Genus, while according to https://benjjneb.github.io/dada2/training.html PR2 uses "Kingdom", "Supergroup", "Division", "Class", "Order", "Family", "Genus", "Species".

Example from NCBI taxonomy (note two high level clade entries, and no class level):

Superkingdom: Eukaryota txid2759
Clade: Sar txid2698737
Clade: Stramenopiles txid33634
Phylum: Oomycota txid4762
Order: Peronosporales txid4776
Family: Peronosporaceae txid4777
Genus: Phytophthora txid4783
Species: Phytophthora infestans txid4787

Some references treat Oomycetes (aka Oomycota) as class level. Their placement has been somewhat fluid in recent decades.

Another example, the potato psyllid, Bactericera cockerelli:

cellular organisms
superkingdom: Eukaryota txid2759
clade: Opisthokonta txid33154
kingdom: Metazoa txid33208
clade: Eumetazoa txid6072
clade: Bilateria txid33213
clade: Protostomia txid33317
clade: Ecdysozoa txid1206794
clade: Panarthropoda txid88770
Phylum: Arthropoda txid6656
clade: Mandibulata txid197563
clade: Pancrustacea txid197562
subphylum: Hexapoda txid6960
class: Insecta txid50557
clade: Dicondylia txid85512
subclass: Pterygota txid7496
infraclass: Neoptera txid33340
cohort: Paraneoptera txid33342
Order: Hemiptera txid7524
suborder: Sternorrhyncha txid33373
superfamily: Psylloidea txid33375
Family: Triozidae txid33375
Genus: Bactericera txid290154
Species: Bactericera cockerelli txid290155

There is no domain level in the NCBI taxonomy, but it has all the rest of the USEARCH ranks and plenty more.
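
To make the mismatch concrete, here is a minimal sketch of projecting an NCBI-style lineage onto a fixed set of ranks, using the Phytophthora infestans lineage listed earlier. The names `FIXED_RANKS` and `project_lineage` are hypothetical, and the choice of ranks simply mirrors the USEARCH list; whether superkingdom should be mapped to domain is left open here.

```python
# One possible restricted hierarchy (assumption: the eight USEARCH ranks).
FIXED_RANKS = (
    "domain", "kingdom", "phylum", "class",
    "order", "family", "genus", "species",
)

# (rank, name, taxid) tuples copied from the Phytophthora infestans example.
LINEAGE = [
    ("superkingdom", "Eukaryota", 2759),
    ("clade", "Sar", 2698737),
    ("clade", "Stramenopiles", 33634),
    ("phylum", "Oomycota", 4762),
    ("order", "Peronosporales", 4776),
    ("family", "Peronosporaceae", 4777),
    ("genus", "Phytophthora", 4783),
    ("species", "Phytophthora infestans", 4787),
]


def project_lineage(lineage, ranks=FIXED_RANKS):
    """Return one (name, taxid) or None per fixed rank, dropping other ranks."""
    by_rank = {}
    for rank, name, taxid in lineage:
        if rank in ranks:
            by_rank[rank] = (name, taxid)
    return [by_rank.get(rank) for rank in ranks]


for rank, value in zip(FIXED_RANKS, project_lineage(LINEAGE)):
    print(rank, value)
```

In this sketch the superkingdom and both clade entries are silently dropped, and domain, kingdom and class come out empty, which is the kind of information loss a restricted hierarchy would have to accept.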

Storing a fixed set of levels will inevitably be an imperfect, one-size-fits-all best effort, but it makes certain operations much easier. We need some strong use cases to decide (e.g. do we want to attempt LCA resolution, or do we want to offer classification rules with cascading matches for different ranks?)
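
As one example of an operation that becomes easy with fixed levels, here is a minimal sketch of lowest common ancestor (LCA) resolution over rows of fixed ranks. The function `lowest_common_rank`, the reduced rank list, and the second row (Phytophthora ramorum, added purely for illustration) are assumptions, not anything thapbi-pict currently does.

```python
RANKS = ("phylum", "class", "order", "family", "genus", "species")

# One row per candidate match; None marks a rank missing from the lineage.
ROWS = [
    ("Oomycota", None, "Peronosporales", "Peronosporaceae",
     "Phytophthora", "Phytophthora infestans"),
    ("Oomycota", None, "Peronosporales", "Peronosporaceae",
     "Phytophthora", "Phytophthora ramorum"),  # illustrative second match
]


def lowest_common_rank(rows, ranks):
    """Return (rank, name) for the deepest rank shared by all rows, or None.

    Ranks missing from every row are skipped; the walk stops at the first
    rank where the rows disagree.
    """
    best = None
    for i, rank in enumerate(ranks):
        names = {row[i] for row in rows}
        if names == {None}:
            continue  # rank absent everywhere, e.g. no class level
        if len(names) != 1:
            break  # first disagreement ends the shared lineage
        best = (rank, names.pop())
    return best


print(lowest_common_rank(ROWS, RANKS))
```

With a flexible tree the same question needs parent pointers and a walk up from each taxid, which is more general but also more code and more stored data.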

peterjc / thapbi-pict

Store deeper taxonomy? #597