ropensci / taxize

A taxonomic toolbelt for R
https://docs.ropensci.org/taxize

Order of ranks not consistent with NCBI data #875

Closed: transue closed this issue 3 years ago

transue commented 3 years ago

In a recent download of taxonomy data from NCBI, I checked the relative rankings of ALL species and found that two ranks listed in ./inst/ignore/rank_ref_script.R seem out of place. Details follow...

Using the rank order given in https://github.com/ropensci/taxize/issues/835, which appears to be derived from ./inst/ignore/rank_ref_script.R, I found two ranks that appear out of order when compared with the NCBI data from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz:

1. _forma specialis_ should be MORE specific than _forma_
2. isolate should be MORE specific than strain (it was the most specific rank, in fact)

While I do not consider NCBI's ranking to be the definitive authority, I also don't know the source of the rank order used here. I imagine it is documented somewhere, but I have not researched it. I propose a couple of options to whoever considers this important:

- Confirm that the ranks in taxize are correct and contact NCBI so that they can update their ranking system
- Confirm with an authoritative source and correct taxize
- Place a note that a discrepancy was noticed and that users may experience issues if they assume that ranks follow one of the two conventions.

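For anyone who wants to repeat this kind of check, here is a minimal Python sketch of one way to tally which rank sits above which in `nodes.dmp` from the taxdump archive. It is not the script used for the report above. It assumes the usual `nodes.dmp` layout (fields separated by `|`, with tax ID, parent tax ID and rank as the first three fields); the function names and the toy `SAMPLE` rows are made up for illustration.

```python
from collections import Counter


def parse_nodes(lines):
    """Map tax_id -> (parent_tax_id, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Fields are separated by "|" and padded with tabs.
        fields = [f.strip() for f in line.split("|")]
        tax_id, parent_id, rank = int(fields[0]), int(fields[1]), fields[2]
        nodes[tax_id] = (parent_id, rank)
    return nodes


def ancestor_ranks(nodes, tax_id):
    """Ranks of all ancestors of tax_id, nearest first."""
    ranks = []
    seen = {tax_id}  # guards against the root pointing at itself
    parent_id = nodes[tax_id][0]
    while parent_id in nodes and parent_id not in seen:
        seen.add(parent_id)
        ranks.append(nodes[parent_id][1])
        parent_id = nodes[parent_id][0]
    return ranks


def rank_pair_counts(nodes):
    """Count (ancestor_rank, descendant_rank) pairs over every taxon."""
    counts = Counter()
    for tax_id, (_, rank) in nodes.items():
        for ancestor_rank in set(ancestor_ranks(nodes, tax_id)):
            counts[(ancestor_rank, rank)] += 1
    return counts


# Toy rows with invented IDs, NOT real NCBI records.
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tspecies\t|\n",
    "11\t|\t10\t|\tstrain\t|\n",
    "12\t|\t11\t|\tisolate\t|\n",
]

counts = rank_pair_counts(parse_nodes(SAMPLE))
# How often is strain above isolate, and how often the reverse?
print(counts[("strain", "isolate")], counts[("isolate", "strain")])  # 1 0
```

On the real file, `parse_nodes(open("nodes.dmp"))` should work the same way. If one ordering of a pair has a nonzero count and the reverse has zero, the data supports that ordering; if both are zero, the data says nothing about the pair.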
transue commented 3 years ago

I am a contractor at the US EPA and can be contacted using my EPA email address: transue.tom@epa.gov . Thank you for providing this code base! --Tom

trvinh commented 3 years ago

Hi @transue

I agree with you that isolate is more specific than strain, but how does it relate to morph, pathogroup, genotype and subvariety, such that it is the most specific rank, as you mentioned? I found no connection between morph, pathogroup, genotype and subvariety and any rank other than species.

The same goes for forma and formaspecialis: they are both parents of strain and children of varietas. I found no species that has both forma and formaspecialis in its taxonomy hierarchy, which would be needed to determine which one is more specific than the other. Or did I miss something? Interestingly, I found 3 taxa that have formaspecialis twice, directly next to each other, in their hierarchy strings (taxID 1142511, 299864 and 299865).
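A check like this can be sketched roughly as follows in Python. This is only an illustration, not the code behind the numbers above: the helper names are hypothetical, the `nodes` dictionary (tax ID mapped to parent tax ID and rank) holds invented toy entries, and the rank strings are assumed to match whatever spelling the parsed data uses.

```python
def lineage_ranks(nodes, tax_id):
    """Ranks from tax_id up to the root, the taxon itself first."""
    ranks = []
    seen = set()  # stops at the root, whose parent is itself
    while tax_id in nodes and tax_id not in seen:
        seen.add(tax_id)
        parent_id, rank = nodes[tax_id]
        ranks.append(rank)
        tax_id = parent_id
    return ranks


def taxa_with_both(nodes, rank_a, rank_b):
    """Taxa whose lineage (including the taxon) contains both ranks."""
    wanted = {rank_a, rank_b}
    return sorted(t for t in nodes if wanted <= set(lineage_ranks(nodes, t)))


def taxa_with_adjacent_repeat(nodes, rank):
    """Taxa of the given rank whose direct parent has the same rank."""
    return sorted(
        t
        for t, (parent_id, r) in nodes.items()
        if r == rank
        and parent_id != t
        and parent_id in nodes
        and nodes[parent_id][1] == rank
    )


# Toy hierarchy with invented IDs, NOT real NCBI records.
nodes = {
    1: (1, "no rank"),
    2: (1, "species"),
    3: (2, "varietas"),
    4: (3, "forma specialis"),
    5: (4, "forma specialis"),
    6: (4, "strain"),
    7: (3, "forma"),
    8: (7, "strain"),
}

print(taxa_with_both(nodes, "forma", "forma specialis"))    # []
print(taxa_with_adjacent_repeat(nodes, "forma specialis"))  # [5]
```

An empty result from `taxa_with_both` means no lineage contains both ranks, so the data cannot say which of the two is more specific.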

I am attaching the graph of all available taxonomy ranks that I created from the NCBI taxonomy DB file (taxdump.tar.gz, downloaded on May 14th). Could you please correct me if I am wrong? Many thanks!

[image: graph of NCBI taxonomy ranks]

sckott commented 3 years ago

is this issue resolved @trvinh ?

trvinh commented 3 years ago

@sckott with the PR #876, yes.

sckott commented 3 years ago

great, thx