viralemergence / virion

The Global Virome in One Network
https://viralemergence.github.io/virion

NCBI higher taxonomy has inconsistent host class designations #30

Closed eveskew closed 3 years ago

eveskew commented 3 years ago

The NCBI higher-level taxonomy for hosts is inconsistent, so the HostClass field contains values that don't form a sound classification system. For example, "lepidosauria" and "reptilia" both appear as distinct values, yet the former is usually considered a clade nested within the latter. As a result, some good species (e.g., "python regius") can appear in the Virion data under multiple taxonomic classes.

> library(tidyverse)
> library(vroom)
> 
> v <- vroom("Virion/Virion.csv.gz")
> 
> table(v$HostClass)

   actinopteri       amphibia           aves 
         18839           1183         278568 
chondrichthyes      cladistia    hyperoartia 
           260             37             81 
  lepidosauria       mammalia         myxini 
          1663        2604101              9 
      reptilia 
           105 
> 
> v %>%
+   filter(Host == "python regius") %>%
+   pull(HostClass) %>%
+   unique()
[1] "reptilia"     "lepidosauria"

Similar issues might affect other taxonomic fields?
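
A rough way to check this across several fields at once, sketched on a toy data frame rather than the real file. Only `Host` and `HostClass` are column names taken from the output above; `HostOrder` and the `v_toy` and `conflicts` objects are assumptions for illustration and would need to be swapped for whatever higher-taxonomy columns Virion actually has.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the Virion table. Host and HostClass come from the issue;
# HostOrder is an assumed column name used only for illustration.
v_toy <- tibble(
  Host      = c("python regius", "python regius", "gallus gallus", "gallus gallus"),
  HostClass = c("reptilia", "lepidosauria", "aves", "aves"),
  HostOrder = c("squamata", "squamata", "galliformes", "galliformes")
)

# For each host and each higher-taxonomy field, count the distinct values.
# More than one distinct value means the host is classified inconsistently.
conflicts <- v_toy %>%
  pivot_longer(c(HostClass, HostOrder), names_to = "field", values_to = "value") %>%
  distinct(Host, field, value) %>%
  group_by(Host, field) %>%
  summarise(
    n_values = n(),
    values   = paste(sort(value), collapse = ", "),
    .groups  = "drop"
  ) %>%
  filter(n_values > 1)

conflicts
```

On the toy data this should flag only "python regius" in `HostClass`. Against the full table, the same pipeline could be pointed at `v` with the real column names; note that `NA` would be counted as its own value unless it is filtered out first.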

cjcarlson commented 3 years ago

reptilia is gone in the latest push!