viralemergence / virion

The Global Virome in One Network
https://viralemergence.github.io/virion

Some species have conflicting higher taxonomies #39

Closed by maxfarrell 3 years ago

maxfarrell commented 3 years ago

When attempting to make taxonomic trees, I noticed that some species have conflicting higher taxonomies. The following example shows the cases where Host is not NA:

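library(dplyr)
library(vroom)

# Load the full VIRION table
virion <- vroom("Virion/Virion.csv.gz")

# One row per distinct combination of host name and higher taxonomy
hosttax <- virion %>%
  select(HostClass, HostOrder, HostFamily, HostGenus, Host) %>%
  distinct()

# A host name that appears more than once has conflicting higher taxonomy
sum(duplicated(hosttax$Host)) # 443 duplicated (count includes repeated NA hosts)
dups <- hosttax[duplicated(hosttax$Host), ]
dups[!is.na(dups$Host), ]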

This came up with four cases:

# filter() is used below because it drops rows where Host is NA

# labroides dimidiatus
filter(hosttax, Host == "labroides dimidiatus") # this is the actinopterygii case

# leontocebus nigricollis
filter(hosttax, Host == "leontocebus nigricollis") # callitrichidae vs cebidae in family

# rupornis magnirostris
filter(hosttax, Host == "rupornis magnirostris") # one has NA for genus

# marmosets (lol) -> has NA for Family and marmosets for genus as well

cjcarlson commented 3 years ago

So, we dealt with labroides on another post - that's a CLOVERT/NCBITaxonomy.jl issue.

saguinus nigricollis is the synonym for leontocebus that pulled up a different taxonomy in CLOVER - and it doesn't have an NCBI match, so I wonder if it was manually curated by Rory? Or is it a product of findSyns?

rupornis is also a synonym issue. The other match, from CLOVER, is buteo magnirostris, but it doesn't have a host genus because, well, idk - probably outdated CLOVER code again. I think we're seeing a pattern.

cjcarlson commented 3 years ago

marmosets is its own issue - let me create it.

cjcarlson commented 3 years ago

https://github.com/viralemergence/virion/issues/40

cjcarlson commented 3 years ago

The rest of these are for Rory (so Rory - don't worry about marmosets, just the other two + the one documented on another post), so I'm going to call it a CLOVERT bug and leave it to him. I think these are basically two special cases where findSyns and/or manual curation had a weird outcome.

rorygibb commented 3 years ago

Oh this is strange - it shouldn't be findSyns as I removed that from the pipeline entirely. Might be an issue of some older manual curation - I'll look into this now

rorygibb commented 3 years ago

@cjcarlson Fixed these and pushed a CLOVER update to the repo, so if you re-run the CLOVER integration these should go away in VIRION.

The problem was a few inconsistencies between the manual higher taxonomy and the automated higher taxonomy from hdict() - mainly caused by variable spellings in Host_Original in the source datasets. All sorted now for these three, but there could perhaps be more - I will keep an eye out.
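
For keeping an eye out, a check along these lines could be re-run after each integration. This is only a minimal sketch in base R: the helper name find_taxonomy_conflicts is hypothetical (it is not part of CLOVER or VIRION), and the toy data frame stands in for the real host taxonomy columns, with values that are illustrative only.

```r
# Toy stand-in for the host taxonomy columns (illustrative values, not real VIRION rows)
hosttax <- data.frame(
  HostFamily = c("callitrichidae", "cebidae", "accipitridae", "accipitridae", "labridae"),
  HostGenus  = c("leontocebus", "leontocebus", "rupornis", NA, "labroides"),
  Host       = c("leontocebus nigricollis", "leontocebus nigricollis",
                 "rupornis magnirostris", "rupornis magnirostris",
                 "labroides dimidiatus"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every row of any host that maps to more than
# one distinct higher-taxonomy combination. Rows with an NA host are ignored.
find_taxonomy_conflicts <- function(tax, key = "Host") {
  tax <- unique(tax[!is.na(tax[[key]]), , drop = FALSE])
  n_per_host <- table(tax[[key]])
  conflicted <- names(n_per_host)[n_per_host > 1]
  out <- tax[tax[[key]] %in% conflicted, , drop = FALSE]
  out[order(out[[key]]), , drop = FALSE]
}

conflicts <- find_taxonomy_conflicts(hosttax)
print(conflicts)
```

Unlike duplicated(), which flags only the second and later occurrence of a host, this returns all of the conflicting rows side by side, so the differing column is easier to spot. An empty result would suggest no remaining conflicts among named hosts.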

cjcarlson commented 3 years ago

Nice work!