viralemergence / virion

The Global Virome in One Network
https://viralemergence.github.io/virion

VirusOriginal field is missing for a small number of GenBank records #43

Closed lbrierley closed 2 years ago

lbrierley commented 3 years ago

The documentation is clear that some virus fields can reasonably be expected to be absent, but the VirusOriginal field is missing for 17 GenBank records. From what I can gather, this field should reflect the organism field of the original GenBank entry, and these records do have a value for that field.

virion %>% filter(is.na(VirusOriginal))
virion %>% filter(NCBIAccession == "MT293615")

The example entry above has "Gemycircularvirus sp." in its organism field: https://www.ncbi.nlm.nih.gov/nuccore/MT293615

cjcarlson commented 3 years ago

(I Think You Should Leave "that's a chunky" voice) that's a Colin

cjcarlson commented 3 years ago

This is a weird one. I have a suspicion that this might have something to do with the difference between what's up there now versus what was up there in January when we pulled this down. I can confirm that rentrez pulls this down correctly, but the source data doesn't have anything in the organism field. If you want to check you can take a look at:

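library(dplyr)  # provides %>% and re-exports as_tibble()

# Read the raw sequence metadata that the GenBank records are built from
gb <- data.table::fread("Source/sequences.csv") %>%
  as_tibble()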

My suspicion is that these may have been uploaded without the correct information and had it corrected over time, maybe by the tax team at NCBI? So I'm going to leave this open at least for a tiny bit, but I don't think it's necessarily a bug.
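If it helps anyone checking the source file, here is a minimal sketch of how blank values could be spotted. It assumes blanks arrive as empty strings rather than NA, in which case is.na() alone would miss them. The gb_toy table and its second row are hypothetical stand-ins for Source/sequences.csv, not real data.

```r
library(dplyr)

# Toy stand-in for Source/sequences.csv (hypothetical, for illustration only).
# The second row is an invented complete record, included for contrast.
gb_toy <- tibble(
  Accession = c("KP272011.1", "XX000001.1"),
  Species   = c("", "Hypothetical virus"),
  Host      = c("Cricetinae", "Homo sapiens")
)

# is.na() does not treat an empty string as missing, so this finds nothing
n_na <- gb_toy %>% filter(is.na(Species)) %>% nrow()

# Testing for the empty string explicitly does find the blank record
blank <- gb_toy %>% filter(Species == "")

# na_if() turns blanks into real NA values so later is.na() checks catch them
gb_clean <- gb_toy %>% mutate(Species = na_if(Species, ""))
```

Running the same two filters on the real gb table should show whether the blanks are stored as "" or as NA.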

cjcarlson commented 2 years ago

So on revisiting this I have some weird stuff to report.

> virion %>% filter(Database == "GenBank", is.na(VirusOriginal))
# A tibble: 10 x 32
   Host      Virus HostTaxID VirusTaxID HostNCBIResolved VirusNCBIResolv~ ICTVRatified HostGenus HostFamily
   <chr>     <chr>     <dbl>      <dbl> <lgl>            <lgl>            <lgl>        <chr>     <chr>     
 1 NA        NA       337677         NA TRUE             FALSE            FALSE        NA        cricetidae
 2 NA        NA         8507         NA TRUE             FALSE            FALSE        sphenodon sphenodon~
 3 NA        NA         9655         NA TRUE             FALSE            FALSE        NA        mustelidae
 4 canis lu~ NA         9612         NA TRUE             FALSE            FALSE        canis     canidae   
 5 capra hi~ NA         9925         NA TRUE             FALSE            FALSE        capra     bovidae   
 6 homo sap~ NA         9606         NA TRUE             FALSE            FALSE        homo      hominidae 
 7 homo sap~ NA         9606         NA TRUE             FALSE            FALSE        homo      hominidae 
 8 homo sap~ NA         9606         NA TRUE             FALSE            FALSE        homo      hominidae 
 9 pusa sib~ NA         9719         NA TRUE             FALSE            FALSE        pusa      phocidae  
10 ranitome~ NA        85591         NA TRUE             FALSE            FALSE        ranitome~ dendrobat~
# ... with 23 more variables: HostOrder <chr>, HostClass <chr>, HostOriginal <chr>, VirusGenus <chr>,
#   VirusFamily <chr>, VirusOrder <chr>, VirusClass <chr>, VirusOriginal <chr>, HostFlagID <lgl>,
#   DetectionMethod <chr>, DetectionOriginal <chr>, Database <chr>, DatabaseVersion <chr>,
#   PublicationYear <dbl>, ReferenceText <chr>, PMID <dbl>, ReleaseYear <dbl>, ReleaseMonth <dbl>,
#   ReleaseDay <dbl>, CollectionYear <dbl>, CollectionMonth <dbl>, CollectionDay <dbl>,
#   NCBIAccession <chr>

> gb %>% filter(Accession == "KP272011.1")
# A tibble: 1 x 5
  Accession  Release_Date        Species Host       Collection_Date
  <chr>      <dttm>              <chr>   <chr>      <chr>          
1 KP272011.1 2015-05-10 00:00:00 ""      Cricetinae 1976           

But compare that to: https://www.ncbi.nlm.nih.gov/nuccore/KP272011

It's not incorrect relative to the website though; species is "" there as well: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&ids=KP272011

So essentially, people sometimes submit a record with a particular Organism but without a particular Species, and that trickles down into about 10 or so messed-up records. But it's not a bug, and I'm closing it out, because we know it's not any sort of error.
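For anyone who wants to list the affected records, a rough sketch of the cross-check follows. It assumes that NCBIAccession in virion carries no version suffix while Accession in the source file does (as in "KP272011.1"), and that KP272011 is one of the NA records. Both toy tables are hypothetical stand-ins, and the XX000001 row is invented for contrast.

```r
library(dplyr)

# Hypothetical stand-in for the virion table
virion_toy <- tibble(
  NCBIAccession = c("KP272011", "XX000001"),
  VirusOriginal = c(NA_character_, "hypothetical virus")
)

# Hypothetical stand-in for Source/sequences.csv
gb_toy <- tibble(
  Accession = c("KP272011.1", "XX000001.1"),
  Species   = c("", "Hypothetical virus")
)

# Strip the ".1"-style version suffix so the two accession columns line up
gb_keyed <- gb_toy %>%
  mutate(NCBIAccession = sub("\\.[0-9]+$", "", Accession))

# Records with no VirusOriginal whose source row has a blank Species
affected <- virion_toy %>%
  filter(is.na(VirusOriginal)) %>%
  inner_join(gb_keyed, by = "NCBIAccession") %>%
  filter(Species == "")
```

Swapping in the real virion and gb tables should give the list of accessions where the missing VirusOriginal traces back to a blank Species in the source.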