Closed lbrierley closed 2 years ago
(I Think You Should Leave "that's a chunky" voice) that's a Colin
This is a weird one. I have a suspicion that this might have something to do with the difference between what's up there now and what was up there in January when we pulled this down. I can confirm that rentrez pulls this down correctly, but the source data doesn't have anything in the organism field. If you want to check, you can take a look at:
gb <- data.table::fread("Source/sequences.csv") %>%
  as_tibble()
My suspicion is that these may have been uploaded without the correct information and had it corrected over time, maybe by the tax team at NCBI? So I'm going to leave this open at least for a tiny bit, but I don't think it's necessarily a bug.
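For anyone who wants to see how many source records are affected, here is a minimal sketch of the check. It uses a toy stand-in for `gb` (the second accession and both `Host` pairings beyond KP272011.1 are made up for illustration), and it assumes a blank organism shows up in the `Species` column as either an empty string or `NA`.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `gb`; "XX000000.1" is a hypothetical accession.
gb_toy <- tibble(
  Accession = c("KP272011.1", "XX000000.1"),
  Species   = c("", "Hypothetical virus"),
  Host      = c("Cricetinae", "Homo sapiens")
)

# Treat both empty strings and NA as "no species recorded".
blank_species <- gb_toy %>%
  filter(is.na(Species) | trimws(Species) == "")

nrow(blank_species)
```

Swapping `gb_toy` for the real `gb` object should give the count of source rows with no species recorded.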
So on revisiting this I have some weird stuff to report.
> virion %>% filter(Database == "GenBank", is.na(VirusOriginal))
# A tibble: 10 x 32
Host Virus HostTaxID VirusTaxID HostNCBIResolved VirusNCBIResolv~ ICTVRatified HostGenus HostFamily
<chr> <chr> <dbl> <dbl> <lgl> <lgl> <lgl> <chr> <chr>
1 NA NA 337677 NA TRUE FALSE FALSE NA cricetidae
2 NA NA 8507 NA TRUE FALSE FALSE sphenodon sphenodon~
3 NA NA 9655 NA TRUE FALSE FALSE NA mustelidae
4 canis lu~ NA 9612 NA TRUE FALSE FALSE canis canidae
5 capra hi~ NA 9925 NA TRUE FALSE FALSE capra bovidae
6 homo sap~ NA 9606 NA TRUE FALSE FALSE homo hominidae
7 homo sap~ NA 9606 NA TRUE FALSE FALSE homo hominidae
8 homo sap~ NA 9606 NA TRUE FALSE FALSE homo hominidae
9 pusa sib~ NA 9719 NA TRUE FALSE FALSE pusa phocidae
10 ranitome~ NA 85591 NA TRUE FALSE FALSE ranitome~ dendrobat~
# ... with 23 more variables: HostOrder <chr>, HostClass <chr>, HostOriginal <chr>, VirusGenus <chr>,
# VirusFamily <chr>, VirusOrder <chr>, VirusClass <chr>, VirusOriginal <chr>, HostFlagID <lgl>,
# DetectionMethod <chr>, DetectionOriginal <chr>, Database <chr>, DatabaseVersion <chr>,
# PublicationYear <dbl>, ReferenceText <chr>, PMID <dbl>, ReleaseYear <dbl>, ReleaseMonth <dbl>,
# ReleaseDay <dbl>, CollectionYear <dbl>, CollectionMonth <dbl>, CollectionDay <dbl>,
# NCBIAccession <chr>
> gb %>% filter(Accession == "KP272011.1")
# A tibble: 1 x 5
Accession Release_Date Species Host Collection_Date
<chr> <dttm> <chr> <chr> <chr>
1 KP272011.1 2015-05-10 00:00:00 "" Cricetinae 1976
But compare that to: https://www.ncbi.nlm.nih.gov/nuccore/KP272011
It's not incorrect on the website, though. Species is "" there: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&ids=KP272011
So essentially, submitters sometimes fill in Organism without filling in Species, and that trickles down into 10 or so messed-up records. But it's not a bug, and I'm closing it out, because we know it's not any sort of error.
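A rough way to confirm that every record with a missing `VirusOriginal` traces back to a blank `Species` in the source is to join the two tables on accession. This is only a sketch on toy data: `virion_toy` and `gb_toy` are hypothetical stand-ins, "XX000000.1" is a made-up accession, and it assumes `NCBIAccession` holds a single versioned accession that matches `Accession` in the source file exactly.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for `virion` and `gb`; "XX000000.1" is hypothetical.
virion_toy <- tibble(
  Database      = c("GenBank", "GenBank"),
  VirusOriginal = c(NA_character_, "hypothetical virus"),
  NCBIAccession = c("KP272011.1", "XX000000.1")
)
gb_toy <- tibble(
  Accession = c("KP272011.1", "XX000000.1"),
  Species   = c("", "Hypothetical virus")
)

# Take the GenBank rows with no VirusOriginal and look up their source Species.
check <- virion_toy %>%
  filter(Database == "GenBank", is.na(VirusOriginal)) %>%
  left_join(gb_toy, by = c("NCBIAccession" = "Accession")) %>%
  select(NCBIAccession, Species)

# TRUE when every such record has an empty Species in the source.
all_blank <- all(check$Species == "")
```

If `all_blank` comes back `FALSE` on the real tables, the rows in `check` with a non-empty `Species` would be the ones worth a closer look.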
The documentation is clear that some virus fields can reasonably be expected to be absent, but for 17 GenBank records the VirusOriginal field is missing. From what I can gather, this should reflect the organism field of the original GenBank entry, and these instances do have a value for that field. Example entry highlighted above, which has "Gemycircularvirus sp." in the organism field: https://www.ncbi.nlm.nih.gov/nuccore/MT293615