Open cjcarlson opened 2 years ago
@tpoisot we'll need to figure out how to pull down the massive file and then cut it down to Virus/Viruses as a replacement for the 'rglobi' pipeline, a la how we did it with NCBI, without breaking GitHub Actions
No problem. We can make a branch that doesn't deploy anything after the build. If you know which file I need to pull, I can get started on this really easily.
From Jorrit in October:
re: using the R API for GloBI as stated on the GloBI data page at https://globalbioticinteractions.org/data :
"Exploratory, interactive queries can be executed through SPARQL and Cypher (see more examples) endpoints, GloBI Search/Browse pages, or by using the REST-y GloBI Web API. For those that use R, rglobi is available to explore interaction data. rglobi can also be used to execute Cypher queries.
For research or other data intensive project, please use GloBI’s stable versioned integrated data published via doi:10.5281/zenodo.3950589 or, perhaps even better, consider using the original underlying datasets. Please see the process page to better understand how GloBI integrates data so that you can make an informed decision on what data to use for your studies."
Also, if you do choose to continue to use the R API for your research, please note that there's ways to include the source meta-data into your results. See e.g., https://github.com/ropensci/rglobi/blob/0717d317c8048af36168e61750fa79e296f2ff06/R/rglobi.R#L193 . If this does not contain the fields you are looking for, please open a separate issue with a description of the field and one or two specific examples.
I coded all of it in R because that's the only language I know, but I'm also not sure that 'rglobi' actually allows the sourcing to be called using that argument. I'll check now before I step out...
Ok so we can pull the data, remove anything attributed to Virion, cut to viruses, and reindex?
> library(rglobi)
> get_interactions_by_taxa("Variola", "Variola")
source_taxon_external_id source_taxon_name source_taxon_path source_specimen_life_stage interaction_type target_taxon_external_id target_taxon_name
1 NCBI:12870 Variola major virus NA NA pathogenOf GBIF:2389099 Variola
2 NCBI:12870 Variola major virus NA NA pathogenOf GBIF:2389099 Variola
target_taxon_path target_specimen_life_stage latitude longitude study_citation
1 Animalia | Chordata | Actinopterygii | Perciformes | Serranidae | Variola NA NA NA NA
2 Animalia | Chordata | Actinopterygii | Perciformes | Serranidae | Variola NA NA NA NA
study_source_citation
1 NA
2 NA
> get_interactions_by_taxa("Variola", "Variola", showfield = "...")
I could try putting something in the "..." but I don't know how to see what the API fields are
The two queries I use are sourcetaxon = 'Virus' / 'Viruses' matched to targettaxon = 'Vertebrata'
At present, the script '02_3a_Download GLOBI.R' generates the file 'Source/GLOBI-raw.csv' from those queries. If you can do what you described ("Ok so we can pull the data, remove anything attributed to Virion, cut to viruses, and reindex?") and write the result to that file, I think we can just replace the rest!
If you do choose to use rglobi instead of the published data products at https://globalbioticinteractions.org/data , you might be interested in the rglobi vignette fragment mentioned in https://github.com/ropensci/rglobi/issues/38 .
@cjcarlson I think I can pull the raw data and work from there, no problem. I'll use the file in the repo as a template and make a diff; if it's all above board we can merge when you come back.
Sounds great. I think Jorrit is right that we could implement the other solution, but returnobservations = T will return a much, much bigger dataset and I'll need to do some unique operations to it. I hadn't really processed that those were happening automatically already - can you incorporate that on your end as well? Slash, maybe we can figure out how to harmonize the study_source_citation field into the VIRION architecture as a longer-term thing, to improve credit and attribution.
The only GLOBI fields currently retained are
source_taxon_external_id,
source_taxon_name,
target_taxon_external_id,
target_taxon_name
So those are what we turn into unique values and process.
If you can hotfix something that just keeps those four in GLOBI-raw.csv, after subsetting out the VIRION citations, we can then come back in a bit and figure out how to deal with dataset attribution versus primary literature source attribution
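A minimal base R sketch of that hotfix, under stated assumptions: the helper name `clean_globi`, the citation column `study_source_citation` (taken from the rglobi output above), the case-insensitive pattern "virion", and the toy rows are all illustrative and not part of the existing pipeline.

```r
# Hypothetical helper: drop VIRION-attributed rows, keep the four retained
# fields, and de-duplicate. The pattern and column name are assumptions.
clean_globi <- function(globi,
                        citation_col = "study_source_citation",
                        pattern = "virion") {
  keep_cols <- c("source_taxon_external_id", "source_taxon_name",
                 "target_taxon_external_id", "target_taxon_name")
  # grepl() returns FALSE for NA, so rows without a citation are kept
  from_virion <- grepl(pattern, globi[[citation_col]], ignore.case = TRUE)
  out <- unique(globi[!from_virion, keep_cols, drop = FALSE])
  rownames(out) <- NULL
  out
}

# Toy rows, made up for illustration only
toy <- data.frame(
  source_taxon_external_id = c("A:1", "A:1", "A:2"),
  source_taxon_name        = c("virus a", "virus a", "virus b"),
  target_taxon_external_id = c("H:1", "H:1", "H:2"),
  target_taxon_name        = c("host x", "host x", "host y"),
  study_source_citation    = c("Some dataset", NA,
                               "https://example.org/Virion.csv.gz"),
  stringsAsFactors = FALSE
)
cleaned <- clean_globi(toy)

# Writing to the path used by the existing script would then be:
# write.csv(cleaned, "Source/GLOBI-raw.csv", row.names = FALSE)
```

Filtering on the citation has to happen before the citation column is dropped, which is why both steps live in one function.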
The uncleaned citation data are a bit of a mess, but (omitting IDs), this is what I get:
Yellow fever virus|Callithrix penicillata|Wardeh et al. 2015 Sci Data
Yellow fever virus|Carollia perspicillata|Price, J.L. Isolation Of Rio Bravo and a hitherto undescribed agent, Tamana bat virus, from insectivorous bats in Trinidad, with serological evidence of infection in Bats And Man. Am. J. Trop. Med. Hyg. 1978, 27, 153-161.
Yellow fever virus|Carollia perspicillata|https://www.ncbi.nlm.nih.gov/pubmed/204207
Yellow fever virus|Carollia perspicillata|Wardeh et al. 2015 Sci Data
Yellow fever virus|Cercopithecus ascanius|https://www.ncbi.nlm.nih.gov/pubmed/14893439
Yellow fever virus|Cercopithecus ascanius|Wardeh et al. 2015 Sci Data
Yellow fever virus|Cercopithecus diana|Wardeh et al. 2015 Sci Data
The query I used is, for the record:
SELECT DISTINCT
    sourceTaxonIds, sourceTaxonName, targetTaxonIds, targetTaxonName, referenceCitation
FROM
    interactions
WHERE
    sourceTaxonKingdomName IN ('Virus', 'Viruses')
    AND targetTaxonClassName LIKE 'Mammal%'
    AND referenceCitation NOT LIKE '%Virion.csv.gz'
ORDER BY
    sourceTaxonName, targetTaxonName;
This gives us ~45k rows, and the Variola issue is fixed, which is good. The first five rows look like:
EOL_V2:741107 | WD:Q4681774;Adelaide River virus;EOL:10408207 | EOL:328699 | GBIF:2441022 | GBIF:4262590 | INAT_TAXON:74113 | IRMNG:10194972 | ITIS:183838 | ITIS:898719 | WD:Q21401245 | doi:10.5281/zenodo.3916389 | http://taxon-concept.plazi.org/id/Animalia/Bos_taurus_Linnaeus_1758 | http://treatment.plazi.org/id/80F7388E4BC26B8776DCEE1EC2042A83;Bos taurus;http://www.ncbi.nlm.nih.gov/nuccore/600151
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/7337871
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/3722893
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/1340757
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/1574748
The one issue is that there is no way this runs as a GitHub Action, since the unzipped interactions database clocks in at 24 GB. So either we need periodic update cycles (can do), I spin up a VM to do it for us, or we rely on the much slower API.
Note that most taxa have multiple IDs, and I don't know why that's not the case in the file we get out of the API. I can also get the taxonomic path or whatever it's called, or just the taxon name. Gonna put this on hold until we figure out how to handle the size issue.
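If a single identifier per taxon is needed, the pipe-separated ID strings could be split and filtered by prefix. This is only a sketch: the helper name `pick_id` is hypothetical, and which prefix to prefer (GBIF here) is an assumption, not something decided in this thread.

```r
# Hypothetical helper: return the first identifier carrying a given prefix
# from strings such as "EOL:1 | GBIF:2 | ITIS:3"; NA when none matches.
pick_id <- function(ids, prefix = "GBIF") {
  parts <- strsplit(ids, " | ", fixed = TRUE)
  vapply(parts, function(p) {
    hit <- p[startsWith(p, paste0(prefix, ":"))]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

# ID strings copied from the rows shown above
ids <- c(
  "EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311",
  "EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821"
)
gbif_ids <- pick_id(ids, "GBIF")
```

When a taxon carries several IDs with the same prefix, as the first string does for GBIF, this keeps only the first one, which may or may not be the right choice.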
The edgelist has 2487 unique interactions (uncleaned names), in case we want to sense-check it against the API version
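One way to run that sense check is to compare the two edge lists as sets of virus|host name pairs. The sketch below assumes both tables use the column names `source_taxon_name` and `target_taxon_name`; the helper name `compare_edges` and the toy rows are illustrative.

```r
# Hypothetical helper: compare two edge lists by their unique name pairs
compare_edges <- function(a, b) {
  key <- function(d) {
    unique(paste(d$source_taxon_name, d$target_taxon_name, sep = "|"))
  }
  ka <- key(a)
  kb <- key(b)
  list(shared = intersect(ka, kb),
       only_a = setdiff(ka, kb),
       only_b = setdiff(kb, ka))
}

# Toy edge lists, made up for illustration only
from_dump <- data.frame(
  source_taxon_name = c("virus a", "virus b", "virus b"),
  target_taxon_name = c("host x", "host y", "host y"),
  stringsAsFactors = FALSE
)
from_api <- data.frame(
  source_taxon_name = c("virus a", "virus c"),
  target_taxon_name = c("host x", "host z"),
  stringsAsFactors = FALSE
)
check <- compare_edges(from_dump, from_api)
```

Because the names are uncleaned, pairs that differ only in spelling or capitalization would show up as mismatches here.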
@tpoisot @cjcarlson what is the status of this issue?
Also, suggest to rephrase the title from "Recursion problem with GloBI" to "improve methods to de-duplicate VIRION records"
As predicted months ago, the GLOBI-VIRION recursion has entrenched spurious records. For example, the variola-Variola association has otherwise been fixed in GLOBI, but is now in VIRION, and so is now in GLOBI as VIRION:
Oops!
To fix it, I need to switch the GLOBI sourcing to point directly to their source files, which will allow attribution to be retained, and therefore allow VIRION-in-GLOBI to be removed from GLOBI-in-VIRION. This should kill the recursion issue and allow spurious records to be updated out of both.
This is a solution that allows @jhpoelen to keep VIRION indexed in GLOBI (even though we suggested this not happen, we've been convinced this has to be fair game: https://github.com/globalbioticinteractions/globalbioticinteractions/issues/665 https://github.com/globalbioticinteractions/virion/issues/1) and kills the recursion issue. It also follows Jorrit's suggested use cases for GLOBI.