viralemergence / virion

The Global Virome in One Network
https://viralemergence.github.io/virion

Recursion problem with GLOBI #67

Open cjcarlson opened 2 years ago

cjcarlson commented 2 years ago

As predicted months ago, the GLOBI-VIRION recursion has entrenched spurious records. For example, the variola-Variola association has otherwise been fixed in GLOBI, but is now in VIRION, and so is now in GLOBI as VIRION:

image

Oops!

To fix it, I need to switch the GLOBI sourcing to point directly to their source files, which will allow attribution to be retained and therefore allow VIRION-in-GLOBI to be removed from GLOBI-in-VIRION. This should kill the recursion issue and allow spurious records to be updated out of both.

This solution lets @jhpoelen keep VIRION indexed in GLOBI (we suggested against this, but we've been convinced it has to be fair game: https://github.com/globalbioticinteractions/globalbioticinteractions/issues/665 https://github.com/globalbioticinteractions/virion/issues/1) and kills the recursion issue. It also follows Jorrit's suggested use cases for GLOBI.

cjcarlson commented 2 years ago

@tpoisot we'll need to figure out how to pull down the massive file and then cut it down to Virus/Viruses as a replacement for the 'rglobi' pipeline, a la how we did it with NCBI, without breaking the GitHub Actions.

tpoisot commented 2 years ago

No problem. We can make a branch that doesn't deploy anything after the build. If you know which file I need to pull, I can get started on this really easily.

cjcarlson commented 2 years ago

From Jorrit in October:

re: using the R API for GloBI as stated on the GloBI data page at https://globalbioticinteractions.org/data :

"Exploratory, interactive queries can be executed through SPARQL and Cypher (see more examples) endpoints, GloBI Search/Browse pages, or by using the REST-y GloBI Web API. For those that use R, rglobi is available to explore interaction data. rglobi can also be used to execute Cypher queries.

For research or other data intensive project, please use GloBI’s stable versioned integrated data published via doi:10.5281/zenodo.3950589 or, perhaps even better, consider using the original underlying datasets. Please see the process page to better understand how GloBI integrates data so that you can make an informed decision on what data to use for your studies."

Also, if you do choose to continue to use the R API for your research, please note that there's ways to include the source meta-data into your results. See e.g., https://github.com/ropensci/rglobi/blob/0717d317c8048af36168e61750fa79e296f2ff06/R/rglobi.R#L193 . If this does not contain the fields you are looking for, please open a separate issue with a description of the field and one or two specific examples.

cjcarlson commented 2 years ago

I coded all of it in R because that's the only language I know, but I'm also not sure that 'rglobi' actually allows the sourcing to be called using that argument. I'll check now before I step out...

tpoisot commented 2 years ago

Ok so we can pull the data, remove anything attributed to Virion, cut to viruses, and reindex?

cjcarlson commented 2 years ago

> library(rglobi)
> get_interactions_by_taxa("Variola", "Variola")
  source_taxon_external_id   source_taxon_name source_taxon_path source_specimen_life_stage interaction_type target_taxon_external_id target_taxon_name
1               NCBI:12870 Variola major virus                NA                         NA       pathogenOf             GBIF:2389099           Variola
2               NCBI:12870 Variola major virus                NA                         NA       pathogenOf             GBIF:2389099           Variola
                                                          target_taxon_path target_specimen_life_stage latitude longitude study_citation
1 Animalia | Chordata | Actinopterygii | Perciformes | Serranidae | Variola                         NA       NA        NA             NA
2 Animalia | Chordata | Actinopterygii | Perciformes | Serranidae | Variola                         NA       NA        NA             NA
  study_source_citation
1                    NA
2                    NA
> get_interactions_by_taxa("Variola", "Variola", showfield = "...")

I could try putting something in the "..." but I don't know how to see what the API fields are

cjcarlson commented 2 years ago

The two queries I use are sourcetaxon = 'Virus' / 'Viruses' matched to targettaxon = 'Vertebrata'

cjcarlson commented 2 years ago

At present, the script '02_3a_Download GLOBI.R' generates a file written to 'Source/GLOBI-raw.csv' based on those queries. If you can do what you described:

Ok so we can pull the data, remove anything attributed to Virion, cut to viruses, and reindex?

And write it to that file, I think we can just replace the rest!

jhpoelen commented 2 years ago

If you do choose to use rglobi instead of the published data products at https://globalbioticinteractions.org/data , you might be interested in the rglobi vignette fragment mentioned in https://github.com/ropensci/rglobi/issues/38 .

tpoisot commented 2 years ago

@cjcarlson I think I can pull the raw data and work from there, no problem. I'll use the file in the repo as a template and make a diff. If it's all above board, we can merge when you come back.

cjcarlson commented 2 years ago

Sounds great. I think Jorrit is right that we could implement the other solution, but returnobservations = T will return a much, much bigger dataset, and I'll need to run some unique operations on it. I hadn't really processed that those were already happening automatically. Can you incorporate that on your end as well? Also, maybe we can figure out how to harmonize the study_source_citation field into the VIRION architecture as a longer-term thing, to improve credit and attribution.

cjcarlson commented 2 years ago

The only GLOBI fields currently retained are

    source_taxon_external_id,
    source_taxon_name,
    target_taxon_external_id,
    target_taxon_name

So those are what we turn into unique values and process.

cjcarlson commented 2 years ago

If you can hotfix something that just keeps those four in GLOBI-raw.csv, after subsetting out the VIRION citations, we can then come back in a bit and figure out how to deal with dataset attribution versus primary-literature source attribution.
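The hotfix described above (drop VIRION-attributed rows, keep the four fields, deduplicate) could look roughly like the following. This is a minimal sketch, not the VIRION pipeline itself: it assumes the attribution lives in `study_source_citation` and that VIRION-derived rows can be recognised by the substring "virion" in that field. The function names, the `EXAMPLE:` identifiers, and the toy citations are hypothetical.

```python
import csv
import io

# The four GLOBI fields that VIRION currently retains (from the thread).
KEEP = [
    "source_taxon_external_id",
    "source_taxon_name",
    "target_taxon_external_id",
    "target_taxon_name",
]


def clean_globi(rows, citation_field="study_source_citation", marker="virion"):
    """Drop rows whose citation mentions `marker`, keep four fields, dedupe.

    Assumption: a case-insensitive substring match on the citation field is
    enough to identify VIRION-derived records.
    """
    seen = set()
    cleaned = []
    for row in rows:
        citation = (row.get(citation_field) or "").lower()
        if marker in citation:
            continue  # record came back into GLOBI from VIRION; skip it
        key = tuple(row[field] for field in KEEP)
        if key not in seen:  # keep only the first copy of each edge
            seen.add(key)
            cleaned.append(dict(zip(KEEP, key)))
    return cleaned


def to_csv(rows):
    """Serialise cleaned rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=KEEP, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def make_row(virus_id, virus, host_id, host, citation):
    """Build one toy record; the identifiers are placeholders, not real IDs."""
    return {
        "source_taxon_external_id": virus_id,
        "source_taxon_name": virus,
        "target_taxon_external_id": host_id,
        "target_taxon_name": host,
        "study_source_citation": citation,
    }


sample = [
    make_row("EXAMPLE:v2", "Variola major virus", "EXAMPLE:h2", "Variola",
             "hypothetical/path/Virion.csv.gz"),
    make_row("EXAMPLE:v1", "Yellow fever virus", "EXAMPLE:h1",
             "Carollia perspicillata", "Wardeh et al. 2015 Sci Data"),
    make_row("EXAMPLE:v1", "Yellow fever virus", "EXAMPLE:h1",
             "Carollia perspicillata", "https://www.ncbi.nlm.nih.gov/pubmed/204207"),
    make_row("EXAMPLE:v1", "Yellow fever virus", "EXAMPLE:h3",
             "Callithrix penicillata", "Wardeh et al. 2015 Sci Data"),
]

cleaned = clean_globi(sample)
print(to_csv(cleaned))
```

Deduplicating on the four retained fields means that two citations for the same virus-host pair collapse into one edge, which is why attribution has to be handled separately if it is to be kept.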

tpoisot commented 2 years ago

The uncleaned citation data are a bit of a mess, but (omitting IDs), this is what I get:

Yellow fever virus|Callithrix penicillata|Wardeh et al. 2015 Sci Data
Yellow fever virus|Carollia perspicillata|Price, J.L. Isolation Of Rio Bravo and a hitherto undescribed agent, Tamana bat virus, from insectivorous bats in Trinidad, with serological evidence of infection in Bats And Man. Am. J. Trop. Med. Hyg. 1978, 27, 153-161. 
Yellow fever virus|Carollia perspicillata|https://www.ncbi.nlm.nih.gov/pubmed/204207
Yellow fever virus|Carollia perspicillata|Wardeh et al. 2015 Sci Data
Yellow fever virus|Cercopithecus ascanius|https://www.ncbi.nlm.nih.gov/pubmed/14893439
Yellow fever virus|Cercopithecus ascanius|Wardeh et al. 2015 Sci Data
Yellow fever virus|Cercopithecus diana|Wardeh et al. 2015 Sci Data

tpoisot commented 2 years ago

The query I used is, for the record:

SELECT DISTINCT
    sourceTaxonIds, sourceTaxonName, targetTaxonIds, targetTaxonName, referenceCitation
FROM
    interactions
WHERE
    sourceTaxonKingdomName IN ('Virus', 'Viruses')
    AND targetTaxonClassName LIKE 'Mammal%'
    AND referenceCitation NOT LIKE '%Virion.csv.gz'
ORDER BY
    sourceTaxonName, targetTaxonName;
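
To see how the three predicates in this query behave, it can be run against a toy table. The sketch below uses Python's built-in `sqlite3` purely for illustration; the thread does not say which database engine was used, and the rows, `EXAMPLE:` identifiers, and `example/Virion.csv.gz` citation are made up. One thing it shows: in SQL, `NOT LIKE` evaluates to NULL for a NULL citation, so rows without a `referenceCitation` are dropped as well.

```python
import sqlite3

# The query from the thread, unchanged apart from whitespace.
QUERY = """
SELECT DISTINCT
    sourceTaxonIds, sourceTaxonName, targetTaxonIds, targetTaxonName, referenceCitation
FROM
    interactions
WHERE
    sourceTaxonKingdomName IN ('Virus', 'Viruses')
    AND targetTaxonClassName LIKE 'Mammal%'
    AND referenceCitation NOT LIKE '%Virion.csv.gz'
ORDER BY
    sourceTaxonName, targetTaxonName;
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE interactions (
        sourceTaxonIds TEXT, sourceTaxonName TEXT, sourceTaxonKingdomName TEXT,
        targetTaxonIds TEXT, targetTaxonName TEXT, targetTaxonClassName TEXT,
        referenceCitation TEXT
    )
    """
)

# Toy rows (hypothetical identifiers and citations).
toy_rows = [
    # Kept: virus source, mammal target, non-VIRION citation.
    ("EXAMPLE:v1", "Yellow fever virus", "Viruses",
     "EXAMPLE:h1", "Carollia perspicillata", "Mammalia",
     "Wardeh et al. 2015 Sci Data"),
    # Exact duplicate of the row above: removed by SELECT DISTINCT.
    ("EXAMPLE:v1", "Yellow fever virus", "Viruses",
     "EXAMPLE:h1", "Carollia perspicillata", "Mammalia",
     "Wardeh et al. 2015 Sci Data"),
    # Dropped: target is a fish genus, and the citation points at VIRION.
    ("EXAMPLE:v2", "Variola major virus", "Viruses",
     "EXAMPLE:h2", "Variola", "Actinopterygii",
     "example/Virion.csv.gz"),
    # Dropped: mammal target, but attributed to VIRION.
    ("EXAMPLE:v1", "Yellow fever virus", "Viruses",
     "EXAMPLE:h3", "Callithrix penicillata", "Mammalia",
     "example/Virion.csv.gz"),
    # Dropped: NULL citation makes the NOT LIKE predicate NULL, not true.
    ("EXAMPLE:v1", "Yellow fever virus", "Viruses",
     "EXAMPLE:h4", "Cercopithecus diana", "Mammalia",
     None),
]
conn.executemany("INSERT INTO interactions VALUES (?, ?, ?, ?, ?, ?, ?)", toy_rows)

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

If uncited records should survive the filter, the last predicate would need an explicit `referenceCitation IS NULL OR ...` branch.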

tpoisot commented 2 years ago

This gives us ~45k rows, and the Variola issue is fixed, which is good. The first five rows look like:

EOL_V2:741107 | WD:Q4681774;Adelaide River virus;EOL:10408207 | EOL:328699 | GBIF:2441022 | GBIF:4262590 | INAT_TAXON:74113 | IRMNG:10194972 | ITIS:183838 | ITIS:898719 | WD:Q21401245 | doi:10.5281/zenodo.3916389 | http://taxon-concept.plazi.org/id/Animalia/Bos_taurus_Linnaeus_1758 | http://treatment.plazi.org/id/80F7388E4BC26B8776DCEE1EC2042A83;Bos taurus;http://www.ncbi.nlm.nih.gov/nuccore/600151
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/7337871
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/3722893
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/1340757
EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311;African horse sickness virus;EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821;Canis lupus familiaris;http://www.ncbi.nlm.nih.gov/pubmed/1574748

tpoisot commented 2 years ago

The one issue is that there is no way this runs as a GitHub Action, since the unzipped interactions database clocks in at 24 GB. So we either need periodic update cycles (can do), I can spin up a VM to do it for us, or we rely on the much slower API.
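
One way to sidestep the unzipped size, sketched here under assumptions, is to stream the compressed dump row by row and apply the same filters without ever writing the full table to disk. The example assumes a gzip-compressed, tab-separated file with the column names used in the query above; the function name and the toy rows are hypothetical, and whether this fits within GitHub Actions time and bandwidth limits is untested.

```python
import csv
import gzip
import io


def stream_virus_mammal_rows(binary_file, delimiter="\t"):
    """Yield filtered rows from a gzip-compressed delimited file.

    Rows are decompressed and parsed one at a time, so memory use stays
    roughly constant regardless of the size of the dump.
    """
    with gzip.open(binary_file, mode="rt", encoding="utf-8", newline="") as text:
        for row in csv.DictReader(text, delimiter=delimiter):
            if row["sourceTaxonKingdomName"] not in ("Virus", "Viruses"):
                continue
            if not row["targetTaxonClassName"].startswith("Mammal"):
                continue
            if row["referenceCitation"].endswith("Virion.csv.gz"):
                continue  # skip records that GLOBI sourced from VIRION
            yield row


# Build a tiny compressed file in memory to stand in for the real dump.
header = [
    "sourceTaxonName", "sourceTaxonKingdomName",
    "targetTaxonName", "targetTaxonClassName", "referenceCitation",
]
toy_rows = [
    ["Yellow fever virus", "Viruses", "Carollia perspicillata", "Mammalia",
     "Wardeh et al. 2015 Sci Data"],
    ["Variola major virus", "Viruses", "Variola", "Actinopterygii",
     "example/Virion.csv.gz"],
    ["Yellow fever virus", "Viruses", "Callithrix penicillata", "Mammalia",
     "example/Virion.csv.gz"],
]
raw = io.StringIO()
writer = csv.writer(raw, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(toy_rows)
compressed = io.BytesIO(gzip.compress(raw.getvalue().encode("utf-8")))

kept = list(stream_virus_mammal_rows(compressed))
for row in kept:
    print(row["sourceTaxonName"], "|", row["targetTaxonName"])
```

The same generator accepts a path or an open binary file object, so it could read from a file on disk or from a download stream.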

tpoisot commented 2 years ago

Note that most taxa have multiple IDs, and I don't know why that's not the case in the file we get out of the API. I can also get the taxonomic path or whatever it's called, or just the taxon name. Gonna put this on hold until we figure out how to handle the size issue.
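
If a single identifier per taxon is needed, the pipe-separated ID field from the rows above can be split and one ID picked by prefix. This is a sketch only: the ` | ` separator is taken from the rows shown in this thread, while the preference order and function names are hypothetical choices, not something VIRION or GloBI prescribes.

```python
def split_ids(field, sep=" | "):
    """Split a GloBI-style ID field into a list of individual identifiers."""
    return [part.strip() for part in field.split(sep) if part.strip()]


def pick_id(field, preferred=("NCBI", "GBIF", "EOL")):
    """Return the first identifier matching the preferred prefixes, in order.

    Falls back to the first identifier listed, or None for an empty field.
    The default preference order is an arbitrary illustration.
    """
    ids = split_ids(field)
    for prefix in preferred:
        for identifier in ids:
            if identifier.startswith(prefix + ":"):
                return identifier
    return ids[0] if ids else None


# ID fields copied from the African horse sickness virus rows above.
virus_ids = "EOL:540097 | GBIF:8984596 | GBIF:9997724 | IRMNG:11459028 | WD:Q4408311"
host_ids = "EOL:1228387 | GBIF:6164210 | INAT_TAXON:47144 | IRMNG:11407661 | ITIS:726821"

print(split_ids(virus_ids))
print(pick_id(virus_ids))                       # no NCBI ID, so first GBIF ID
print(pick_id(host_ids, preferred=("ITIS",)))   # explicit preference for ITIS
```

Matching on `prefix + ":"` keeps `EOL_V2:741107` from being mistaken for an `EOL:` identifier.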

tpoisot commented 2 years ago

The edgelist has 2487 unique interactions (uncleaned names), in case we want to sense-check it against the API version
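
A sense check of that kind could be done by reducing both sources to sets of (virus, host) pairs and looking at the differences. The sketch below is illustrative: the normalisation (trim and lowercase) is an assumption about what "uncleaned names" would need, the function names are hypothetical, and the toy pairs are not real query results.

```python
def edge_set(pairs):
    """Normalise (virus, host) name pairs and collapse them into a set."""
    return {(virus.strip().lower(), host.strip().lower()) for virus, host in pairs}


def compare_edgelists(dump_pairs, api_pairs):
    """Report which edges are shared and which appear in only one source."""
    dump = edge_set(dump_pairs)
    api = edge_set(api_pairs)
    return {
        "shared": dump & api,
        "only_in_dump": dump - api,
        "only_in_api": api - dump,
    }


# Toy edge lists: one from the filtered dump, one from the API-based file.
dump_pairs = [
    ("Yellow fever virus", "Callithrix penicillata"),
    ("Yellow fever virus", "Carollia perspicillata"),
]
api_pairs = [
    ("Yellow fever virus ", "carollia perspicillata"),  # same edge, messy name
    ("Variola major virus", "Variola"),                 # the spurious record
]

report = compare_edgelists(dump_pairs, api_pairs)
for name, edges in report.items():
    print(name, len(edges), sorted(edges))
```

Edges that show up only on the API side are the ones worth inspecting first, since those are where recirculated VIRION records would be expected to appear.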

jhpoelen commented 2 years ago

@tpoisot @cjcarlson what is the status of this issue?

Also, I suggest rephrasing the title from "Recursion problem with GloBI" to "improve methods to de-duplicate VIRION records".