monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Fix various issues with GWAS Catalog data #535

Open mbrush opened 6 years ago

mbrush commented 6 years ago

Following up on #482 about adding p-values and other association scores, we noted a few other areas where the GWAS data could be improved to make it better integrated and more useful.

Current Status:

Briefly, the GWAS associations link a dbSNP identifier (the subject) to one of four types of objects, using the RO:contributes_to predicate: EFO disease classes, EFO phenotype classes, EFO measurement classes, or GO terms.

Problems

Associations with EFO classes as their objects are problematic because EFO is poorly integrated with other Monarch ontologies, and EFO-annotated data is incompletely indexed by Cypher queries:

  1. Diseases: All EFO disease classes are asserted in the data as a subClassOf (or rdf:type of?) DOID:4 so they get properly indexed, but are not otherwise integrated with diseases in MONDO.
  2. Phenotypes: All EFO phenotype classes are asserted in the data as a subClassOf (or rdf:type of) the UPHENO root phenotype class UPHENO:0001001 so they get properly indexed, but are not otherwise integrated with phenotypes in the HPO.
  3. Measurements: Using EFO measurement classes as association objects is problematic because what is really being stated is that some phenotype exists, which the measurement is meant to reveal.
    • At present, these associations are not shown in the Monarch app because Measurement classes are not labeled with the SciGraph 'Phenotype' category, and thus are not returned by Cypher indexing queries.
    • For an example, see dbSNP:rs1799955, which is linked to http://www.ebi.ac.uk/efo/EFO_0004611 (low density lipoprotein cholesterol measurement) in the GWAS data set, but nothing is shown for it in the Monarch app.
  4. GO Terms: No attempt has been made to integrate these with existing ontologies, so I don't think these get indexed or displayed either.
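The indexing gap described in the list above can be illustrated with a small sketch. The category labels, the data layout, and every identifier containing PLACEHOLDER are assumptions made for illustration; this is not the actual SciGraph schema or Dipper code. Only rs1799955, EFO:0004611, and the 'Phenotype' category name come from the discussion above.

```python
# Hypothetical category labels per ontology class (not the real SciGraph schema).
CLASS_CATEGORIES = {
    "EFO:0004611": {"measurement"},              # LDL cholesterol measurement (from the example above)
    "EFO:PLACEHOLDER_PHENOTYPE": {"Phenotype"},  # placeholder ID, illustration only
}

# Associations as (subject, predicate, object) tuples.
ASSOCIATIONS = [
    ("dbSNP:rs1799955", "RO:contributes_to", "EFO:0004611"),
    ("dbSNP:rsPLACEHOLDER", "RO:contributes_to", "EFO:PLACEHOLDER_PHENOTYPE"),
]

# Assumed set of categories that the indexing queries select on.
INDEXED_CATEGORIES = {"Phenotype", "disease"}


def split_by_indexing(associations, class_categories, indexed_categories):
    """Separate associations whose object carries an indexed category from those that do not."""
    indexed, dropped = [], []
    for assoc in associations:
        categories = class_categories.get(assoc[2], set())
        if categories & indexed_categories:
            indexed.append(assoc)
        else:
            dropped.append(assoc)
    return indexed, dropped


indexed, dropped = split_by_indexing(ASSOCIATIONS, CLASS_CATEGORIES, INDEXED_CATEGORIES)
print("indexed:", indexed)
print("dropped:", dropped)
```

Under these assumptions the rs1799955 association ends up in `dropped`, because its object carries only the measurement label.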

Proposals

  1. EFO Diseases: Integration into MONDO (manually or via kBOOM) should result in the GWAS disease associations being fully integrated. Most EFO disease terms have many xrefs to terms in other disease and/or phenotype vocabularies.
  2. EFO Phenotypes: create equivalence axioms to HPO classes so that clique merging works and GWAS phenotype associations are fully integrated. Or swap out EFO for HPO classes.
  3. Measurement Classes: What should likely happen here is that EFO measurement classes get mapped to HPO or OBA classes (with new HPO/OBA classes created as needed), and these HPO/OBA classes then become the asserted association objects.
  4. GO Classes: Map these to auto-generated GO_PHENOTYPE classes? Not sure of the status of these.
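Proposal 3 could look roughly like the sketch below: rewrite the association object when a mapping exists, and collect the measurement classes that still need a new HPO/OBA class. The mapping table and both PLACEHOLDER identifiers are hypothetical; no real OBA or HPO term is implied for EFO:0004611.

```python
# Hypothetical mapping from EFO measurement classes to trait/phenotype classes.
# The target ID is a placeholder, not a real OBA term.
MEASUREMENT_TO_TRAIT = {
    "EFO:0004611": "OBA:PLACEHOLDER_LDL_CHOLESTEROL_TRAIT",
}


def remap_measurement_objects(associations, mapping):
    """Swap mapped measurement objects for their targets.

    Returns the rewritten associations plus a sorted list of objects that had
    no mapping (candidates for new HPO/OBA classes). Unmapped associations are
    passed through unchanged.
    """
    remapped = []
    unmapped = set()
    for subject, predicate, obj in associations:
        target = mapping.get(obj)
        if target is None:
            unmapped.add(obj)
            remapped.append((subject, predicate, obj))
        else:
            remapped.append((subject, predicate, target))
    return remapped, sorted(unmapped)


associations = [
    ("dbSNP:rs1799955", "RO:contributes_to", "EFO:0004611"),
    ("dbSNP:rsPLACEHOLDER", "RO:contributes_to", "EFO:PLACEHOLDER_UNMAPPED_MEASUREMENT"),
]
remapped, needs_new_class = remap_measurement_objects(associations, MEASUREMENT_TO_TRAIT)
print(remapped)
print(needs_new_class)
```

Reporting the unmapped classes separately keeps the "create new classes as needed" step visible instead of silently dropping those associations.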

Also, as #482 indicates, we should capture certain evidence data, such as p-values and odds ratio scores, that users may want to filter data by. These are currently captured only as free text in a dc:description. Capturing them is straightforward using SEPIO.
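To show why structured scores matter for filtering, here is a minimal sketch. It does not model SEPIO; the record type, field names, identifiers, and numeric values are all made up for illustration and are not real GWAS Catalog results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AssociationEvidence:
    """Hypothetical record holding scores as typed fields rather than free text."""
    subject: str
    obj: str
    p_value: Optional[float] = None
    odds_ratio: Optional[float] = None


def filter_by_p_value(records: List[AssociationEvidence], threshold: float) -> List[AssociationEvidence]:
    """Keep records with a known p-value at or below the threshold."""
    return [r for r in records if r.p_value is not None and r.p_value <= threshold]


# Dummy values for illustration only.
records = [
    AssociationEvidence("dbSNP:rsPLACEHOLDER_A", "EFO:PLACEHOLDER_1", p_value=1e-9, odds_ratio=1.2),
    AssociationEvidence("dbSNP:rsPLACEHOLDER_B", "EFO:PLACEHOLDER_2", p_value=1e-3),
    AssociationEvidence("dbSNP:rsPLACEHOLDER_C", "EFO:PLACEHOLDER_3"),  # no score recorded
]

significant = filter_by_p_value(records, threshold=5e-8)
print([r.subject for r in significant])
```

With the scores stored only inside a description string, the same filter would require parsing text whose format is not guaranteed.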

Finally, I also noted that we pull information about dbSNP variants into SciGraph via the GWAS Catalog (e.g. genomic position, allele freq, taxon). We should get this directly from the source (or myVariant) so that the data is current and has a single source of truth.

mbrush commented 6 years ago

Other related tickets: https://github.com/monarch-initiative/dipper/issues/320 https://github.com/monarch-initiative/monarch-disease-ontology/issues/57

mbrush commented 6 years ago

@cmungall I hear that EFO may be swapping out its disease classes for MONDO terms? This would solve problem 1 above. Is this happening? I see some EFO tickets popping up in the mondo repo.

Problems 2, 3, and 4 may require some ontology engineering/coordination between EFO, HPO, OBA, etc. For example, creating/mapping UPHENO terms based on GO classes used in the GWAS data, or creating/mapping OBA terms based on EFO measurements such as EFO:0004611 ! 'low density lipoprotein cholesterol measurement'.

@mellybelly thought it might be good to have @dosumis help when the time comes to address this ticket. Just wanted to tag relevant folks for now so this is on their radar. Nothing we need to do immediately.

dosumis commented 6 years ago

CC @simonjupp

simonjupp commented 6 years ago

Thanks. We've started looking at the EFO xrefs and how these can be incorporated into the kboom process. Let us know if there is anything we can do to help. Also cc'ing relevant EFO and GWAS people at EBI: @siiraa and @tburdett

cmungall commented 6 years ago

Every EFO disease class now has an equivalent in MONDO. But we need a strategy for non-disease entities.