monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Merging equivalent Associations in SciGraph #269

Open mbrush opened 8 years ago

mbrush commented 8 years ago

The DIPper pipeline creates oban:Associations for each G2P link it ingests from each data source, dumping these into output ttl files that get loaded into SciGraph. These Associations represent an 'assertion' as made by a specific source or database, but it is possible that more than one source will assert the same Association. For example, MGI and IMPC might both assert the Association between the same genotype (e.g. 'Rn/Rn [C57BL/6]') and phenotype (e.g. MP:0000372 ! 'randomly distributed white hairs'), leading to 'equivalent' Associations being dumped into SciGraph (Figures [A] and [B] at the bottom of the ticket diagram the RDF representations of these Associations). These represent the same underlying Association or fact, as made in two different assertions (one by MGI, one by IMPC).

For more efficient queries and data operations, it makes sense to collapse these under one Association and maintain the provenance of the separate assertions in the evidence lines that support the Association. (One model for doing this is diagrammed in [C] below, although there are alternative models for how this merge might look.)

The question is, assuming we want to perform this merge, how and where would we perform it? Equivalent Associations from different sources don’t 'meet' each other until they have left DIPper and entered SciGraph, where data across all sources is aggregated. Some post-DIPper processing step needs to happen at a point after all data that could possibly contain equivalent Associations are aggregated.

I will toss out some alternative approaches for discussion (bearing in mind my naivety as to the technical feasibility and efficiency of these options):

  1. Prior to dumping into SciGraph, aggregate all ttl files that could contain equivalent Associations and use something like SPARQL transforms to identify equivalent Associations and create new triples materializing the merged Association.
  2. After loading into SciGraph, use whatever tools would work in this setting to identify and merge equivalent Associations.
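
Whichever option is chosen, the core operation is the same: group assertions by an identity key and keep each source's provenance as a separate evidence line. Below is a minimal sketch in plain Python, not DIPper or SciGraph code. The `Assertion` type, the `merge_equivalent` function, and every identifier except MP:0000372 are hypothetical placeholders, and it assumes the subject and object identifiers have already been normalized.

```python
from collections import defaultdict
from typing import NamedTuple


class Assertion(NamedTuple):
    """One source's claim (hypothetical structure, for illustration only)."""
    subject: str       # genotype identifier
    predicate: str     # relation identifier
    obj: str           # phenotype or disease identifier
    source: str        # asserting database
    evidence: tuple    # evidence identifiers supplied by that source


def merge_equivalent(assertions):
    """Group assertions that share subject, predicate, and object.

    Returns a dict mapping each (subject, predicate, object) key to a list
    of evidence lines, one per original assertion, so provenance survives.
    """
    merged = defaultdict(list)
    for a in assertions:
        key = (a.subject, a.predicate, a.obj)
        merged[key].append({"source": a.source, "evidence": list(a.evidence)})
    return dict(merged)


# Placeholder identifiers; only MP:0000372 comes from the example above.
assertions = [
    Assertion("EX:genotype1", "EX:has_phenotype", "MP:0000372", "MGI", ("EX:ev1",)),
    Assertion("EX:genotype1", "EX:has_phenotype", "MP:0000372", "IMPC", ("EX:ev2",)),
    Assertion("EX:genotype2", "EX:has_phenotype", "MP:0000372", "MGI", ("EX:ev3",)),
]
merged = merge_equivalent(assertions)
```

Here the MGI and IMPC assertions about `EX:genotype1` collapse into one key with two evidence lines, while `EX:genotype2` stays separate. A SPARQL-based version of option 1 would need the same grouping key.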

Some issues to consider:

  1. Equivalent Associations are recognized by having the same subject genotype, predicate, and object phenotype/disease. And in some cases where environment or stage information is provided (e.g. ZFIN), these represent additional identity criteria for Associations that must also be matched. Recognizing equivalent phenotypes/conditions is straightforward, even if recorded using different identifiers, given the equivalency mappings we have in MONDO and Upheno. But recognizing equivalent genotypes is more challenging, given that they may have different syntax in their labels and different identifiers from their respective sources. And when environment or stage info is included in the Association, we must also consider how to recognize equivalency here.
  2. While it will be exceedingly rare in practice that different model organism data sources assert the exact same Association between the same G and P, it is quite common in human G2P data to see the same Association asserted by many sources. This is exemplified in ClinVar, which aggregates assertions (SCVs) from many databases/sources, where many represent the same general Association. The example here shows a 'pathogenic for' Association between the variant NM_000059.3(BRCA2):c.5946del and the disease Breast-ovarian cancer, familial that is asserted by seven sources (each of which may base its assertion on different evidence and criteria). Our approach for merging such assertions of the same Association should consider use cases around filtering for data that excludes assertions from specific sources/organizations, or includes only Associations based on evidence of a given type. Where we perform the merge should accommodate such assertion/evidence level queries, which may require pre-merged data (or the merged data should be constructed so as to accommodate them).
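
To make the two issues above concrete, here is a rough Python sketch of an identity key that includes optional environment and stage, plus a filter over the evidence lines of a merged Association. Everything in it is an assumption made for illustration: the dict layout, the `EQUIVALENTS` table (a stand-in for mappings such as those in MONDO), the function names, and all identifiers are hypothetical.

```python
# Hypothetical equivalence table: maps an identifier to its canonical form.
EQUIVALENTS = {"EX:disease_b": "EX:disease_a"}


def canonical(curie, equivalents):
    """Return the canonical identifier, or the input if no mapping exists."""
    return equivalents.get(curie, curie)


def identity_key(assoc, equivalents):
    """Build the key that decides whether two assertions are equivalent.

    Environment and stage are part of the key when a source provides them;
    assertions that omit them get None in those positions.
    """
    return (
        canonical(assoc["subject"], equivalents),
        assoc["predicate"],
        canonical(assoc["object"], equivalents),
        assoc.get("environment"),
        assoc.get("stage"),
    )


def select_evidence(evidence_lines, exclude_sources=frozenset(), evidence_type=None):
    """Keep evidence lines not from excluded sources and of the wanted type."""
    return [
        line
        for line in evidence_lines
        if line["source"] not in exclude_sources
        and (evidence_type is None or line["type"] == evidence_type)
    ]


a = {"subject": "EX:variant1", "predicate": "EX:pathogenic_for", "object": "EX:disease_a"}
b = {"subject": "EX:variant1", "predicate": "EX:pathogenic_for", "object": "EX:disease_b"}
same = identity_key(a, EQUIVALENTS) == identity_key(b, EQUIVALENTS)

lines = [
    {"source": "source_1", "type": "EX:clinical_testing"},
    {"source": "source_2", "type": "EX:literature"},
    {"source": "source_3", "type": "EX:clinical_testing"},
]
kept = select_evidence(lines, exclude_sources={"source_1"}, evidence_type="EX:clinical_testing")
```

Because each evidence line keeps its own source and type, source-level and evidence-level filtering remain possible after the merge. Genotype equivalence is the harder part and is not solved by a lookup table like this one.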

FIGURES

[A] MGI assertion

mgi association

[B] IMPC Assertion of same Association

impc association

[C] MGI and IMPC assertions merged under same Association

merged mgi-impc association

jmcmurry commented 8 years ago

Thanks Matt, this is great. In the fullness of time, it will also be important to identify where a single line of evidence may be repeated in different sources (rather than multiple lines of evidence for a single association). Eg. a paper that is curated in separate places etc. That is another problem for another day.

mbrush commented 8 years ago

Brief summary of discussion on 2-24-16 DIPper call: Attending: Matt, Julie, Tom, Kent, Jeremy

cmungall commented 8 years ago
jnguyenx commented 8 years ago

The golr output process should not be affected by that, or at worst only small changes will be needed.

For info, we have 4m associations right now in SciGraph.

cmungall commented 8 years ago

Right, but we have the analogous decision to make for golr. On the one hand it's simpler to have the same model; on the other, there may be considerations based on how we want to filter things in the UI.

mbrush commented 8 years ago

I want to step back and clarify our main use cases for needing to identify and aggregate/merge 'equivalent' associations (assertions making the same claim about a G-P relationship) in the first place. This relates to the need to aggregate and assess all provenance/evidence information that supports and/or refutes each such 'unique' association in the data. In the UI, we may want to present this information on something like the Association pages we are currently testing, so users can evaluate an association based on all the evidence there is _for_ and _against_ it, and where this evidence comes from. If we capture assertions but don’t have some way of identifying those that assert the same association, we cannot address such use cases. At some point in the life cycle of our data, we would need the ability to collapse all equivalent assertions.

Of course, for this use case the aggregation/merging need not happen in the ttl or SciGraph data itself. The data could capture the more atomic assertions made by each source, and the aggregation could happen dynamically for purposes of presentation and queries. Ultimately, what is most efficient is up to the architecture folks. But to me it seems prudent to create, at some point, a data artifact materializing these aggregations. If nothing else, this may take the burden of merging associations off the tooling, especially if/when we serve linked data for third-party re-use, where users may not grasp all the nuances/requirements for association equivalence (see #271).

A second point here is that, all this being said, at present we are unlikely to encounter equivalent associations _across_ sources in our data, as each source deals for the most part with disjoint sets of associations. ClinVar is a unique exception, as it aggregates variant-disease assertions from many sources. It therefore does contain many assertions of the same association, each with its own evidence/provenance trail. But these are housed _within_ a single data source (ClinVar), which does the hard work of merging assertions of the same association (SCVs aggregated under RCVs). So perhaps for now we punt on the issue of dealing with equivalent associations _across_ sources, and focus on how to treat the merged associations _within_ ClinVar's data when transforming into our data model. This is a simpler and more immediate problem to tackle.

mbrush commented 8 years ago

To Chris' questions:

_What are the expected gains in terms of query efficiency/simplicity? This should be traded off against complexity in the pipeline?_

_What happens if some sources provide association IDs?_

_Upstream is good but presumably equivalence sets span files_

_Did you discuss the golr output?_

cmungall commented 8 years ago

I'm leaning towards late binding, i.e. at the UI level, in the association pages.

We should definitely eliminate redundancy in the app on disease, gene etc. pages. But this may best be done at the golr level. The approach taken should be the same as for https://github.com/geneontology/amigo/issues/294