Merging equivalent Associations in SciGraph

mbrush commented 8 years ago

The DIPper pipeline creates oban:Associations for each G2P link it ingests from each data source, dumping these into output ttl files that get loaded into SciGraph. These Associations represent an 'assertion' as made by a specific source or database, but it is possible that more than one source will assert the same Association. For example, MGI and IMPC might both assert the Association between the same genotype (e.g. 'Rn/Rn [C57BL/6]') and phenotype (e.g. MP:0000372 ! 'randomly distributed white hairs'), leading to 'equivalent' Associations being dumped into SciGraph (Figures [A] and [B] at the bottom of the ticket diagram the RDF representations of these Associations). These represent the same underlying Association or fact, as made in two different assertions (one by MGI, one by IMPC).

For purposes of more efficient queries and data operations, it makes sense to collapse these under one Association, and maintain the provenance of the separate assertions in the evidence lines that support the Association. (One model for doing this is diagrammed in [C] below, although there are alternative models for how this merge might look.)

The question is, assuming we want to perform this merge, how and where would we perform it? Equivalent Associations from different sources don’t 'meet' each other until they have left DIPper and entered SciGraph, where data across all sources is aggregated. Some post-DIPper processing step needs to happen at a point after all data that could possibly contain equivalent associations are aggregated.

I will toss out some alternative approaches for discussion (bearing in mind my naivety as to the technical feasibility and efficiency of these options):

Prior to dumping into SciGraph, aggregate all ttl files that could contain equivalent Associations and use something like SPARQL transforms to identify equivalent Associations and create new triples materializing the merged Association.
After loading into SciGraph, use whatever tools would work in this setting to identify and merge equivalent Associations.
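Whichever of the two options is chosen, the core of the merge step could look roughly like the sketch below. It is written in plain Python rather than SPARQL or SciGraph tooling, and the record shapes (`Assertion`, `MergedAssociation`), the `has_phenotype` predicate string, and the `EX:` genotype identifiers are all hypothetical, not DIPper's actual output. It only illustrates collapsing assertions that share subject, predicate, and object under one Association while keeping one evidence line per original assertion.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """One source-level assertion (hypothetical shape, not DIPper's real output)."""
    source: str
    subject: str    # genotype identifier
    predicate: str
    obj: str        # phenotype/disease identifier


@dataclass
class MergedAssociation:
    """One Association with an evidence line per contributing assertion."""
    subject: str
    predicate: str
    obj: str
    evidence_lines: list = field(default_factory=list)


def merge_assertions(assertions):
    """Group assertions that share (subject, predicate, object)."""
    grouped = defaultdict(list)
    for a in assertions:
        grouped[(a.subject, a.predicate, a.obj)].append(a)
    return [
        MergedAssociation(s, p, o, [{"asserted_by": a.source} for a in group])
        for (s, p, o), group in grouped.items()
    ]


# Placeholder identifiers; only MP:0000372 and the source names come from the ticket.
assertions = [
    Assertion("MGI", "EX:genotype1", "has_phenotype", "MP:0000372"),
    Assertion("IMPC", "EX:genotype1", "has_phenotype", "MP:0000372"),
    Assertion("MGI", "EX:genotype2", "has_phenotype", "MP:0000372"),
]
merged = merge_assertions(assertions)
```

Here the MGI and IMPC assertions about `EX:genotype1` end up as two evidence lines on a single merged Association, which is the shape diagrammed in [C]. This sketch assumes identifiers already match exactly across sources, which is the hard part discussed under the issues below.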

Some issues to consider:

Equivalent Associations are recognized by having the same subject genotype, predicate, and object phenotype/disease. And in some cases where environment or stage information is provided (e.g. ZFIN), these represent additional identity criteria for Associations that must also be matched. Recognizing equivalent phenotypes/conditions is straightforward, even if recorded using different identifiers, given the equivalency mappings we have in MONDO and Upheno. But recognizing equivalent genotypes is more challenging, given that they may have different syntax in their labels and different identifiers from their respective sources. And when environment or stage info is included in the Association, we must also consider how to recognize equivalency here.
While it will be exceedingly rare in practice that different model organism data sources assert the exact same Association between the same G and P, it is quite common in human G2P data to see the same Association asserted by many sources. This is exemplified in ClinVar, which aggregates assertions (SCVs) from many databases/sources, where many represent the same general Association. The example here shows a 'pathogenic for' Association between the variant NM_000059.3(BRCA2):c.5946del and the disease Breast-ovarian cancer, familial that is asserted by seven sources (each of which may base its assertion on different evidence and criteria). Our approach for merging such assertions of the same Association should consider use cases around filtering for data that excludes assertions from specific sources/organizations, or includes only Associations based on evidence of a given type. Where we perform the merge should accommodate such assertion/evidence level queries, which may require pre-merged data (or the merged data should be constructed so as to accommodate such assertion/evidence level queries).
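The identity criteria in the first issue above could be expressed as a key function, sketched below under some loud assumptions: the `CANONICAL` table is a hypothetical stand-in for equivalence mappings of the kind MONDO and Upheno provide (and for whatever genotype-matching rules are eventually drafted), and every identifier in it is a made-up placeholder. Two assertions would count as the same Association only if their keys are equal, with environment and stage taking part in the key when a source provides them.

```python
# Hypothetical equivalence table: each identifier maps to one canonical identifier.
# In practice this would be derived from real mappings, not hand-written.
CANONICAL = {
    "SRC_A:pheno1": "EX:pheno1",
    "SRC_B:phenoX": "EX:pheno1",
    "SRC_A:geno1": "EX:geno1",
    "SRC_B:geno9": "EX:geno1",
}


def canonical(identifier):
    """Map an identifier to its canonical form; unknown identifiers map to themselves."""
    if identifier is None:
        return None
    return CANONICAL.get(identifier, identifier)


def association_key(subject, predicate, obj, environment=None, stage=None):
    """Build the identity key for an Association.

    Environment and stage are part of the key, so an assertion that
    specifies them never collapses into one that does not.
    """
    return (
        canonical(subject),
        predicate,
        canonical(obj),
        canonical(environment),
        canonical(stage),
    )


key_a = association_key("SRC_A:geno1", "has_phenotype", "SRC_A:pheno1")
key_b = association_key("SRC_B:geno9", "has_phenotype", "SRC_B:phenoX")
key_c = association_key("SRC_B:geno9", "has_phenotype", "SRC_B:phenoX",
                        stage="EX:stage1")
```

With this table, `key_a` and `key_b` are equal even though the two sources use different identifiers, while `key_c` differs because it carries stage information. Whether a stage-qualified assertion should instead roll up under the unqualified Association is a modeling decision this sketch does not settle.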

FIGURES

[A] MGI assertion

mgi association

[B] IMPC Assertion of same Association

impc association

[C] MGI and IMPC assertions merged under same Association

merged mgi-impc association

jmcmurry commented 8 years ago

Thanks Matt, this is great. In the fullness of time, it will also be important to identify where a single line of evidence may be repeated in different sources (rather than multiple lines of evidence for a single association). Eg. a paper that is curated in separate places etc. That is another problem for another day.

mbrush commented 8 years ago

Brief summary of discussion on 2-24-16 DIPper call: Attending: Matt, Julie, Tom, Kent, Jeremy

All agreed on desired endpoint/model for the merged associations, and discussed parameters to be considered for achieving this.
Jeremy will explore pros/cons of approaches outlined above (performing association merges on ttl pre-SciGraph, using SPARQL or other technology, vs performing merges in SciGraph itself.)
Matt to draft some rules and considerations for determining equivalence of entities related to associations (see #271)
Was noted that the former (merging on the ttl pre-SciGraph) has the benefit of creating an RDF version of the processed/merged data that could support LOD services/SPARQL endpoints, and serve as archive-able snapshots of the raw data at different points in time.
Was also noted that this work is timely for some of the Identifier commons work Julie is involved in, and could provide good real life examples of the need for identifying equivalencies between and merging identifiers.
Feedback from others, @cmungall, @mellybelly?

cmungall commented 8 years ago

What are the expected gains in terms of query efficiency/simplicity? This should be traded off against complexity in the pipeline
What happens if some sources provide association IDs?
Upstream is good but presumably equivalence sets span files
Did you discuss the golr output?

jnguyenx commented 8 years ago

The golr output process should not be affected by that, or at worst only small changes will be needed.

For info, we have 4m associations right now in SciGraph.

cmungall commented 8 years ago

Right, but we have the analogous decision to make for golr. On the one hand it's simpler to have the same model; on the other there may be considerations based on how we want to filter things in the UI.

mbrush commented 8 years ago

Want to step back and clarify our main use cases for needing to identify and aggregate/merge 'equivalent' associations (assertions making the same claim about a G-P relationship) in the first place. This relates to the need to aggregate and assess all provenance/evidence information that supports and/or refutes each such 'unique' association in the data. In the UI, we may want to present this information on something like the Association pages we are currently testing, so users can evaluate an association based on all the evidence there is _for_ and _against_ a given association, and where this evidence comes from. If we capture assertions but don’t have some way of identifying those that assert the same association, we cannot address such use cases. At some point in the life cycle of our data, we would need the ability to collapse all equivalent assertions.

Of course, for this use case the aggregation/merging need not happen in the ttl or SciGraph data itself. The data could capture the more atomic assertions made by each source, and the aggregation can happen dynamically for purposes of presentation and queries. Ultimately what is most efficient is up to the architecture folks. But to me it seems prudent to at some point create a data artifact materializing these aggregations. If nothing else this may be useful to take the burden of merging associations off the tooling - esp. if/when we serve linked data for third party re-use, where users may not grasp all nuances/requirements for association equivalence (see #271).
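To make the dynamic option concrete, here is a rough sketch of aggregation at query time: the stored data stays as atomic assertions, and filters on source or evidence type are applied before grouping. The record shape, the `source_N` and `type_x` values, and the `EX:` identifiers are all placeholders invented for illustration.

```python
from collections import defaultdict

# Atomic assertions as they might be stored (hypothetical shape and values).
ASSERTIONS = [
    {"source": "source_1", "evidence_type": "type_a",
     "subject": "EX:variant1", "predicate": "pathogenic_for", "object": "EX:disease1"},
    {"source": "source_2", "evidence_type": "type_b",
     "subject": "EX:variant1", "predicate": "pathogenic_for", "object": "EX:disease1"},
    {"source": "source_3", "evidence_type": "type_a",
     "subject": "EX:variant1", "predicate": "pathogenic_for", "object": "EX:disease1"},
    {"source": "source_2", "evidence_type": "type_b",
     "subject": "EX:variant2", "predicate": "pathogenic_for", "object": "EX:disease1"},
]


def aggregate(assertions, exclude_sources=frozenset(), evidence_types=None):
    """Filter atomic assertions first, then group the survivors by association."""
    grouped = defaultdict(list)
    for a in assertions:
        if a["source"] in exclude_sources:
            continue
        if evidence_types is not None and a["evidence_type"] not in evidence_types:
            continue
        grouped[(a["subject"], a["predicate"], a["object"])].append(a)
    return dict(grouped)


everything = aggregate(ASSERTIONS)
without_source_2 = aggregate(ASSERTIONS, exclude_sources={"source_2"})
only_type_b = aggregate(ASSERTIONS, evidence_types={"type_b"})
```

Note that excluding `source_2` makes the `EX:variant2` association disappear entirely, since no other source asserts it. A materialized merge would need to keep enough per-assertion detail on each evidence line to give the same answer.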

A second point here is that, all this being said, at present we are unlikely to encounter equivalent associations _across_ sources in our data, as each source deals for the most part with disjoint sets of associations. ClinVar is a unique exception, as it aggregates variant-disease assertions from many sources. It therefore does contain many assertions of the same association, each with their own evidence/provenance trails. But these are housed _within_ a single data source (ClinVar), so they do the hard work of merging assertions of the same associations (SCVs aggregated under RCVs). So perhaps for now we punt on the issue of dealing with equivalent associations _across_ sources, and focus on how to treat the merged associations _within_ ClinVar's data when transforming into our data model. This is a simpler and more immediate problem to tackle.

mbrush commented 8 years ago

To Chris' questions:

_What are the expected gains in terms of query efficiency/simplicity? This should be traded off against complexity in the pipeline_

Given my comment above, and assuming all agree on the need to identify and merge equivalent associations at some point in our data flow, I leave it to the architecture folks to decide at what point in the pipeline it would be most efficient to do so (in the ttl prior to SciGraph, in SciGraph itself, or during processing for presentation/queries). But I would advocate that whatever works best for Monarch, we create a dataset where associations are merged to provide to users.

_What happens if some sources provide association IDs?_

I guess this would depend on our approach for minting Monarch association identifiers. We could base our primary IRI on the source ID in cases where it is provided, or create the primary association IRI ourselves (as we will have to do for most sources), and then create a link to the source ID (via an owl:sameAs or xref, depending on the equivalence of our association and theirs).
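As a rough illustration of the second option (minting our own IRI and linking to the source's), the sketch below derives a deterministic identifier from the association's subject, predicate, and object, so equivalent assertions from different sources arrive at the same IRI. The `BASE_IRI` namespace, the hashing scheme, and the use of `oboInOwl:hasDbXref` for the "xref" case are assumptions for illustration only, not Monarch's actual minting policy.

```python
import hashlib

BASE_IRI = "https://example.org/association/"  # placeholder namespace, not Monarch's


def mint_association_iri(subject, predicate, obj):
    """Derive a stable IRI from the association's identifying parts."""
    key = "|".join((subject, predicate, obj))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return BASE_IRI + digest


def association_links(subject, predicate, obj, source_association_iri=None, exact=True):
    """Return the minted IRI plus any triples linking it to a source-provided ID.

    exact=True  -> owl:sameAs (the source's association means the same as ours)
    exact=False -> a looser cross-reference (assumed here: oboInOwl:hasDbXref)
    """
    iri = mint_association_iri(subject, predicate, obj)
    triples = []
    if source_association_iri is not None:
        link = "owl:sameAs" if exact else "oboInOwl:hasDbXref"
        triples.append((iri, link, source_association_iri))
    return iri, triples


iri_1, links_1 = association_links("EX:geno1", "has_phenotype", "MP:0000372")
iri_2, links_2 = association_links("EX:geno1", "has_phenotype", "MP:0000372",
                                   source_association_iri="EX:source_assoc_42",
                                   exact=False)
```

Both calls yield the same minted IRI because the identifying parts are the same; only the second produces a link triple. This only works if the inputs are already normalized to canonical identifiers, so it depends on the equivalence rules in #271.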

_Upstream is good but presumably equivalence sets span files_

Yes, one idea for doing this upstream (post-DIPper and pre-SciGraph) was to identify all dumped ttl files that could possibly contain equivalent associations across sources, merge them, and use some SPARQL-transform-like approach for post-processing to create the merged associations.

_Did you discuss the golr output?_

I will defer to Jeremy and Kent here.

cmungall commented 8 years ago

I'm leaning towards late binding, ie at the UI level, in the association pages.

We should definitely eliminate redundancy in the app on disease, gene etc. pages. But this may best be done at the golr level. The approach taken should be the same as for https://github.com/geneontology/amigo/issues/294

monarch-initiative / dipper

Merging equivalent Associations in SciGraph #269