Open mbrush opened 8 years ago
Thanks Matt, this is great. In the fullness of time, it will also be important to identify where a single line of evidence may be repeated in different sources (rather than multiple lines of evidence for a single association), e.g. a paper that is curated in separate places. That is another problem for another day.
Brief summary of discussion on 2-24-16 DIPper call: Attending: Matt, Julie, Tom, Kent, Jeremy
The golr output process should not be affected by that, or at worst only small changes will be needed.
For info, we have 4m associations right now in SciGraph.
Right, but we have the analogous decision to make for golr. On the one hand it's simpler to have the same model; on the other there may be considerations based on how we want to filter things in the UI.
I want to step back and clarify our main use cases for needing to identify and aggregate/merge 'equivalent' associations (assertions making the same claim about a G-P relationship) in the first place. This relates to the need to aggregate and assess all provenance/evidence information that supports and/or refutes each such 'unique' association in the data. In the UI, we may want to present this information on something like the Association pages we are currently testing, so users can evaluate an association based on all the evidence there is _for_ and _against_ it, and where this evidence comes from. If we capture assertions but don’t have some way of identifying those that assert the same association, we cannot address such use cases. At some point in the life cycle of our data, we would need the ability to collapse all equivalent assertions.
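To make the _for_ and _against_ use case concrete, here is a minimal Python sketch of what an Association page would need once equivalent assertions can be identified. Everything in it is an assumption for illustration: the record shape, the `summarize_evidence` helper, the `supports`/`refutes` labels, and the placeholder identifiers are hypothetical and are not taken from DIPper, SciGraph, or golr.

```python
from collections import defaultdict


def summarize_evidence(evidence_lines):
    """Group evidence lines by association key and split them by direction.

    Each evidence line is a dict with hypothetical keys:
    'association' (a hashable key identifying the claim), 'source',
    and 'direction' (either 'supports' or 'refutes').
    """
    summary = defaultdict(lambda: {"supports": [], "refutes": []})
    for line in evidence_lines:
        summary[line["association"]][line["direction"]].append(line["source"])
    return dict(summary)


# Placeholder data only: identifiers and source names are made up.
key = ("genotype:1", "phenotype:1")
evidence_lines = [
    {"association": key, "source": "source_A", "direction": "supports"},
    {"association": key, "source": "source_B", "direction": "supports"},
    {"association": key, "source": "source_C", "direction": "refutes"},
]

summary = summarize_evidence(evidence_lines)
```

The grouping only works if all three records carry the same association key, which is exactly the equivalence problem this ticket is about.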
Of course, for this use case the aggregation/merging need not happen in the ttl or SciGraph data itself. The data could capture the more atomic assertions made by each source, and the aggregation could happen dynamically for purposes of presentation and queries. Ultimately, what is most efficient is up to the architecture folks. But to me it seems prudent to create, at some point, a data artifact materializing these aggregations. If nothing else, this may be useful to take the burden of merging associations off the tooling - especially if/when we serve linked data for third-party re-use, where users may not grasp all the nuances/requirements for association equivalence (see #271).
A second point here is that, all this being said, at present we are unlikely to encounter equivalent associations _across_ sources in our data, as each source deals for the most part with disjoint sets of associations. ClinVar is a unique exception, as it aggregates variant-disease assertions from many sources. It therefore does contain many assertions of the same association, each with its own evidence/provenance trail. But these are housed _within_ a single data source (ClinVar), which does the hard work of merging assertions of the same association (SCVs aggregated under RCVs). So perhaps for now we punt on the issue of dealing with equivalent associations _across_ sources, and focus on how to treat the merged associations _within_ ClinVar's data when transforming it into our data model. This is a simpler and more immediate problem to tackle.
To Chris' questions:
_What are the expected gains in terms of query efficiency/simplicity? This should be traded off against complexity in the pipeline?_
_What happens if some sources provide association IDs?_
_Upstream is good but presumably equivalence sets span files_
_Did you discuss the golr output?_
I'm leaning towards late binding, ie at the UI level, in the association pages.
We should definitely eliminate redundancy in the app on disease, gene etc. pages. But this may best be done at the golr level. The approach taken should be the same as for https://github.com/geneontology/amigo/issues/294
The DIPper pipeline creates oban:Associations for each G2P link it ingests from each data source, dumping these into output ttl files that get loaded into SciGraph. Each of these Associations represents an 'assertion' as made by a specific source or database, but it is possible that more than one source will assert the same Association. For example, MGI and IMPC might both assert the Association between the same genotype (e.g. 'Rn/Rn [C57BL/6]') and phenotype (e.g. MP:0000372 ! 'randomly distributed white hairs'), leading to 'equivalent' Associations being dumped into SciGraph (Figures [A] and [B] at the bottom of the ticket diagram the RDF representations of these Associations). These represent the same underlying Association or fact, as made in two different assertions (one by MGI, one by IMPC).
For purposes of more efficient queries and data operations, it makes sense to collapse these under one Association, and maintain the provenance of the separate assertions in the evidence lines that support the Association. (One model for doing this is diagrammed in [C] below, although there are alternative models for how this merge might look.)
The question is, assuming we want to perform this merge, how and where would we perform it? Equivalent Associations from different sources don’t 'meet' each other until they have left DIPper and entered SciGraph, where data across all sources is aggregated. Some post-DIPper processing step needs to happen at a point after all data that could possibly contain equivalent associations has been aggregated.
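As a rough illustration of what such a post-aggregation step would have to do, here is a minimal Python sketch that collapses source-level assertions sharing the same subject, relation and object into one merged Association, keeping each asserting source as provenance. It is only a sketch under stated assumptions: the `Assertion` and `MergedAssociation` classes, the `merge_assertions` function, and the `has_phenotype` relation name are hypothetical, and it treats an exact match on the triple as the equivalence criterion, which may be too strict or too loose in practice (see #271).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """One source-level assertion of a G2P link (hypothetical shape)."""
    subject: str    # e.g. a genotype label or identifier
    predicate: str  # hypothetical relation name
    obj: str        # e.g. a phenotype term ID
    source: str     # the database making the assertion


@dataclass
class MergedAssociation:
    """One Association with the sources that asserted it kept as provenance."""
    subject: str
    predicate: str
    obj: str
    evidence_sources: list = field(default_factory=list)


def merge_assertions(assertions):
    """Collapse assertions with an identical (subject, predicate, obj) triple."""
    grouped = defaultdict(list)
    for a in assertions:
        grouped[(a.subject, a.predicate, a.obj)].append(a.source)
    return [
        MergedAssociation(s, p, o, sorted(set(sources)))
        for (s, p, o), sources in grouped.items()
    ]


# The MGI/IMPC example from this ticket; 'has_phenotype' is an assumed name.
assertions = [
    Assertion("Rn/Rn [C57BL/6]", "has_phenotype", "MP:0000372", "MGI"),
    Assertion("Rn/Rn [C57BL/6]", "has_phenotype", "MP:0000372", "IMPC"),
]
merged = merge_assertions(assertions)
```

Whether this grouping runs as a materialized step after loading, or dynamically at query time, is the open question; the logic itself is the same in either place.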
I will toss out some alternative approaches for discussion (bearing in mind my naivety as to the technical feasibility and efficiency of these options):
Some issues to consider:
FIGURES
[A] MGI assertion
[B] IMPC Assertion of same Association
[C] MGI and IMPC assertions merged under same Association