monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Generally improve evidence/provenance capture and reporting #787

Closed mbrush closed 5 years ago

mbrush commented 5 years ago

At the May 2019 F2F in Corvallis, and on recent Monarch Data calls, we discussed the need to provide better provenance and evidence for associations in the Monarch app. Internally this is useful for us to understand, QC, and document our own data. And of course for external users it is critical for them to trust and apply the data we provide.

At present, unfolding the Support column for a Monarch association shows an unorganized list of sources, evidence codes, and publications (see figure below). One key requirement is to be clear about which reported source(s) actually made the assertion captured in an association, as opposed to which sources provided supporting data used in inferring it. This includes indicating when the association is inferred by Monarch by joining/reasoning over data from one or more sources. When an association is inferred, we should provide access to the inference/reasoning path, which is currently captured in 'evidence graphs' served by the Biolink API but needs to be rendered in a more human-readable way (graph-viz was proposed).

Another requirement is to organize the evidence and publications according to the sources that used them, in particular when an association is asserted by more than one source. This lets us understand which sources used which publications and what type of evidence those publications provided. This organization was suggested at the May F2F, perhaps also distinguishing 'asserting sources' from 'supporting sources' rather than simply calling all of them 'sources'.
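To make the grouping idea concrete, here is a minimal Python sketch. It is not existing dipper or Monarch code: the `SupportRecord` class, the `group_by_source` function, the role names, and all source, evidence, and publication identifiers are hypothetical placeholders chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportRecord:
    """One piece of support for an association (hypothetical structure)."""
    source: str       # reported source name
    role: str         # "asserting" or "supporting"
    evidence: str     # evidence code used by that source
    publication: str  # publication cited by that source


def group_by_source(records):
    """Group (evidence, publication) pairs by role, then by source."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        grouped[record.role][record.source].append(
            (record.evidence, record.publication)
        )
    # Convert nested defaultdicts to plain dicts for predictable behavior.
    return {role: dict(sources) for role, sources in grouped.items()}


# Placeholder data only; these are not real identifiers.
records = [
    SupportRecord("SourceA", "asserting", "ECO:placeholder-1", "PMID:placeholder-1"),
    SupportRecord("SourceA", "asserting", "ECO:placeholder-2", "PMID:placeholder-2"),
    SupportRecord("SourceB", "supporting", "ECO:placeholder-1", "PMID:placeholder-1"),
]

grouped = group_by_source(records)
for role, sources in grouped.items():
    print(role)
    for source, items in sources.items():
        print(f"  {source}")
        for evidence, publication in items:
            print(f"    {evidence}  {publication}")
```

Keying first on role and then on source keeps 'who asserted it' separate from 'who supplied supporting data', and each source carries only the evidence and publications it actually used.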

Hoping others can add examples they have come across of evidence/provenance shortcomings. A simple example I can provide is the association reported here between the ALG9 gene and the Multicystic kidney dysplasia phenotype. The 'Support' drop-down lists one ECO code, one publication, and two sources (HPOA and CLINVAR).

[Figure: ALG9-MKD]

We wouldn't know it from the metadata shown here, but neither of these sources directly asserts the reported association. Rather, it is inferred by Monarch by joining data along the path gene -> variant -> disease -> phenotype (as specified in the Cypher query here). Furthermore, it is not clear how the publication and evidence code are used by or related to the indicated sources (who used what, and how).
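As a rough illustration of a more readable rendering of such a path, the sketch below turns an ordered list of node labels into Graphviz DOT text using plain string formatting. The `path_to_dot` function and its `asserted_by` parameter are hypothetical; this does not reflect how the Biolink API structures evidence graphs, and the example leaves per-edge sources empty because the issue does not say which source supplies which step.

```python
def path_to_dot(steps, asserted_by=None):
    """Render an inference path as Graphviz DOT text.

    steps: ordered node labels along the path.
    asserted_by: optional dict mapping (from, to) pairs to the name of the
        source that asserted that step (hypothetical structure).
    """
    asserted_by = asserted_by or {}
    lines = ["digraph inference_path {", "  rankdir=LR;"]
    # One solid edge per directly asserted step in the path.
    for start, end in zip(steps, steps[1:]):
        label = asserted_by.get((start, end))
        attrs = f' [label="{label}"]' if label else ""
        lines.append(f'  "{start}" -> "{end}"{attrs};')
    # One dashed edge for the association inferred from the whole path.
    lines.append(
        f'  "{steps[0]}" -> "{steps[-1]}" '
        '[style=dashed, label="inferred by Monarch"];'
    )
    lines.append("}")
    return "\n".join(lines)


path = ["gene", "variant", "disease", "phenotype"]
dot = path_to_dot(path)
print(dot)
```

The dashed edge separates the inferred gene-to-phenotype association from the individual steps, and the optional per-edge labels leave room to show which source contributed each step once that information is available.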

monicacecilia commented 5 years ago

Chatting with @kshefchek & @cmungall

To organize provenance data, turn BBOP/OBO format into a table?

We want people to be able to see the chain of inference.

kshefchek commented 5 years ago

Can we merge this with https://github.com/monarch-initiative/monarch-ui/issues/28? I will have a prototype up by early next week.