monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Update Source: ClinVar - Refine evidence/provenance metadata #439

Open mbrush opened 7 years ago

mbrush commented 7 years ago

The first-pass ingest of ClinVar-XML (#276) only pulled minimal evidence and provenance information.
As SEPIO matures, we will soon be at a point where we can revisit this source to pull more of this info. A couple specific things to address:

  1. Modeling of evidence for associations based on the 'literature only' Method needs fixing. According to their documentation, this Method is cited when "Data is extracted from published literature with interpretation as reported in the citation". So the agent asserting the SCV here is just parroting assertions made in published papers. Our current model creates a single evidence line for these SCVs and links it to all referenced pubs. But in reality each pub is making an assertion that is used as evidence for the submitting agent's assertion. So we should create separate evidence lines (typed as an ECO:TAS) for each referenced paper, where each line is linked only to that single publication. We could also create an assertion bnode as the supporting info, since we can infer that such assertions are made given the definition of 'literature only'.
  2. Modeling of evidence for associations based on the 'curation' Method also needs fixing. This method is used "for variants that were not directly observed by the submitter, but were interpreted by curation of multiple sources, including clinical testing laboratory reports, publications, private case data, and public databases." Here it is not clear how many lines of evidence exist - only that a set of referenced pubs was used in finding evidence to assess in making the SCV assertion. Here I might propose just linking the referenced pubs to the assertion directly, using dc:source. We could skip creating any evidence line (since we have no idea what or how many there are). Or we could create one line and link it to a supporting 'data curation' activity (so we continue to capture and be able to search/filter on assertions generated through literature curation).
  3. There is sparsely populated metadata about clinical subjects genotyped and studied in generating evidence for some assertions (those tagged with the 'clinical testing' Method in particular). We could attempt to bring this data in as well - but it would likely not add much value, as it is sparse.
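
To make proposals 1 and 2 concrete, here is a minimal sketch in plain Python that emits triples as tuples. It is not dipper code: the function name, the blank-node naming scheme, and the two `SEPIO:`-prefixed predicate labels are placeholders invented for illustration, and `ECO:TAS` is used as written above rather than as a resolved ECO identifier. For 'curation' it implements only the first option (no evidence line, pubs attached via dc:source).

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Placeholder predicate labels (assumptions, not verified SEPIO terms).
HAS_EVIDENCE_LINE = "SEPIO:has_supporting_evidence_line"
HAS_SUPPORTING_REF = "SEPIO:has_supporting_reference"
RDF_TYPE = "rdf:type"
DC_SOURCE = "dc:source"
ECO_TAS = "ECO:TAS"  # shorthand from the issue text, not a real CURIE


def evidence_triples(assertion: str, method: str, pubs: List[str]) -> List[Triple]:
    """Return provenance triples for one SCV assertion, keyed on its Method."""
    triples: List[Triple] = []
    if method == "literature only":
        # One evidence line per publication, each linked to that pub only.
        for i, pub in enumerate(pubs, start=1):
            line = "_:{}_evidence_{}".format(assertion.replace(":", "_"), i)
            triples.append((assertion, HAS_EVIDENCE_LINE, line))
            triples.append((line, RDF_TYPE, ECO_TAS))
            triples.append((line, HAS_SUPPORTING_REF, pub))
    elif method == "curation":
        # Number of evidence lines is unknown, so attach pubs to the assertion.
        for pub in pubs:
            triples.append((assertion, DC_SOURCE, pub))
    # Other Methods are outside the scope of this sketch and yield no triples.
    return triples
```

Calling `evidence_triples(scv, "literature only", pubs)` yields three triples per publication, while the 'curation' branch yields one dc:source triple per publication. The alternative 'data curation' activity option from item 2 would need one extra evidence-line node and is left out here.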

Note also that I suspect that ClinVar may be changing its data model given recent activity in the ClinGen community - so given the low immediate value of evidence metadata we collect from ClinVar, it may be best to just make the easy fixes to 1 and 2 above, and not spend time parsing out additional metadata as in 3.

mbrush commented 7 years ago

Also, consider that most referenced pubs for SCVs that are not 'literature only' or 'curation' do not contain evidence, but rather point to things like methods used or documentation of guidelines (e.g. for SCV000267702). So consider not linking these as supporting references for an evidence line.

cmungall commented 3 years ago

I see a lot of ClinVar tickets open. Is this one still relevant?

Is it a fair summation to say that the XML poses a lot of issues for us?

Note the MacArthur lab used to maintain a parser for ClinVar XML:

https://github.com/macarthur-lab/clinvar

They now say this is not supported, as the ClinVar VCF provides everything.

Would it make sense for us to switch to ingest the VCF? VCF is standard and there are many other areas where it would be useful to have a VCF->kg/rdf mapping.

It may make sense to think about what our end goal is here and to work backwards to the most maintainable strategy.
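
As a rough illustration of what a VCF->RDF mapping could look like, here is a minimal sketch that turns one VCF data line into triples using only the standard library. The `ClinVarVariant:` prefix and all `ex:` predicates are hypothetical placeholders, and the sketch assumes the record has the eight standard VCF columns and that clinical significance arrives in a `CLNSIG` INFO key.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a dict; bare flags map to 'true'."""
    fields: Dict[str, str] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else "true"
    return fields


def vcf_record_to_triples(line: str) -> List[Triple]:
    """Map one tab-separated VCF data line to (subject, predicate, object) triples."""
    # Assumes at least the 8 fixed VCF columns are present.
    chrom, pos, var_id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    fields = parse_info(info)
    variant = "ClinVarVariant:" + var_id  # hypothetical CURIE prefix
    triples: List[Triple] = [
        (variant, "ex:chromosome", chrom),
        (variant, "ex:position", pos),
        (variant, "ex:ref_allele", ref),
        (variant, "ex:alt_allele", alt),
    ]
    if "CLNSIG" in fields:
        triples.append((variant, "ex:clinical_significance", fields["CLNSIG"]))
    return triples
```

Header lines (those starting with `#`) would need to be skipped before calling `vcf_record_to_triples`. Whether the VCF carries enough submission-level detail for the evidence modeling discussed earlier in this issue is a separate question this sketch does not address.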