monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Add data source ClinVar XML #276

Closed mbrush closed 6 years ago

mbrush commented 8 years ago

Our initial ingest of ClinVar (see #7) used a TSV file that lacks much of the data captured only in ClinVar's XML dumps. Our second pass will use the XML data and include full evidence and provenance metadata.

xml data dump: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_2015-12.xml.gz
xsd schema: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xsd_public/
data dictionary: http://www.ncbi.nlm.nih.gov/projects/clinvar/ClinVarDataDictionary.pdf
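
Since the full release is a large gzipped XML file, a streaming parse is one way to read it without loading everything into memory. The sketch below is not Dipper's code. It assumes the dump wraps each record in a `ClinVarSet` element containing one or more `ClinVarAssertion` (SCV) children, each with a `ClinVarAccession` carrying an `Acc` attribute and a `ClinicalSignificance/Description` text node. Check those names against the xsd before relying on them. `iter_scv_records` and `open_dump` are illustrative names.

```python
import gzip
import xml.etree.ElementTree as ET


def open_dump(path):
    """Open a gzipped ClinVar XML dump as a binary stream."""
    return gzip.open(path, "rb")


def iter_scv_records(xml_stream):
    """Yield one dict per ClinVarAssertion (SCV) found in the stream.

    Element and attribute names are assumptions about the ClinVar
    full-release layout; verify them against the published xsd.
    """
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "ClinVarSet":
            continue
        for scv in elem.iter("ClinVarAssertion"):
            acc = scv.find("ClinVarAccession")
            yield {
                "scv": acc.get("Acc") if acc is not None else None,
                "significance": scv.findtext("ClinicalSignificance/Description"),
            }
        # Drop the finished ClinVarSet subtree so memory stays bounded.
        elem.clear()
```

With a local copy of the dump, usage would look like `with open_dump("ClinVarFullRelease_2015-12.xml.gz") as fh: for rec in iter_scv_records(fh): ...`.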

mbrush commented 8 years ago

Issue #269 raised the question of where in the data pipeline to merge or collapse 'equivalent' associations (those that claim the same fact) under a single association node. As of March 3, the consensus is to merge late, at the UI level, so our DIPper pipeline will not perform this merge in the data that is output as ttl and loaded into SciGraph. While the consensus is that this is the easiest path forward technically for Monarch's needs, I would advocate that, for any linked data we provide to the community, we create an RDF dataset in which equivalent associations are merged.
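
To make the "merge late" idea concrete, here is a minimal sketch of collapsing equivalent associations as a post-processing step over already-ingested records. It is not how Dipper or the Monarch UI does it. The dict keys (`variant`, `relation`, `condition`, `scv`) and the function name are hypothetical, and "equivalent" is assumed to mean "same variant, same relation, same condition".

```python
from collections import defaultdict


def merge_equivalent(associations):
    """Group per-SCV associations that assert the same fact.

    Returns a dict mapping (variant, relation, condition) to the list
    of SCV identifiers that support that merged association.
    """
    merged = defaultdict(list)
    for assoc in associations:
        key = (assoc["variant"], assoc["relation"], assoc["condition"])
        merged[key].append(assoc["scv"])
    return dict(merged)
```

Because the grouping runs on the output rather than inside the ingest, the per-SCV associations stay intact in the ttl, and a merged view can be derived from them whenever it is needed.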

mbrush commented 8 years ago

Diagram of the initial proposal for structuring ClinVar Monarch:associations (which correspond for now to a single SCV):

[Diagram: clinvar_pass_1, 3-3-16]

For our first pass at ingesting this data we will:

  1. Bring in the individual SCVs as instances of oban:associations (but do not group SCVs that make the same association under a merged association node).
  2. Ingest only SCVs that use the five relationships from the ACMG classification guidelines (pathogenic, likely pathogenic, benign, likely benign, uncertain significance). We will create five separate GENO object properties to use in our associations (rather than collapsing these five into three, as previously proposed).
  3. Capture the following evidence and provenance information:
    • the asserting agent/organization (e.g. ENIGMA, Counsyl)
    • classification methods/guidelines used (e.g. 'ENIGMA BRCA1/2 Classification Criteria')
    • dates assertions were created, last updated, and last evaluated.
    • supporting data for now will include only publications
    • the 'MethodType' ClinVar uses to describe the type of study that generated the data/evidence: one of 'clinical testing', 'literature only', 'reference population', 'research', 'curation', or 'case-control' (the last is not used in the BRCA subset)
    • some minimal info about the variant itself (as curated by clinvar)
  4. (For round 2?) Create links of 'equivalence' (sameAs? equivalent_to?) and 'contradiction' (contradicts) between associations that assert the same or contradictory facts, respectively. This will support merging of equivalent associations in subsequent processing steps, and links from associations to refuting evidence.
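
Steps 2 and 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Dipper implementation. The record keys are hypothetical, the link labels `equivalent` and `contradicts` are placeholders rather than real ontology terms, and "contradictory" is assumed here to mean one assertion on the pathogenic side and one on the benign side for the same variant and condition.

```python
from itertools import combinations

# The five ACMG classification terms named in step 2.
ACMG_TERMS = {
    "pathogenic",
    "likely pathogenic",
    "benign",
    "likely benign",
    "uncertain significance",
}

# Assumption: only pathogenic-side vs benign-side counts as a contradiction.
_SIDE = {
    "pathogenic": "P",
    "likely pathogenic": "P",
    "benign": "B",
    "likely benign": "B",
}


def normalize(term):
    """Lower-case and trim a significance label for comparison."""
    return term.strip().lower()


def keep_acmg(records):
    """Keep only records whose significance is one of the five ACMG terms."""
    return [r for r in records if normalize(r["significance"]) in ACMG_TERMS]


def link_associations(records):
    """Return (scv_a, link, scv_b) triples for pairs about the same fact."""
    links = []
    for a, b in combinations(records, 2):
        if (a["variant"], a["condition"]) != (b["variant"], b["condition"]):
            continue
        sig_a, sig_b = normalize(a["significance"]), normalize(b["significance"])
        if sig_a == sig_b:
            links.append((a["scv"], "equivalent", b["scv"]))
        elif {_SIDE.get(sig_a), _SIDE.get(sig_b)} == {"P", "B"}:
            links.append((a["scv"], "contradicts", b["scv"]))
    return links
```

The pairwise comparison is quadratic in the number of records, which is fine for a sketch. A real pass would likely bucket records by variant and condition first.
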

kshefchek commented 6 years ago

https://github.com/monarch-initiative/dipper/pull/334
https://github.com/monarch-initiative/dipper/pull/335