Closed by mbrush 6 years ago
Issues were raised in #269 about where in the data pipeline to merge/collapse 'equivalent' associations, i.e. those that claim the same fact, under a single association node. As of March 3, the consensus is to merge late, at the UI level, so our DIPper pipeline will not perform this merge in the data that gets output as ttl and loaded into SciGraph. While the consensus is that this is the easiest path forward from a technical perspective for Monarch's needs, I would advocate that, for the purposes of any linked data we provide to the community, we create an RDF dataset in which equivalent associations are merged.
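To make the merge concrete, here is a minimal sketch of collapsing equivalent associations, assuming that "equivalent" means sharing the same subject, predicate, and object. The `Association` type, its field names, and the identifiers in the usage example are hypothetical and are not taken from the DIPper code.

```python
from collections import defaultdict
from typing import NamedTuple


class Association(NamedTuple):
    # Hypothetical record shape, for illustration only.
    subject: str    # e.g. a variant identifier
    predicate: str  # e.g. the relation being asserted
    obj: str        # e.g. a disease identifier
    source: str     # the record making the claim (e.g. one SCV)


def merge_equivalent(associations):
    """Group associations that assert the same (subject, predicate, object).

    Returns a dict mapping each distinct fact to the list of sources that
    claim it, in first-seen order and without duplicates.
    """
    merged = defaultdict(list)
    for assoc in associations:
        key = (assoc.subject, assoc.predicate, assoc.obj)
        if assoc.source not in merged[key]:
            merged[key].append(assoc.source)
    return dict(merged)
```

For example, two associations that differ only in `source` end up under one key, with both sources kept as supporting records. Merging late at the UI level would apply the same grouping at display time rather than in the ttl output.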
Diagram of the initial proposal for structuring ClinVar Monarch:associations (which correspond for now to a single SCV):
For our first pass at ingesting this data we will:
Our initial ingest of ClinVar (see #7) used a tsv file that lacked much of the data captured only in ClinVar's XML dumps. Our second pass will use the XML data and include full evidence and provenance metadata.
xml data dump: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_2015-12.xml.gz
xsd schema: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xsd_public/
data dictionary: http://www.ncbi.nlm.nih.gov/projects/clinvar/ClinVarDataDictionary.pdf
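Since the full release is a large gzipped XML file, a streaming parse is probably the practical way to read it. The following is a rough sketch using only the Python standard library. The function name `count_elements` is hypothetical, and the element name used in the usage note is an assumption that should be checked against the xsd schema above.

```python
import gzip
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` in a gzipped XML source.

    `source` may be a file path or a binary file object. The document is
    parsed incrementally instead of being read into memory in one piece.
    """
    count = 0
    with gzip.open(source, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == tag:
                count += 1
                # Drop this element's children and text once it is counted.
                elem.clear()
    return count
```

A call such as `count_elements("ClinVarFullRelease_2015-12.xml.gz", "ClinVarSet")` would then count the top-level records, assuming `ClinVarSet` is the record element name. An actual ingest would extract evidence and provenance fields inside the loop instead of just counting.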