monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

General review of class punning in dipper data models #258

Open mbrush opened 8 years ago

mbrush commented 8 years ago

We utilize a fair amount of punning to simplify our data model and use existing ontologies as CVs for describing our data. The most common example of this is punning gene class IRIs in linking to them from variants/alleles of the gene, e.g.:

  shha<tbx392>   is_allele_of   ZFIN:ZDB-GENE-980526-166  (d. rerio shha gene)

We also commonly pun classes of phenotypes, taxa, evidence codes, developmental stages, zygosity types, etc. I'd like to review and/or establish some best practices here to be sure we use this practice correctly, coherently, and consistently. A couple of specific questions are posed below:

  1. A related issue involves the utility of 'self-typing' (see ticket #228 as well). Is there utility in automatically asserting each punned IRI as an rdf:type of itself (so that the a-box individual inherits the attributes of the t-box class)? What are the costs/downsides of this, in particular adding very many additional triples to the data (though these could be partitioned into a separate file)? Are there other foreseeable consequences?
  2. We try to pun class IRIs only as objects of triples, but in some cases we pun them as subjects of triples. Are there concerns that this makes a statement about the IRI that may not be universally true? E.g. if we pun a gene IRI into an individual and create the triple gene subsequence_of chromosome 7, have we broken any unspoken rules?
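
To make the cost side of question 1 concrete, here is a minimal sketch of self-typing. It uses plain Python tuples as a stand-in for an RDF graph rather than dipper's actual graph code; the function name `self_typing_triples`, the `class_iris` set, and the `EX:chr7` identifier are all hypothetical and only for illustration.

```python
RDF_TYPE = "rdf:type"

def self_typing_triples(triples, class_iris):
    """Return one (iri, rdf:type, iri) triple per class IRI punned as an individual.

    triples    -- iterable of (subject, predicate, object) string tuples
    class_iris -- set of IRIs known to be declared as classes in some ontology
                  (assumed to be supplied by the caller)
    """
    punned = set()
    for s, p, o in triples:
        # In "x rdf:type C", C is used as a class, so only the subject
        # position counts as individual-like use there.
        candidates = (s,) if p == RDF_TYPE else (s, o)
        punned.update(node for node in candidates if node in class_iris)
    return sorted((iri, RDF_TYPE, iri) for iri in punned)

# Toy data modelled on the examples in this ticket.
data = [
    ("shha<tbx392>", "is_allele_of", "ZFIN:ZDB-GENE-980526-166"),
    ("ZFIN:ZDB-GENE-980526-166", "subsequence_of", "EX:chr7"),  # EX:chr7 is a made-up id
]
classes = {"ZFIN:ZDB-GENE-980526-166", "EX:chr7"}

extra = self_typing_triples(data, classes)
for triple in extra:
    print(triple)
```

The number of extra triples grows with the number of distinct punned IRIs, not with the number of statements that mention them, and because the result is a separate list it could be written to its own file as suggested above.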

mbrush commented 8 years ago

This is related to the fact that, because we define a-box relationships between classes punned as individuals in the data, the same connection between concept A and concept B can get encoded twice: in the t-box as class axioms linking A and B in an ontology file, and again in the a-box as direct OP (object property) assertion axioms linking A and B. We do this to facilitate RDF/graph-based queries (where we have no DL reasoning and don't want to query across the nesting and reification of the OWL representation in RDF), but it creates potential for knowledge to be duplicated and inconsistent across the t-box and a-box representations.

  1. Example 1: genes part_of chromosomes. For zebrafish, we express the fact that the shha gene is part of chromosome 7 as a t-box class axiom in the MONOCHROM ontology, but also represent this fact as an a-box OP assertion axiom in the ZFIN data (i.e. a triple linking the punned shha class IRI to the punned chromosome 7 class IRI). This facilitates graph-based query of the data in the absence of DL reasoning/queries (e.g. so that shha comes back in a query for phenotypes of all zebrafish genes on Chr 7).
  2. Example 2: IMPC procedures part_of pipelines. IRIs representing punned IMPRESS procedure classes are asserted as parts of punned IMPRESS pipeline IRIs using OP assertions in the IMPC data. But we also want to represent this relationship at the class level, e.g. that all 'DEXA' procedures are part of 'EUMODIC pipeline 1'.
  3. Example 3: genes in_taxon organism/taxa

Anywho, just listing these as examples to help consider implications of the punning approach and duplication of knowledge in the a-box and t-box.
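
One way to keep an eye on the duplication described above is to compare the two encodings directly. The sketch below is only an illustration under strong assumptions: t-box axioms of the form "A SubClassOf (p some B)" are assumed to have already been flattened into (A, p, B) tuples, a-box assertions are plain (A, p, B) tuples, and every identifier prefixed `EX:` as well as the function name `compare_boxes` is hypothetical.

```python
def compare_boxes(tbox_axioms, abox_assertions):
    """Partition (subject, property, object) links by where they are encoded.

    tbox_axioms     -- links extracted from ontology class axioms
    abox_assertions -- links asserted directly between punned IRIs in the data
    """
    tbox, abox = set(tbox_axioms), set(abox_assertions)
    return {
        "both": tbox & abox,       # duplicated knowledge, currently consistent
        "tbox_only": tbox - abox,  # in the ontology but missing from the data
        "abox_only": abox - tbox,  # in the data but missing from the ontology
    }

# Toy links modelled on the three examples in this comment.
tbox = [
    ("ZFIN:ZDB-GENE-980526-166", "part_of", "EX:chr7"),
    ("EX:DEXA", "part_of", "EX:pipeline1"),
]
abox = [
    ("ZFIN:ZDB-GENE-980526-166", "part_of", "EX:chr7"),
    ("ZFIN:ZDB-GENE-980526-166", "in_taxon", "EX:zebrafish"),
]

report = compare_boxes(tbox, abox)
for bucket, links in report.items():
    print(bucket, sorted(links))
```

Links that show up in only one bucket are the places where the two representations can drift apart; whether such a difference is an error or intentional depends on the modelling decision this ticket is asking about.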