informatics-isi-edu / ermrest

ERMrest (rhymes with "earn rest") is a general relational data storage service for web-based, data-oriented collaboration.
Apache License 2.0

Extend ermrest schema definition language to include semantic definitions #72

Closed. cmungall closed this issue 7 years ago.

cmungall commented 8 years ago

Feel free to close if this is not in scope. We discussed this briefly at the OHSU meeting.

I propose extensions to the schema language in order to semantically annotate column and table definitions. Roughly:

  1. Add a new first-class datatype for OntologyClass (see #42). Any column described using this type can further be constrained by any combination of ontology classes (with the ability to specify conjunction or disjunction). For example, an anatomy field could be specified as containing any instance of UBERON:0001062 (anatomical entity).
  2. Add the ability to describe relationships (either between two columns in a table, or between a PK and FK spanning two tables). For example, RO:0002206 (expressed in) for a relationship between a gene and an anatomical entity, or rdfs:label between a PK and a label.
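As a rough sketch of what these annotations might look like (the annotation keys and field names below are invented for illustration and are not part of any existing ERMrest vocabulary):

```python
# Hypothetical annotation payloads for the two proposed extensions.
# The tag: URIs and field names are invented for this sketch only.

# (1) Constrain a column to instances of one or more ontology classes.
anatomy_column_annotation = {
    "tag:example.org,2016:ontology-class": {
        # any instance of UBERON:0001062 (anatomical entity)
        "any_of": ["UBERON:0001062"],
    }
}

# (2) Attach a semantic relationship to a foreign key between two tables,
# e.g. an expression row relates a gene to an anatomical entity.
expression_fkey_annotation = {
    "tag:example.org,2016:relationship": {
        "predicate": "RO:0002206",  # 'expressed in'
    }
}
```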

Motivations

In addition to providing explicit, formal documentation of what the schema means, this can deliver some pragmatic advantages.

For (1), this can be used to drive data entry and curation interfaces. For example, the type field could be fed into the SciGraph autocomplete interface, so that a curator is never offered a phenotype term when entering wild-type gene expression.
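For instance, a curation UI could pass the column's declared root class along to an autocomplete service. This is a minimal sketch using Python's requests; the endpoint URL and parameter names are assumptions for illustration, not the actual SciGraph API:

```python
import requests

# Hypothetical SciGraph-style autocomplete endpoint; the path and parameter
# names here are assumptions, not the real API.
AUTOCOMPLETE_URL = "https://scigraph.example.org/vocabulary/autocomplete"

def suggest_terms(prefix, root_class="UBERON:0001062"):
    """Suggest terms starting with `prefix`, restricted to descendants of the
    column's declared root class (so phenotype terms are never offered)."""
    resp = requests.get(AUTOCOMPLETE_URL,
                        params={"term": prefix, "root": root_class})
    resp.raise_for_status()
    return resp.json()
```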

For the combination of 1+2, we are essentially defining an RDF/OWL model for any ERMrest schema. This can deliver a number of advantages. From a selfish monarch/facebase perspective, it means we never have to write any explicit dipper code; we can have a generic mapping to go from any ermrest data into our database.

I think this will have some long term advantages to the ermrest framework. It will allow the simple integration of semantic constraints into ermrest. As a guiding example, consider an anatomical structure such as the frontonasal prominence. The ontology encodes a number of semantic constraints such as: (i) the structure ceases to exist post-embryonically (in particular species we can be more precise, e.g. CS18 in human); (ii) the structure is only found in vertebrates. Relational integrity constraints are insufficiently expressive to detect invalid states (e.g. if a gene is expressed in this structure in an adult, or in a non-vertebrate), but an OWL reasoner is well-equipped for performing these checks, using axioms encoded in ontologies such as Uberon. To run these checks it's necessary to translate the data to triples using the same relations and design patterns as in the ontology. This translation can be automated if the schema is sufficiently well-described.
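To make that last step concrete, here is a minimal sketch (using rdflib, with invented table/column names and row-URI scheme) of the row-to-triples translation that a well-annotated schema would let us generate automatically; the output reuses the ontology's own relations so a reasoner can apply Uberon's axioms to it:

```python
from rdflib import Graph, Namespace, URIRef

OBO = Namespace("http://purl.obolibrary.org/obo/")
EX = Namespace("http://example.org/")  # invented base for row/gene URIs

def obo_uri(curie):
    """Turn an OBO-style CURIE like 'UBERON:0004070' into its standard PURL."""
    return OBO[curie.replace(":", "_")]

# One row from a hypothetical `expression` table; the column names are invented.
row = {"gene": "gene-42", "anatomy": "UBERON:0004070", "stage": "adult"}

g = Graph()
# RO:0002206 'expressed in' relates the gene to the anatomical entity, using
# the same relation the ontology's axioms are written against.
g.add((EX[row["gene"]], obo_uri("RO:0002206"), obo_uri(row["anatomy"])))
print(g.serialize(format="turtle"))
```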

Implementation

There are existing mapping languages for SQL to RDF, e.g. D2RQ, but these are the wrong level of abstraction here.

Elements of JSON-LD should definitely be used, e.g. CURIE expansion at a minimum.
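As a minimal illustration of the CURIE-expansion piece (the prefix map below stands in for a real JSON-LD @context):

```python
# Minimal CURIE expansion against a JSON-LD-style context; the prefix map is a
# stand-in for a real @context document.
CONTEXT = {
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "RO": "http://purl.obolibrary.org/obo/RO_",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand_curie(curie, context=CONTEXT):
    prefix, _, local = curie.partition(":")
    return context[prefix] + local if prefix in context else curie

assert expand_curie("UBERON:0001062") == "http://purl.obolibrary.org/obo/UBERON_0001062"
```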

If this is a promising idea, I can draft more ideas on what the extension would look like.

I don't know how much work would be required on the ermrest framework. As a first pass, the extensions could be functionally silent, with functionality coming later.

cc @mellybelly @robes @kshefcheck @TomConlin

References

Anatomical constraints described in:

Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E., & Haendel, M. A. (2012). Uberon, an integrative multi-species anatomy ontology. Genome Biology, 13(1), R5. doi:10.1186/gb-2012-13-1-r5

How we use logical constraints and OWL reasoning for data quality checks in GO:

Mungall, C. J., Dietze, H., & Osumi-Sutherland, D. (2014). Use of OWL within the Gene Ontology. In M. Keet & V. Tamma (Eds.), Proceedings of the 11th International Workshop on OWL: Experiences and Directions (OWLED 2014) (pp. 25–36). Riva del Garda, Italy, October 17-18, 2014. doi:10.1101/010090

karlcz commented 8 years ago

I'm not entirely sure I understand your requested feature, nor am I following your argument...

  1. We already have an extensible annotation mechanism to add meta-model annotations on schema, table, column, key, and foreign key model elements. This is intended to allow other content interpretations to be bootstrapped on top of the relational store, by consumers aware of such meta-modeling concepts. So far, these are purely advisory and do not change the behaviors of the ERMrest APIs themselves since they are operating purely on relational storage.
  2. You could always introduce a new annotation key to describe how to generate RDF statements from a table, i.e. associating a predicate URI with each column. You can treat each row as a "blank node" for a set of statements, or even use an ERMrest URI (i.e. attribute-based naming) to produce a row URI. (A sketch of what such an annotation might look like follows after this list.) I am afraid there are already too many ways to do this, and we won't help much by defining yet another relational-to-graph mapping system. I am hopeful that such a mapping system could be referenced and cribbed here, if there are collaborations where such interpretations are really useful.
  3. No matter what, we allow arbitrary ERM modeling by catalog owners, and certainly cannot force all models to have semantic annotations nor all clients to be semantic web aware. Hence, you'd always have to be able to gracefully degrade to a purely relational interpretation or ignore all content that lacks semantic annotation. If anything, I think we'll be adding features to lower the barrier to entry for even less disciplined collaborations where models are not well understood.
  4. We have tried to allow PostgreSQL features to compose with our basic REST API. As a practical matter, we admit native SQL administrative tasks to accomplish things beyond the REST options. For example, we currently use row-level security policies, trigger procedures, and other check constraints in some projects even though these are not meaningfully introspected by ERMrest. While we may eventually extend the API to capture some idioms, I think there will always be a gap between what the server might do and what we can completely model in our catalog introspection. (If for no other reason than triggers and default expressions can use stored procedures that you can never introspect completely.)
  5. I have wondered whether we should put effort into exposing SQL domains as well as the basic types and foreign key constraints we already expose. I can imagine a meta-model assertion that a particular value domain maps to an ontological class...?
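To make point 2 concrete, here is a sketch of what such a table-level annotation and its interpretation might look like; the annotation structure, field names, and URI scheme are all invented for illustration:

```python
from rdflib import BNode, Graph, Literal, URIRef

# Hypothetical table-level annotation: map each column to a predicate URI and
# say how to mint the subject for each row. None of these keys exist today.
MAPPING = {
    "subject": {"mode": "row_uri", "base": "http://example.org/expression/"},
    "columns": {
        "anatomy": "http://purl.obolibrary.org/obo/RO_0002206",  # expressed in
        "label": "http://www.w3.org/2000/01/rdf-schema#label",
    },
}

def row_to_triples(row, mapping, graph):
    if mapping["subject"]["mode"] == "row_uri":
        subject = URIRef(mapping["subject"]["base"] + str(row["id"]))
    else:
        subject = BNode()  # fall back to a blank node per row
    for column, predicate in mapping["columns"].items():
        if column in row:
            # A real mapping would also have to say whether the object is a
            # resource or a literal; literals are used here to keep it short.
            graph.add((subject, URIRef(predicate), Literal(row[column])))
    return graph

g = row_to_triples({"id": 7, "anatomy": "UBERON:0004070", "label": "head"},
                   MAPPING, Graph())
```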

One thing you may not know: we had a previous product that was a collaborative graph store (though not an RDF graph store). We've made a conscious effort to move the collaboration space to relational models and away from arbitrary triple graphs with this product. Our experience has been that the ERM concepts of tabular data and basic data-integrity rules are more intuitive to our target audience...

robes commented 8 years ago

Yes, so we want to see what we can do within our annotation mechanism without resorting to changes to the ERMrest protocol and basic data concepts. We've had this debate internally for some time about whether to make vocabulary concepts a first-class citizen in ERMrest. I think @cmungall 's suggestions should be considered, but can we support them (at least partially) through annotations on the model, plus guidelines on how clients should interpret and use them?

karlcz commented 8 years ago

@cmungall I've been pondering your proposal in this issue description and have some more comments. I think I understand better now what you are proposing.

Regarding point 1 (ontology classes for columns)

My main question/concern is the relational integration:

  1. Wouldn't we want data-integrity constraints for these columns, i.e. actual foreign key or domain constraints in the catalog content?
  2. Wouldn't we still need to talk about the encoding of the column? Unless you intend that the full concept URI would be stored in each column value, I think we'd need further configuration for whether/how shortened identifier formats are being applied.

A while ago, we spit-balled a way to adopt some Chado concepts for similar goals. Namely, to introduce slightly denormalized cvterm and cvtermpath tables into an ERMrest catalog using textual dbxref identifiers as keys. This ignores any broader semantic web integration in favor of just focusing on controlled vocabulary terms as values in relational tuples.

The dbxref would seem to be a reasonable serialized term format to actually exchange in ERMrest JSON or CSV data tables.

Also, we could define domain tables to subset the cvterm table (via foreign-key reference) and use further foreign key references to indicate that an application data column must store dbxrefs from a particular restricted vocabulary. Can I restate your proposal as an annotation to describe how these domain tables are populated as a set of ontological class constraints, beyond just having the enumerated term set itself?
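That restatement could look something like the following sketch: a hypothetical annotation on a domain table naming the class constraint, plus a population step that expands it into dbxref rows (the annotation key, field names, and the `subclasses_of` callback are all invented here):

```python
# Hypothetical annotation on a domain table, describing how its rows are
# derived from an ontological class constraint rather than hand-enumerated.
anatomy_domain_annotation = {
    "tag:example.org,2016:vocabulary-domain": {
        "source_table": "cvterm",
        "include_descendants_of": ["UBERON:0001062"],  # anatomical entity
    }
}

def populate_domain(annotation, subclasses_of):
    """Expand the class constraint into the dbxref rows the domain table holds.
    `subclasses_of` stands in for an ontology lookup (e.g. a SciGraph or OWL
    query); it takes a class CURIE and returns descendant CURIEs."""
    spec = annotation["tag:example.org,2016:vocabulary-domain"]
    terms = set()
    for root in spec["include_descendants_of"]:
        terms.update(subclasses_of(root))
    return [{"dbxref": term} for term in sorted(terms)]
```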

Regarding point 2 (more general semantic graph transcription)

This point was what I was reacting to in my first comment on this issue. If I understood you correctly, you made an assumption that individual values in relational tuples would serve as subjects or objects in RDF statements, and your annotations would indicate what predicate to use in those statements. Wouldn't that require the subject columns to store URIs and the object columns URIs or scalars, as appropriate for the predicate?

I think fruitful work in this area needs to address the more difficult question of entity/resource identity. To serialize ERMrest content as a graph might require the introduction of a blank node to serve as the subject for many statements transcribed from tuples in one or more tables in ERMrest. One needs to understand when a foreign key or association represents a relationship between two disjoint sets of entities versus a normalization or partitioning of one entity set into sub-graphs capturing different types of statement.
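A small sketch of that distinction (table and column names invented): a foreign key between two genuine entity sets transcribes as a statement between two resources, while a normalization table only contributes more statements about the same subject node:

```python
from rdflib import BNode, Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Case 1: the foreign key links two disjoint entity sets, so the transcription
# is a statement between two resources.
g.add((EX["gene/42"], EX["expressed_in"], EX["anatomy/UBERON_0004070"]))

# Case 2: the foreign key merely partitions one entity's attributes across
# tables (normalization), so the second table adds more statements about the
# same subject instead of introducing a new resource.
subject = BNode()  # or the row URI of the primary entity
g.add((subject, EX["label"], Literal("frontonasal prominence sample")))
g.add((subject, EX["stage"], Literal("CS18")))
```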

karlcz commented 7 years ago

This discussion seems to have petered out long ago, so I'm going to close this for now.