Closed cmungall closed 7 years ago
I'm not entirely sure I understand your requested feature or follow your argument...
One thing you may not know: we had a previous product that was a collaborative graph store (though not an RDF graph store). We've made a conscious effort to move the collaboration space to relational models and away from arbitrary triple graphs with this product. Our experience has been that the ERM concepts of tabular data and basic data-integrity rules are more intuitive to our target audience...
Yes, so we want to see what we can do within our annotation mechanism without resorting to changes to the ERMrest protocol and basic data concepts. We've had this debate internally for some time about whether to make vocabulary concepts a first-class citizen in ERMrest. I think @cmungall's suggestions should be considered, but can we support them (at least partially) through annotations on the model, plus guidelines on how clients should interpret and use them?
@cmungall I've been pondering your proposal in this issue description and have some more comments. I think I understand better now what you are proposing.
My main question/concern is the relational integration:
A while ago, we spit-balled a way to adopt some Chado concepts for similar goals. Namely, to introduce slightly denormalized cvterm and cvtermpath tables into an ERMrest catalog using textual dbxref identifiers as keys. This ignores any broader semantic web integration in favor of just focusing on controlled vocabulary terms as values in relational tuples.
The dbxref would seem to be a reasonable serialized term format to actually exchange in ERMrest JSON or CSV data tables.
Also, we could define domain tables to subset the cvterm table (via foreign-key reference) and use further foreign key references to indicate that an application data column must store dbxrefs from a particular restricted vocabulary. Can I restate your proposal as an annotation to describe how these domain tables are populated as a set of ontological class constraints, beyond just having the enumerated term set itself?
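To make the domain-table idea concrete, here is a minimal sketch using SQLite from Python rather than an actual ERMrest catalog. The column layout, the anatomy_domain and expression tables, and the EX: dbxrefs are all hypothetical names chosen for illustration; only cvterm, cvtermpath, and the use of dbxref text keys come from the discussion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled

conn.executescript("""
-- Slightly denormalized vocabulary tables keyed by textual dbxref
CREATE TABLE cvterm (
    dbxref TEXT PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE cvtermpath (
    subject_dbxref TEXT NOT NULL REFERENCES cvterm(dbxref),
    object_dbxref  TEXT NOT NULL REFERENCES cvterm(dbxref),
    PRIMARY KEY (subject_dbxref, object_dbxref)
);
-- Hypothetical domain table: the subset of cvterm allowed for anatomy columns
CREATE TABLE anatomy_domain (
    dbxref TEXT PRIMARY KEY REFERENCES cvterm(dbxref)
);
-- Hypothetical application table whose anatomy column is restricted to the domain
CREATE TABLE expression (
    id      INTEGER PRIMARY KEY,
    gene    TEXT NOT NULL,
    anatomy TEXT NOT NULL REFERENCES anatomy_domain(dbxref)
);
""")

# Placeholder terms; the EX: identifiers are not real ontology IDs
conn.executemany(
    "INSERT INTO cvterm VALUES (?, ?)",
    [("EX:0000001", "example anatomy term"),
     ("EX:0000002", "example phenotype term")],
)
conn.execute("INSERT INTO anatomy_domain VALUES ('EX:0000001')")


def try_insert(gene, anatomy):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO expression (gene, anatomy) VALUES (?, ?)",
            (gene, anatomy),
        )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("gene1", "EX:0000001")  # term is in the anatomy domain
rejected = try_insert("gene1", "EX:0000002")  # term exists but is outside the domain
```

In this sketch the domain table is populated by hand; the annotation under discussion would instead describe how that membership is derived from ontological class constraints.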
This point was what I was reacting to in my first comment on this issue. If I understood you correctly, you made an assumption that individual values in relational tuples would serve as subjects or objects in RDF statements, and your annotations would indicate what predicate to use in those statements. Wouldn't that require the subject columns to store URIs and the object columns URIs or scalars, as appropriate for the predicate?
I think fruitful work in this area needs to address the more difficult question of entity/resource identity. To serialize ERMrest content as a graph might require the introduction of a blank node to serve as the subject for many statements transcribed from tuples in one or more tables in ERMrest. One needs to understand when a foreign key or association represents a relationship between two disjoint sets of entities versus a normalization or partitioning of one entity set into sub-graphs capturing different types of statement.
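As a rough illustration of the identity question, the sketch below transcribes one tuple into statements that share a blank-node subject. The function name, the predicate mapping, and the ex: and EX: identifiers are hypothetical; this is not an existing ERMrest feature, and it sidesteps the harder case where a foreign key partitions one entity set across tables.

```python
def row_to_triples(row, predicates, node_id):
    """Transcribe one relational tuple into (subject, predicate, object) triples.

    The tuple itself has no URI, so a blank node labelled from node_id stands
    in as the shared subject. Columns without a predicate annotation are skipped.
    """
    subject = f"_:row{node_id}"
    return [
        (subject, predicates[column], value)
        for column, value in row.items()
        if column in predicates
    ]


# Hypothetical column-to-predicate annotations and a sample tuple
predicates = {"gene": "ex:gene", "anatomy": "ex:expressedIn"}
row = {"id": 7, "gene": "EX:gene1", "anatomy": "EX:0000001"}

triples = row_to_triples(row, predicates, node_id=0)
```

Here the id column carries no annotation and is dropped, which is exactly the kind of decision that depends on whether the row denotes an entity in its own right or only a relationship between entities stored elsewhere.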
This discussion seems to have petered out long ago, so I'm going to close this for now.
Feel free to close if this is not in scope. We discussed this briefly at the OHSU meeting.
I propose extensions to the schema language in order to semantically annotate column and table definitions. Roughly:
Motivations
In addition to providing explicit formal documentation on what the meaning of the schema is, this can deliver some pragmatic advantages.
For (1), this can be used to drive data entry and curation interfaces. For example, the type field could be fed into the SciGraph autocomplete interface, so that a curator is never offered a phenotype term when entering wild-type gene expression.
For the combination of 1+2, we are essentially defining an RDF/OWL model for any ERMrest schema. This can deliver a number of advantages. From a selfish Monarch/FaceBase perspective, it means we never have to write any explicit Dipper code; we can have a generic mapping to go from any ERMrest data into our database.
I think this will have some long term advantages to the ermrest framework. It will allow the simple integration of semantic constraints into ermrest. As a guiding example, consider an anatomical structure such as the frontonasal prominence. The ontology encodes a number of semantic constraints such as: (i) the structure ceases to exist post-embryonically (in particular species we can be more precise, e.g. CS18 in human); (ii) the structure is only found in vertebrates. Relational integrity constraints are insufficiently expressive to detect invalid states (e.g. if a gene is expressed in this structure in an adult, or in a non-vertebrate), but an OWL reasoner is well-equipped for performing these checks, using axioms encoded in ontologies such as Uberon. To run these checks it's necessary to translate the data to triples using the same relations and design patterns as in the ontology. This translation can be automated if the schema is sufficiently well-described.
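To show the kind of invalid state meant here, the following is a plain-Python stand-in, not OWL reasoning: the two constraints on the frontonasal prominence are hard-coded in a hypothetical lookup table, whereas a reasoner would derive them from axioms in the ontology. The record fields, stage labels, and taxon table are assumptions made for the example.

```python
# Hypothetical, hand-written constraints; a reasoner would infer these from Uberon axioms
CONSTRAINTS = {
    "frontonasal prominence": {"stages": {"embryonic"}, "taxon": "vertebrate"},
}

# Hypothetical species-to-ancestor-taxa lookup
TAXON_ANCESTORS = {
    "human": {"vertebrate"},
    "fruit fly": set(),
}


def violations(record):
    """Return the names of the constraints that an expression record breaks."""
    rule = CONSTRAINTS.get(record["structure"])
    if rule is None:
        return []  # nothing known about this structure
    problems = []
    if record["stage"] not in rule["stages"]:
        problems.append("stage")
    if rule["taxon"] not in TAXON_ANCESTORS.get(record["species"], set()):
        problems.append("taxon")
    return problems


ok = violations({"structure": "frontonasal prominence",
                 "stage": "embryonic", "species": "human"})
adult = violations({"structure": "frontonasal prominence",
                    "stage": "adult", "species": "human"})
fly = violations({"structure": "frontonasal prominence",
                  "stage": "embryonic", "species": "fruit fly"})
```

A foreign key can guarantee that each of these values is a legal term, but not that the combination is legal, which is the gap the reasoner is meant to fill.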
Implementation
There are existing mapping languages for SQL to RDF, e.g. D2RQ, but these are the wrong level of abstraction here.
Elements of JSON-LD should definitely be used, e.g. CURIE expansion at a minimum.
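For reference, CURIE expansion in the JSON-LD style amounts to a prefix lookup plus concatenation. This is a minimal sketch, assuming a flat prefix map; the function name is hypothetical, the local ID 1234567 is a placeholder rather than a real term, and a full JSON-LD processor handles many more cases.

```python
def expand_curie(curie, context):
    """Expand 'PREFIX:local' to a full IRI using a prefix map.

    Values whose prefix is not in the map are returned unchanged.
    """
    prefix, sep, local = curie.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return curie


# Illustrative prefix map in the spirit of a JSON-LD @context
context = {"UBERON": "http://purl.obolibrary.org/obo/UBERON_"}

expanded = expand_curie("UBERON:1234567", context)
unchanged = expand_curie("OTHER:42", context)
```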
If this is a promising idea, I can draft more ideas on what the extension would look like.
I don't know how much work would be required on the ermrest framework. As a first pass, the extensions could be functionally silent, with functionality coming later.
cc @mellybelly @robes @kshefcheck @TomConlin
References
Anatomical constraints described in:
Mungall, C. J., Torniai, C., Gkoutos, G. V, Lewis, S. E., & Haendel, M. A. (2012). Uberon, an integrative multi-species anatomy ontology. Genome Biology, 13(1), R5. doi:10.1186/gb-2012-13-1-r5
How we use logical constraints and OWL reasoning for data quality checks in GO:
Mungall, C. J., Dietze, H., & Osumi-Sutherland, D. (2014). Use of OWL within the Gene Ontology. In M. Keet & V. Tamma (Eds.), Proceedings of the 11th International Workshop on OWL: Experiences and Directions (OWLED 2014) (pp. 25–36). Riva del Garda, Italy, October 17-18, 2014. doi:10.1101/010090