jamesaoverton commented 3 years ago

RDFTab Design Steps

IRI, blank node, CURIE, label
RDF subject predicate object
really subject predicate object datatype
mention stanza column
add graph column
RDF-JSON for objects
RDF reification column
OWL annotation column
OFS for objects

We use RDF and OWL and SQL. How can we best use them together?

The elements of RDF are IRIs, blank nodes, and literals. IRIs are long and inconvenient, so let's use prefixes and CURIEs. We'll wrap IRIs in angle brackets to distinguish them from CURIEs. We'll say that an "ID" is one of a CURIE, IRI, or blank node.

RDF triples consist of a subject, predicate, and object. The subject can be an ID. The predicate can be a CURIE or IRI. The object can be one of four things: an ID, a plain literal, a typed literal, or a language tagged literal. We'll define a "datatype" to be:

"id" for an ID
"plain" for a plain literal
a CURIE or IRI for a typed literal
"@" and a language tag for a language tagged literal

In most RDF serializations the datatype comes after the literal content, e.g. "123"^^xsd:integer. But it's often better to read the datatype before you read the literal content, and while the 'object' is often a longish string, the datatype is usually short. So I think it's more convenient to put the datatype column before the object column.
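As a rough illustration of the ID and datatype conventions above, here is a minimal Python sketch. The function names `classify_id` and `datatype_cell` are hypothetical and not part of rdftab; the sketch also assumes blank nodes are written with the usual `_:` prefix.

```python
def classify_id(term: str) -> str:
    """Classify an ID string as 'iri', 'blank', or 'curie'."""
    if term.startswith("<") and term.endswith(">"):
        return "iri"      # IRIs are wrapped in angle brackets
    if term.startswith("_:"):
        return "blank"    # assumption: blank nodes use the _: prefix
    return "curie"        # everything else is treated as a CURIE


def datatype_cell(is_id=False, literal_type=None, language=None) -> str:
    """Return the value of the datatype column for an RDF object."""
    if is_id:
        return "id"
    if language:
        return "@" + language   # language tagged literal
    if literal_type:
        return literal_type     # CURIE or IRI of a typed literal
    return "plain"              # plain literal


# "123"^^xsd:integer becomes datatype "xsd:integer", object "123"
integer_row = (datatype_cell(literal_type="xsd:integer"), "123")
```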

Now we can represent triples in a table with four columns: subject, predicate, datatype, object. These cells will never be NULL.
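A minimal sketch of that four-column table, using Python's built-in `sqlite3` module. The table name `statements` and the example rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table name; every column is NOT NULL, as described above.
conn.execute("""
    CREATE TABLE statements (
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        datatype  TEXT NOT NULL,
        object    TEXT NOT NULL
    )
""")
rows = [
    ("ex:foo", "rdf:type", "id", "owl:Class"),              # object is an ID
    ("ex:foo", "rdfs:label", "plain", "foo"),               # plain literal
    ("ex:foo", "ex:count", "xsd:integer", "123"),           # typed literal
    ("ex:foo", "rdfs:comment", "@en", "An example class"),  # language tagged
]
conn.executemany("INSERT INTO statements VALUES (?, ?, ?, ?)", rows)

# All literal-valued triples for one subject.
literals = conn.execute(
    """SELECT predicate, object FROM statements
       WHERE subject = ? AND datatype != 'id'
       ORDER BY predicate""",
    ("ex:foo",),
).fetchall()
```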

We often want all the triples associated with some named subject. This is useful for term browsers and term extraction. But RDF includes linked lists and reification, and OWL includes nested structures and annotation axioms, all of which use a lot of blank nodes.

One way to keep these blank node structures together is to add a 'stanza' column which names the top-level subject for a set of triples. You can then select all the rows for a given stanza and have a subgraph with most of the triples relevant to your term. We won't need the stanza column if we use JSON structures mentioned below.
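To make the stanza idea concrete, here is a small sketch, again with an assumed `statements` table and invented example data. The blank node `_:b1` carries the stanza of the named term it belongs to, so one query returns the whole subgraph.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE statements (
        stanza    TEXT NOT NULL,
        subject   TEXT NOT NULL,
        predicate TEXT NOT NULL,
        datatype  TEXT NOT NULL,
        object    TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?, ?, ?)",
    [
        ("ex:foo", "ex:foo", "rdfs:subClassOf", "id", "_:b1"),
        # Rows about the blank node share the stanza of the named subject.
        ("ex:foo", "_:b1", "rdf:type", "id", "owl:Restriction"),
        ("ex:foo", "_:b1", "owl:onProperty", "id", "ex:part-of"),
        ("ex:foo", "_:b1", "owl:someValuesFrom", "id", "ex:bar"),
        ("ex:bar", "ex:bar", "rdfs:label", "plain", "bar"),
    ],
)

# Everything relevant to ex:foo, including its blank node structure.
stanza_rows = conn.execute(
    "SELECT subject, predicate, object FROM statements WHERE stanza = ?",
    ("ex:foo",),
).fetchall()
```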

RDF also includes named graphs. We can add a graph column with an ID or "default". Note that OWL does not support named graphs.

Blank nodes can be difficult to work with. One of the many advantages of Turtle syntax is that it hides blank nodes behind [], {}, and () syntax. We can do something similar using simple JSON structures. Let's represent an RDF object by a JSON object. Where we would write { ex:o1 } in Turtle, let's write this JSON [{"object": "ex:o1", "datatype": "id"}] and call it an "object set". Where we would write [ ex:p1 ex:o1 ] in Turtle, let's write this JSON {"ex:p1": [{"object": "ex:o1", "datatype": "id"}]} and call it a "predicate map". When the 'object' column holds a predicate map, then the datatype column must be 'predicate-map'.
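Here is a rough sketch of how a self-contained blank node could be collapsed into a predicate map. The `predicate_map` function is hypothetical, and it deliberately ignores nested blank nodes, which a real implementation would have to handle.

```python
import json
from collections import defaultdict


def predicate_map(node, rows):
    """Collapse the rows whose subject is `node` into a predicate map.

    `rows` are (subject, predicate, datatype, object) tuples. Nested
    blank nodes are not expanded in this sketch.
    """
    result = defaultdict(list)
    for subject, predicate, datatype, obj in rows:
        if subject == node:
            # Each predicate maps to an object set (a list of objects).
            result[predicate].append({"object": obj, "datatype": datatype})
    return dict(result)


rows = [("_:b1", "ex:p1", "id", "ex:o1")]
# The thick row that replaces _:b1 would then hold this JSON in its
# object column, with "predicate-map" in its datatype column.
object_cell = json.dumps(predicate_map("_:b1", rows))
```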

As mentioned above, RDF includes reification, which allows us to make statements about a triple. We could eliminate more blank nodes by keeping the RDF reification in the row of the target triple. So we will add a "metadata" column which will contain NULL or a predicate map JSON structure capturing zero or more triples with this triple as the subject. This is similar to writing RDF*. OWL has a similar "annotation axiom" system. We'll add an "annotation" column and handle it in the same way.
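A small sketch of how the "metadata" and "annotation" cells might be read back as statements about the row's triple. The row contents and the `statements_about` helper are invented for illustration.

```python
import json

# One thick row: the annotation cell holds a predicate map as JSON text.
row = {
    "subject": "ex:foo",
    "predicate": "rdfs:label",
    "datatype": "plain",
    "object": "foo",
    "metadata": None,
    "annotation": json.dumps(
        {"rdfs:comment": [{"object": "A comment on the label", "datatype": "plain"}]}
    ),
}


def statements_about(row, column):
    """Expand a metadata or annotation cell into statements about the triple."""
    cell = row[column]
    if cell is None:
        return []  # NULL means nothing is said about this triple
    target = (row["subject"], row["predicate"], row["object"])
    return [
        (target, predicate, entry["datatype"], entry["object"])
        for predicate, entries in json.loads(cell).items()
        for entry in entries
    ]
```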

Finally, the RDF representation of OWL is hard to read. OWL Functional Syntax (OFN) is relatively easy to read, but we don't want to have to parse and render it when we're working in SQL. So let's use a JSON array, and move the OFN keyword to the first element of the array, like an S-Expression from LISP. For example ["ObjectSomeValuesFrom","ex:part-of","ex:bar"]. We'll call this an "OWL Functional S-Expression" (OFS). When the 'object' column holds an OFS, then the datatype column must be 'OFS'.
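Because an OFS is just a nested array, rendering it back to OFN takes very little code. A minimal sketch follows, with a hypothetical `ofs_to_ofn` function that assumes every array starts with its OFN keyword and every other element is an ID string or a nested array.

```python
import json


def ofs_to_ofn(expr):
    """Render an OFS (nested list) as an OWL Functional Syntax string."""
    if isinstance(expr, list):
        keyword, *args = expr  # the keyword is always the first element
        return f"{keyword}({' '.join(ofs_to_ofn(arg) for arg in args)})"
    return expr  # an ID is rendered as-is


ofs = json.loads('["ObjectSomeValuesFrom","ex:part-of","ex:bar"]')
ofn = ofs_to_ofn(ofs)
```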

What have we got? Seven columns: graph, subject, predicate, datatype, object, metadata, annotation. The ability to represent anything encoded in RDF 1.1. Very few blank nodes to worry about. A convenient syntax for OWL. RDF and OWL and SQL living together happily.

What are we missing? Mainly a way to reason "inside" the database.

Easier RDF

https://github.com/w3c/EasierRDF discusses ways to make RDF easier. I think this design addresses some of those things, but if you see something that's within easy reach, please mention it.

Design Decisions

datatype column before object column?
settle on a convention for the datatype keywords
omit "datatype": "id" from JSON as redundant?
use one-letter names? gspdoma
extend OFS beyond OFN:
- RDF lists
- anything? extensible?
graphs as tables instead of a column?

Feedback appreciated from anyone, especially @cmungall @beckyjackson @lmcmicu @ckindermann.

cmungall commented 3 years ago

Remind me - for the proposal to use JSON objects in place of blank node syntax, would these still be subjects in other statements?

E.g.

s	p	o
["ObjectSomeValuesFrom","ex:part-of","ex:bar"]	rdf:type	owl:Restriction
["ObjectSomeValuesFrom","ex:part-of","ex:bar"]	owl:onProperty	ex:part-of
["ObjectSomeValuesFrom","ex:part-of","ex:bar"]	owl:someValuesFrom	ex:bar

I really like being able to use views or otherwise get at existentials without JSON parsing, e.g.

SELECT sc.subject AS subclass, svf.in_property, svf.filler FROM rdfs_subclass_of AS sc JOIN some_value_from AS svf ON (sc.object = svf.subject)

If this is still planned then in principle I don't care about the structure of the blank node / anonymous expression column values. Although I would like to check performance implications. I assume these are all interned.

cmungall commented 3 years ago

Re: graphs as tables vs columns. The same question could be asked of predicates. There is utility in having a table for rdfs:subClassOf etc.

However, I strongly prefer keeping the base generic, and allowing people to either make views or derived tables for different slices according to their use case. In fact it should be straightforward to write procedures in either a normal programming language or in something like plpgsql that auto-create views and tables for graph-perspectives and predicate-perspectives (you could do class perspectives too, e.g. select from owl_class, or select from obi_nnnnnn). But having the base be generic keeps things maximally simple and flexible.
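A rough sketch of that kind of procedure for predicate-perspectives, in Python over SQLite. The `statements` table, the `view_name` naming scheme, and `create_predicate_views` are all assumptions; note that this naming scheme could map two different predicates to the same view name.

```python
import re
import sqlite3


def view_name(predicate):
    """Turn a predicate such as 'rdfs:subClassOf' into a SQL-friendly name."""
    return re.sub(r"[^A-Za-z0-9]+", "_", predicate).strip("_").lower()


def create_predicate_views(conn, table="statements"):
    """Create one view per distinct predicate in a generic table."""
    # `table` is interpolated directly, so it must be a trusted name.
    predicates = [
        row[0] for row in conn.execute(f"SELECT DISTINCT predicate FROM {table}")
    ]
    names = []
    for predicate in predicates:
        name = view_name(predicate)
        # CREATE VIEW cannot take bound parameters, so escape the literal.
        literal = predicate.replace("'", "''")
        conn.execute(
            f'CREATE VIEW IF NOT EXISTS "{name}" AS '
            f"SELECT subject, datatype, object FROM {table} "
            f"WHERE predicate = '{literal}'"
        )
        names.append(name)
    return names


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (subject TEXT NOT NULL, predicate TEXT NOT NULL, "
    "datatype TEXT NOT NULL, object TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?, ?)",
    [
        ("ex:foo", "rdfs:subClassOf", "id", "ex:bar"),
        ("ex:foo", "rdfs:label", "plain", "foo"),
    ],
)
created = create_predicate_views(conn)
```

The base table stays generic; the views are derived and can be dropped or regenerated at any time.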

jamesaoverton commented 3 years ago

I have not been planning to use JSON in the subject column. GCIs would still have blank nodes as subjects, for example.

There may be edge cases I’m missing, but the basic idea is just to collapse self-contained blank node structures into JSON, which should be equivalent to Turtle’s syntactic sugar.

jamesaoverton commented 3 years ago

Maybe I'm not understanding. In any case, these are the Thick Triples Examples we're working on.

jamesaoverton commented 3 years ago

I think there are practical benefits to keeping each graph in a separate table, mainly keeping indexes small for small graphs. Then you would JOIN the tables you want to query, or have a view, or whatever. I guess if you always wanted to query over all graphs then one table would be better.

Of course it's better to measure than to guess.
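For comparison, here is a minimal sketch of the table-per-graph layout with a view that restores the graph column. The table names, graph IDs, and view name are invented for illustration, and the sketch says nothing about which layout performs better.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("graph_a", "graph_b"):
    conn.execute(
        f"CREATE TABLE {table} (subject TEXT NOT NULL, predicate TEXT NOT NULL, "
        "datatype TEXT NOT NULL, object TEXT NOT NULL)"
    )
conn.execute("INSERT INTO graph_a VALUES ('ex:foo', 'rdfs:label', 'plain', 'foo')")
conn.execute("INSERT INTO graph_b VALUES ('ex:bar', 'rdfs:label', 'plain', 'bar')")

# A view over all graphs, with the graph ID supplied as a constant column.
conn.execute("""
    CREATE VIEW all_graphs AS
        SELECT 'ex:graph-a' AS graph, * FROM graph_a
        UNION ALL
        SELECT 'ex:graph-b' AS graph, * FROM graph_b
""")

labels = conn.execute(
    """SELECT graph, subject, object FROM all_graphs
       WHERE predicate = 'rdfs:label'
       ORDER BY subject"""
).fetchall()
```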

cmungall commented 3 years ago

I have not been planning to use JSON in the subject column. GCIs would still have blank nodes as subjects, for example. There may be edge cases I’m missing, but the basic idea is just to collapse self-contained blank node structures into JSON, which should be equivalent to Turtle’s syntactic sugar.

OK, so it sounds like if we wanted to query ?x subClassOf ?r some ?b (i.e., the same as https://cmungall.github.io/semantic-sql/OwlSubclassOfSomeValuesFrom/) we would need to parse JSON?

jamesaoverton commented 3 years ago

Yes, you'd use SQLite's JSON operators.
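A minimal sketch of what such a query could look like, using `json_extract` from SQLite's JSON functions through Python's `sqlite3` module. It assumes a `statements` table with the four core columns, assumes the SQLite build includes the JSON functions, and uses invented example rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (subject TEXT NOT NULL, predicate TEXT NOT NULL, "
    "datatype TEXT NOT NULL, object TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO statements VALUES (?, ?, ?, ?)",
    [
        ("ex:foo", "rdfs:subClassOf", "OFS",
         '["ObjectSomeValuesFrom","ex:part-of","ex:bar"]'),
        ("ex:foo", "rdfs:subClassOf", "id", "ex:baz"),
        ("ex:foo", "rdfs:label", "plain", "foo"),
    ],
)

# ?x subClassOf ?r some ?b
# The CASE guard ensures json_extract only runs on cells that hold JSON,
# since calling it on a plain string such as 'foo' raises an error.
existentials = conn.execute("""
    SELECT subject AS subclass,
           json_extract(object, '$[1]') AS on_property,
           json_extract(object, '$[2]') AS filler
    FROM statements
    WHERE predicate = 'rdfs:subClassOf'
      AND CASE WHEN datatype = 'OFS'
               THEN json_extract(object, '$[0]')
          END = 'ObjectSomeValuesFrom'
""").fetchall()
```

If the filler is itself a nested expression, `json_extract` returns it as JSON text, which would need another round of extraction.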

Thanks for the feedback. I think we just need to set up a thorough comparison for a whole bunch of use cases.

lmcmicu commented 3 years ago

This all seems good to me. Remind me about the reason for one letter column names ... is it just to save space (I can't imagine it would save all that much), or is there another reason?

jamesaoverton commented 3 years ago

Brevity was the only reason for the one-letter names, but clarity is more important, so I think we'll stick with the one-word column names.

I started work on this https://github.com/ontodev/tooling-comparison which I hope will be useful to make some comparisons and guide some design decisions.

ontodev / rdftab.rs

(Re)design Discussion #18
