w3c / EasierRDF

Making RDF easy enough for most developers

Composition of named graphs #57

Open dbooth-boston opened 5 years ago

dbooth-boston commented 5 years ago

Named graphs provide a convenient way to group data. But there is no easy standard way to combine them! For example, I would like to be able to say that one graph is composed of several other graphs. Or I would like to apply a reasoner to one graph, to produce results in another graph.

The RDF Pipeline Framework is one attempt to address this (though it still needs a lot more development).

It would be good to have standard ways to express graph composition and manipulation.

"most of all I’d love to see a generic grouping mechanism that is more powerful than RDFs specification of Named Graphs, supporting nesting and composition of named graphs and identification/reification of statements in named graphs (vulgo: quints). Quints are my favoured hammer and they fit many nails".

https://lists.w3.org/Archives/Public/semantic-web/2018Nov/0170.html
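For reference, the closest thing in current standards is a query-time merge: SPARQL 1.1 FROM clauses build a query's default graph from the merge of the listed named graphs, but only for the duration of that one query. A minimal sketch (graph IRIs are placeholders, not from this thread):

```sparql
# Query-time merge of two named graphs via SPARQL 1.1 FROM clauses.
# The merge exists only while this query runs; nothing persistently
# asserts that some third graph is composed of A and B.
SELECT ?s ?p ?o
FROM <http://example.org/graph/A>
FROM <http://example.org/graph/B>
WHERE {
  ?s ?p ?o
}
```

What is missing, and what this issue asks for, is a standard way to declare that composition as data, so that stores, reasoners and pipelines can act on it.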

draggett commented 5 years ago

This should include unnamed graphs as well as named graphs. In principle, unnamed graphs will have an internal identifier which could be implicit (e.g. as in RDF*) or exposed via an API or query syntax. I am exploring some ideas for how to make this simple to use.

dbooth-boston commented 5 years ago

Agreed. I should have said that explicitly: both named and unnamed graphs.

amirouche commented 5 years ago

I would like to be able to say that one graph is composed of several other graphs. Or I would like to apply a reasoner to one graph, to produce results in another graph.

This should not be in any standard. I think the best approach to named graphs is the quad store: basically a triple store with an extra column that one might call Collection instead. Composition of collections is then an advanced use case, which boils down to symlinking one collection inside another and relying on the reasoner to traverse the different graphs depending on the query. Anyway, I have already thought about this; it is very advanced and difficult to query.
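One way to read the "symlink" idea, assuming a hypothetical composition predicate such as :includes (not a standard term) stored in the default graph, is that the query itself traverses the links and then reads each member graph:

```sparql
# Hypothetical sketch: :includes links a collection to its member graphs.
# The property path :includes+ follows nested inclusions; GRAPH ?g then
# reads the triples of each included graph.
PREFIX : <http://example.org/vocab#>

SELECT ?s ?p ?o
WHERE {
  :parentCollection :includes+ ?g .
  GRAPH ?g { ?s ?p ?o }
}
```

This works in plain SPARQL 1.1 as long as every query (or the rewriter in front of the store) knows about the convention.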

draggett commented 5 years ago

@amirouche why do you think that we don't need to standardise the ability to express graph compositions? Is it that you think the full range of requirements can be handled in some other way?

I see the potential for annotating arbitrary collections of triples. A given triple could occur in multiple collections, or collections of collections. These could be temporary or persistent. It would be useful to know that a given identifier is for a collection of triples without first needing to dereference it.

amirouche commented 5 years ago

It seems to me that existing standards already allow expressing a collection-in-collection kind of relation.

dbooth-boston commented 5 years ago

@amirouche , can you please clarify a couple of points?

I think the best approach to named graphs is the quad store

Agreed, but that is just the implementation. I have not seen standard ways to manipulate those graphs, such as specifying that one graph should be composed of two others, or that one graph should hold the result of applying a set of rules to another graph.
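The nearest existing operation seems to be SPARQL 1.1 Update's ADD, which copies the current contents of one graph into another, but it is a one-off imperative step rather than a standing declaration that the target is composed of the sources. A sketch with placeholder graph IRIs:

```sparql
# Copy the current contents of graphs A and B into graph C.
# If A or B changes later, C does not follow; there is no standard way
# to declare "C is (always) the composition of A and B".
ADD GRAPH <http://example.org/graph/A> TO GRAPH <http://example.org/graph/C> ;
ADD GRAPH <http://example.org/graph/B> TO GRAPH <http://example.org/graph/C>
```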

existing standards already allow to express collection in collection kind of relation

Which standards? Can you give an example of how this is expressed between named graphs? I'm not following what you mean.

amirouche commented 5 years ago

Sorry for the late reply.

I have not seen standard ways to manipulate those graphs, such specifying that one graph should be composed of two others, or that one graph should hold the result of applying a set of rules to another graph.

Can you give an example of how this is expressed between named graphs?

It seems to me it can be expressed in terms of a reasoner / rule engine. I am not sure anymore whether rule engines are part of RDF.
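One reading of this suggestion, using only SPARQL 1.1 Update (the graph IRIs and the subClassOf rule are illustrative), is a forward rule that materializes inferences from one graph into another:

```sparql
# "Graph 'inferred' holds the result of applying a rule to graph 'source'"
# phrased as a SPARQL 1.1 INSERT/WHERE acting as a one-shot forward rule.
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

INSERT {
  GRAPH <http://example.org/graph/inferred> { ?x rdf:type ?super }
}
WHERE {
  GRAPH <http://example.org/graph/source> {
    ?x rdf:type ?class .
    ?class rdfs:subClassOf ?super .
  }
}
```

This expresses the derivation, but nothing standard keeps the inferred graph in sync when the source graph changes.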

madnificent commented 5 years ago

We use graphs to enforce access rights using query rewriting. Being able to perform set operations on graphs whilst executing the SPARQL query would have greatly minimized the effort. Even now, it would increase the practical expressiveness of the solution.

dbooth-boston commented 5 years ago

@madnificent, I am curious about the query rewriting that you mention. Can you explain a little more about it?

madnificent commented 5 years ago

Sure thing.

Some context: we have a microservices architecture in which microservices write/read data from a SPARQL endpoint in a shared semantic model. Splitting off access rights helps microservice reuse (see On microservice reuse and authorization). The general concept was first coined at ESWC2015/USEWOD (direct link).

All information regarding the application is stored in the triplestore in a well-known manner. We have an authorization layer around the SPARQL endpoint used by our services (see: mu-authorization). When a request comes in, the microservice receives the session URI and forwards it with each request to the triplestore. Based on the session URI and information in the triplestore, the triplestore itself can identify the access rights of a user. These access rights are shared through the stack (see On sharing authorization). Based on these access rights we can rewrite the SPARQL query so that only the right content can be seen or manipulated.

The current authorization system consumes these access rights. It parses the received SPARQL query and converts it into a series of objects as per the SPARQL 1.1 EBNF. Based on the type of query, it is manipulated so that content is read from the right graphs and written to the right graphs. Reading is currently done by wrapping statements in GRAPH/UNION statements or by adding FROM statements. Writing is a bit more complex: we first materialize all triples to INSERT/DELETE by executing the WHERE blocks, interpreting them, and materializing the INSERT/DELETE template that belongs with them. The quads which need to be inserted are then compared against the access rights and scoped to be inserted into or removed from the right graphs. Pushing the changes through to other consumers in the stack allows caches to be cleared.
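Roughly, the FROM-based variant of such a rewrite might look like the following (the prefix, graph IRIs and query are invented for illustration; the actual mu-authorization output may differ):

```sparql
# Client query:   SELECT ?title WHERE { ?doc ex:title ?title }
# Rewritten so it can only read from the graphs this session may access:
PREFIX ex: <http://example.org/vocab#>

SELECT ?title
FROM <http://example.org/graphs/public>
FROM <http://example.org/graphs/org-42>
WHERE {
  ?doc ex:title ?title
}
```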

@langens-jonathan and I set out strategies for sharing information between actors at Semantics 2016. However, we ran into logical problems with respect to the options of SPARQL queries. Off the top of my head (correct me if I'm wrong @langens-jonathan), we figured out there is (active or passive) pushing of information, (active or passive) pulling of information, and a hive-mind in which people cooperate on the same dataset. In order to materialize the first four cases, there is a need to overwrite data. Hence the triplestore would need the option to state "I want to query all information from this first graph MINUS this second graph." We don't seem to be able to express this right now without materializing the data in the triplestore.
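For a single triple pattern this difference can be approximated today with FILTER NOT EXISTS (graph IRIs invented), but it has to be repeated per pattern and is not a graph-level set operation, which is the gap being described:

```sparql
# Approximation of "everything in graph A MINUS graph B" for one pattern.
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://example.org/graph/A> { ?s ?p ?o }
  FILTER NOT EXISTS { GRAPH <http://example.org/graph/B> { ?s ?p ?o } }
}
```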