knowsys / rulewerk

Java library based on the VLog rule engine
Apache License 2.0

Integration with existing triple stores and SPARQL engines #221

Open marco-calautti opened 1 year ago

marco-calautti commented 1 year ago

From what I can see, VLog/Rulewerk do not support SPARQL as a query language; data can only be queried with positive literals obtained from dedicated rules of the ontology. Moreover, when accessing facts from a triple store (only Trident, it seems), these are necessarily pulled as "triples", and thus one needs to manually convert these reified atoms into proper atoms over the ontology schema.

I would like to ask whether there are better ways to incorporate Rulewerk/VLog as an OWL 2 RL reasoner into existing architectures that offer the above facilities (i.e., a SPARQL query engine and a triple store), such as Jena, or whether there are any plans to implement this kind of support.

larry-gonzalez commented 1 year ago

Not really

While VLog and Rulewerk don't implement SPARQL, both of them provide a mechanism (@source in Rulewerk, to be used within the rule files) to associate the output of a SPARQL query (against a SPARQL endpoint) with the extension of a predicate name.

We provide some examples here

In particular, have a look at the rules that our doid example* uses.

* Carral, Dragoste, González, Jacobs, Krötzsch, Urbani. VLog: A Rule Engine for Knowledge Graphs. ISWC 2019.

CerielJacobs commented 1 year ago

On 19 Nov 2022, at 19:13, Larry González wrote:

Not really

While VLog and Rulewerk don't implement SPARQL, both of them provide a mechanism (@source in Rulewerk, to be used within the rule files) to associate the output of a SPARQL query (against a SPARQL endpoint) with the extension of a predicate name.

Actually, the command line interface of VLog does support SPARQL (1.0) queries, through the “query” command.

marco-calautti commented 1 year ago

Actually, the command line interface of VLog does support SPARQL (1.0) queries, through the “query” command.

This is great actually! I guess that rulewerk does not provide an interface for accessing that command, right?

Regarding gathering data from existing triple stores, I see that Rulewerk has a class for converting OWL files to rules and facts, which means it is able to convert triples directly to facts; e.g., a triple iri a :Person in a given OWL file is directly converted to a fact :Person(iri). However, the @source keyword does not do that; it just puts all triples into a ternary predicate. Is there any other way to make @source behave like the OWL-to-rules converter? Or do I need to manually write conversion rules from the ternary predicate to the actual schema?
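To make the question concrete, the manual conversion I have in mind would look something like the following sketch in Rulewerk rule syntax (ex: is a hypothetical prefix for my vocabulary and endpoint; the predicate names are illustrative), with one rule per class and per property:

@source triple[3]: sparql(ex:sparql, "s,p,o", '''?s ?p ?o .''') .
person(?S) :- triple(?S, rdf:type, ex:Person) .
worksFor(?S, ?O) :- triple(?S, ex:worksFor, ?O) .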

If this is the case, then this is actually the only roadblock I see to using VLog as an actual OWL reasoner. Directly accessing a triple store with proper fact conversion would be super handy!

Thanks for your time.

larry-gonzalez commented 1 year ago

I guess there are several points going on here:

Actually, the command line interface of VLog does support SPARQL (1.0) queries, through the “query” command.

Thanks for pointing this out. I guess my answer was incomplete in several ways :)

I guess that rulewerk does not provide an interface for accessing that command, right?

As far as I can see, Rulewerk does not provide a mechanism to answer SPARQL queries after materialization.

I see that rulewerk has a class for converting OWL files to rules and facts, which means it is able to convert triples directly to facts, e.g., a triple iri a :Person in a given OWL file is directly converted to a fact :Person(iri).

True

However, the @source keyword does not do that, but just puts all triples in a ternary predicate.

No. First, you don't need to load all the data, only the output of a SPARQL query (unless you ask for all ?s ?p ?o). Second, you don't need to create triples: the new predicate and its associated extension can have any (matching) arity.

If we look at the rules of our doid example again, in lines 6 to 9 we create a unary predicate from a SPARQL query.

For a higher-arity example, please consider the following @source statement

@source cities[5]: sparql(wdqs:sparql, "countryLabel,cityLabel,area,population,coordinates",
   '''?city wdt:P31 wd:Q515 ;
          wdt:P17 ?country ;
          wdt:P2046 ?area ;
          wdt:P1082 ?population ;
          wdt:P625 ?coordinates ;
          wdt:P1705 ?cityLabel .
      ?country wdt:P1705 ?countryLabel .''') .

which creates a predicate called cities of arity 5 directly from the output of the query:

SELECT ?countryLabel ?cityLabel ?area ?population ?coordinates
WHERE 
{
  ?city wdt:P31 wd:Q515 ;
        wdt:P17 ?country ;
        wdt:P2046 ?area ;
        wdt:P1082 ?population ;
        wdt:P625 ?coordinates ;
        wdt:P1705 ?cityLabel .

  ?country wdt:P1705 ?countryLabel .
}

But note that as long as a triple store exposes a SPARQL endpoint, Rulewerk/VLog can collect the data from it.

Please also note that we have a Matrix support channel where you can talk directly with the developers :)

marco-calautti commented 1 year ago

But I need to say that as long as a triple store makes available a SPARQL endpoint, then Rulewerk/VLog could collect the data

Thanks a lot for the answer! Regarding the last point, that's precisely the limitation I think VLog has at the moment: yes, we can get specific facts from a SPARQL endpoint, but if my ontology has, say, 100 classes and I want to import all individuals of all 100 classes as facts, I need either 100 @source statements to explicitly fill each predicate, or a single @source statement importing all triples plus 100 rules that convert the ternary predicate into actual facts. In both cases, importing data that uses a large vocabulary is inconvenient at the moment, which makes using VLog as a reasoner with complex ontologies difficult.

I believe that what would make VLog much more practically usable is a construct by which, e.g., a single @source statement converts each triple of the form "individual a class" directly into a fact class(individual). (This is precisely what the OWL-to-rules converter does, but the @source command does not.)

mkroetzsch commented 1 year ago

Exposing the VLog SPARQL feature through Rulewerk would be very useful: now tracked here #222

mkroetzsch commented 1 year ago

@marco-calautti The kind of translation that you have in mind is supported by the OWL module of Rulewerk. It will create rules and facts from OWL RL ontologies. This functionality is available in Java but could be added to the Rulewerk client or maybe even to the syntax declarations. However, that may not be the best approach for larger ontologies: even if this saves you from using 100 import rules for 100 classes, rule reasoners (essentially all of them, I think) will not like it if you use thousands of rules and predicates. For example, the SNOMED CT ontology has some 300,000 classes and a similar number of axioms. At this scale, what you are trying is not going to work well.

Instead, a better way is to represent classes as constants (not predicates) and use a predicate to relate individuals to classes. One can then implement a reasoner with a few rules (around 10-20), and this will scale to hundreds of thousands of classes and axioms (the OWL axioms, too, would be represented as facts, one per axiom, rather than as rules). A worked example of how to do this in OWL EL was described in our ECAI tutorial; see the second session (the Rulewerk rules are in the zip file). For OWL RL, one can use similar rules. For instance retrieval, the rules can easily be obtained by rewriting the official OWL RL rules in the W3C specification (encoding classes and properties in a constant position): the rules are exactly as in the specification; one just has to use a suitable encoding of their premises and conclusions in Datalog.
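As a rough sketch in Rulewerk rule syntax (the predicate names inst, subClassOf, dom, and triple are my illustrative choices, not fixed names), the instance-retrieval rules then mirror the W3C rules cax-sco and prp-dom:

inst(?X, ?D) :- inst(?X, ?C), subClassOf(?C, ?D) .
inst(?X, ?C) :- dom(?P, ?C), triple(?X, ?P, ?Y) .

Here an axiom like "Student is a subclass of Person" is stored as a fact subClassOf(student, person) rather than as its own rule, which is what keeps the rule set small and independent of the schema size.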

If you want to do terminological reasoning (computing OWL RL subclass entailments), a suitable rule set can be found in my paper The Not-So-Easy Task of Computing Class Subsumptions in OWL RL (ISWC 2012). The rules there use the notation with a horizontal bar instead of ->, but otherwise they are perfectly good Datalog.

I think such an approach should scale much better to ontologies with a larger schema.

marco-calautti commented 1 year ago

Your approach seems very reasonable. The only caveat I see is that this way of encoding classes completely defeats the purpose of using SPARQL queries, as one cannot query the materialized facts using the actual schema, but only using the reified one. I might be wrong about this, but an ontology with 300,000 classes has simply not been conceived for ABox reasoning in the first place, only for TBox reasoning (which is what EL was designed for). So such ontologies might not even fall within the use cases of a Datalog engine.

But I agree that one could make vlog scale well in these cases with your proposed solution, so of course it is up to you if you ever want to implement such a feature!

-EDIT- (Actually, the best of both worlds would be to implement your approach for encoding rules, and then have a translation layer that converts SPARQL queries over the ontology schema back to the triple-based one).

mkroetzsch commented 1 year ago

I don't see why you could not use SPARQL for querying OWL if you encode classes as constants. It rather seems the opposite is the case: you cannot use SPARQL if you map classes to predicates. The reason is that SPARQL is an RDF query language that cannot represent unary and binary predicates syntactically. If you want to query something in SPARQL, you need to first map it to an RDF encoding, and this will lead to classes occurring in the position of RDF resources (i.e., constants). Doing this is easy with a few "output rules" if your classes are already constants in Datalog. If you use Datalog predicates for OWL classes, then you will again need hundreds of rules.
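For illustration (again with hypothetical predicate names such as inst and subClassOf), such "output rules" could be as simple as:

triple(?X, rdf:type, ?C) :- inst(?X, ?C) .
triple(?C, rdfs:subClassOf, ?D) :- subClassOf(?C, ?D) .

After materialization, a SPARQL pattern like ?x rdf:type :Person would then match the derived triple facts directly.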

In fact, the W3C RDF encoding of OWL is very much an encoding that treats classes as constants already.

(Of course, there are all kinds of challenges if you want to handle arbitrary SPARQL queries that use the W3C OWL vocabulary, but these issues occur in any case. I suppose we are talking here about queries that correspond to conjunctive queries only, without any meta-querying.)

mkroetzsch commented 1 year ago

P.S. For us, it is also interesting to understand the envisioned usage scenario. For example, I was curious to understand why you would prefer the OWL loading feature in a @source declaration instead of just using a short Java program that calls the OWL module code on its input. Do you expect your users to interact with Rulewerk syntax directly, so that no wrapper code that calls the OWL module would be possible (or would be a burden to the user)?

marco-calautti commented 1 year ago

My use case is the following:

I have an OWL ontology over a certain vocabulary, and I would like to expose a SPARQL endpoint that allows users to query the materialization of the rule version of this ontology, together with facts coming from a triple store, using the ontology vocabulary. So, for example, I could write SELECT ?x WHERE { ?x a :Person }.

For this, in the current state of the rulewerk library/vlog, I believe I would need to:

  1. Convert the OWL ontology to rules using the OWL to rules class. This will give me a set of existential rules which work explicitly on the ontology vocabulary.
  2. Let VLog have access to the triples of the triple store (e.g., Jena). For this, I must use one @source command, against a SPARQL endpoint (e.g., Jena's), for each predicate (classes and properties).
  3. Run the chase over the resulting knowledge base, and expose the vlog SPARQL query facility.

If I instead use a single @source command and the ternary "triple" predicate approach, I do not see how I can make SPARQL queries like the one above work over the materialized instance, which is just a set of facts of the form triple(a,b,c). Unless I am missing something fundamental here.

So, assuming 2) is the only way to achieve what I need, I would need to write a (simple) Java program that automatically generates all the needed @source commands. But I felt that, as a reasoner, VLog could integrate better with triple stores and let me avoid this explicit @source command generation, at least with its built-in triple store Trident. That's it.
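For illustration, the generated @source commands would look something like this (ex: being a hypothetical prefix for my vocabulary and endpoint), with one statement per class and per property:

@source person[1]: sparql(ex:sparql, "x", '''?x rdf:type ex:Person .''') .
@source worksFor[2]: sparql(ex:sparql, "x,y", '''?x ex:worksFor ?y .''') .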

(The above workflow is similar to what happens in other triple stores that support OWL reasoning, such as Jena, GraphDB, Stardog, etc.: I feed the system the OWL ontology and facts in the form of triples, which are then stored on disk by the triple store (e.g., Jena uses TDB), and then query the inferred knowledge using SPARQL.)

marco-calautti commented 1 year ago

For the time being, I solved the above by following your suggestion: I extended the OWLToRulesConverter class so that it can also convert OWL ontologies to rules over a single ternary predicate (so the original ontology classes and roles are now encoded as constants). In this way, I can use a single @source to load triples from Trident. However, without SPARQL querying in Rulewerk, I am limited to PositiveLiterals for queries. Having SPARQL querying directly in Rulewerk would definitely be super helpful!
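To illustrate the encoding (with a hypothetical ex: prefix): a subclass axiom becomes a plain fact over the ternary predicate, and a single generic rule replaces all the per-class ones, e.g.:

triple(ex:Student, rdfs:subClassOf, ex:Person) .
triple(?X, rdf:type, ?D) :- triple(?X, rdf:type, ?C), triple(?C, rdfs:subClassOf, ?D) .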

Thanks for your time.