knocean / knode

Knowledge Development Environment
BSD 3-Clause "New" or "Revised" License

Implement Jena DatasetGraph #63

Open jamesaoverton opened 6 years ago

jamesaoverton commented 6 years ago

In #62 we're working on an LDF server, backed by a vector-of-maps representation of quads that should also fit nicely into a SQL table. By pointing an LDF client at this server, we'll be able to run SPARQL queries. This is a nice solution for many use cases, such as low server loads for unauthenticated users, and I'm excited to try it out. But for many kinds of queries I expect it to be very slow.

Now I want to use that same vector-of-maps with Apache Jena, so we can run SPARQL on the same data more efficiently (and with higher server loads). The basic approach is implemented by LinkedDataFragments/Client.Java. LDF is designed to make a lot of small queries in parallel, so it's not a good fit for Jena. Our implementation will have a single local datastore, so it can be much simpler and should work just fine.

This is the basic Clojure code for using Jena to load a graph and run a SPARQL query:

(ns org.knotation.jena
  (:import (org.apache.jena.riot RDFDataMgr)
           (org.apache.jena.rdf.model ModelFactory)
           (org.apache.jena.query QueryFactory QueryExecutionFactory)
           (org.apache.jena.graph NodeFactory Triple)
           (org.apache.jena.graph.impl GraphBase)))

(def g (RDFDataMgr/loadGraph "junk/root.owl"))
(def m (ModelFactory/createModelForGraph g))
(def q (QueryFactory/create "select * where {?s <http://example.com/p> ?o ; ?p \"1\"}"))
(->> (QueryExecutionFactory/create q m)
     (.execSelect)
     iterator-seq
     first)

In LinkedDataFragmentGraph.java they extend Jena's GraphBase to make their own Graph class. GraphBase takes care of most of the work; you just need to implement graphBaseFind(Node subject, Node predicate, Node object). The return type is Jena's ExtendedIterator<Triple> rather than a plain Iterator, so I used WrappedIterator around a reified Iterator. Here's a proof-of-concept implementation that worked for me:

(defn make-triple
  []
  (new Triple
       (NodeFactory/createURI "http://example.com/s")
       (NodeFactory/createURI "http://example.com/p")
       (NodeFactory/createLiteral "1")))

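(defn seq->iterator
  "Wrap a Clojure sequence in a java.util.Iterator. Not thread-safe."
  [xs]
  (let [xs (atom xs)]
    (reify
      java.util.Iterator
      ;; seq is nil for an empty sequence, so nothing has to be counted
      (hasNext [this] (boolean (seq @xs)))
      (next [this]
        (when-not (seq @xs) (throw (java.util.NoSuchElementException.)))
        (let [x (first @xs)] (swap! xs rest) x))
      ;; the graph is read-only, so removal is refused rather than ignored
      (remove [this] (throw (UnsupportedOperationException.))))))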

(def g
  (proxy [GraphBase] []
    (graphBaseFind
      ([s p o]
       (println "THREE" s p o)
       (org.apache.jena.util.iterator.WrappedIterator/create
        (seq->iterator
         [(make-triple)]))))))

That code generates a Graph that Jena can run SPARQL against. This is just for triples. I also want to support quads, so we need to implement DatasetGraphBase. That will require returning Graphs as above. Note that Jena's interfaces for Triples and Quads differ in several ways (for historical reasons, I guess).
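As a rough sketch of the quads-to-Graphs step: if each quad is a map with a graph key, grouping by that key gives the triples each named Graph would have to serve. The keys :g :s :p :o and the name triples-by-graph are assumptions made for this example, not necessarily what #62 uses, and no Jena classes are involved yet.

```clojure
;; Hypothetical quad maps: :g names the graph, :s :p :o form the triple.
(def quads
  [{:g "http://example.com/g1" :s "http://example.com/s" :p "http://example.com/p" :o "1"}
   {:g "http://example.com/g1" :s "http://example.com/s" :p "http://example.com/q" :o "2"}
   {:g "http://example.com/g2" :s "http://example.com/t" :p "http://example.com/p" :o "3"}])

(defn triples-by-graph
  "Map from graph name to a vector of that graph's triples (each quad minus :g)."
  [quads]
  (reduce (fn [acc quad]
            (update acc (:g quad) (fnil conj []) (dissoc quad :g)))
          {}
          quads))

;; Two of the three example quads belong to g1.
(count (get (triples-by-graph quads) "http://example.com/g1")) ; 2
```

Each value in the resulting map could then, presumably, back one GraphBase instance of the kind shown earlier.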

There are details to figure out. The sketch of seq->iterator above isn't ideal. The first implementation should probably be backed by an atom storing the vector-of-maps. I would also like an implementation backed by SQL.
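One possible simplification for seq->iterator, offered as a sketch and not a tested replacement: Clojure's vectors and sequences already implement java.lang.Iterable, so calling .iterator on them yields a java.util.Iterator without any hand-written state. The helper name coll->iterator is made up for this example.

```clojure
(defn coll->iterator
  "Return a java.util.Iterator over a Clojure collection.
  Hypothetical helper: relies on Clojure seqs and lists implementing Iterable."
  ^java.util.Iterator [coll]
  ;; (seq coll) is nil for nil or empty input, so fall back to the empty list,
  ;; which is itself Iterable.
  (.iterator ^Iterable (or (seq coll) ())))

(def it (coll->iterator [:a :b]))
(.hasNext it) ; true while elements remain
(.next it)    ; :a
```

The iterator that comes back is Clojure's own, so it is read-only; that seems to fit a graph that never removes triples.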

For the current purposes, our graphs should be immutable, so we don't need to implement Jena's methods for adding and removing triples.
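A read-only, atom-backed store of the kind described above might look roughly like this. It is only a sketch: the keys :g :s :p :o, the use of nil as a wildcard, and the names matches? and find-quads are all assumptions for illustration. Storing every term as a plain string also glosses over the difference between IRIs and literals.

```clojure
;; Hypothetical quad shape: one map per quad, held in an atom.
(def store
  (atom [{:g "http://example.com/g" :s "http://example.com/s"
          :p "http://example.com/p" :o "1"}
         {:g "http://example.com/g" :s "http://example.com/s"
          :p "http://example.com/q" :o "2"}]))

(defn matches?
  "True when every non-nil value in pattern equals the quad's value
  for the same key. nil acts as a wildcard."
  [pattern quad]
  (every? (fn [[k v]] (or (nil? v) (= v (get quad k)))) pattern))

(defn find-quads
  "Return the quads currently in the store that match pattern."
  [store pattern]
  (filter #(matches? pattern %) @store))

;; Only the first quad has object "1".
(count (find-quads store {:s "http://example.com/s" :p nil :o "1"})) ; 1
```

A graphBaseFind implementation could presumably translate Jena's Node.ANY into nil, call something like find-quads, and convert the resulting maps into Triples. Since there are no add or remove functions, the store stays effectively immutable from Jena's side.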

jamesaoverton commented 6 years ago

I built my proof-of-concept using [org.apache.jena/jena-arq "3.6.0"], with [org.slf4j/slf4j-nop "1.7.12"] to suppress logging.
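For anyone reproducing this with Leiningen, those two coordinates would go into project.clj roughly as follows. The project name, project version, and Clojure version here are placeholders, not taken from the knode repository.

```clojure
;; project.clj (sketch; name and versions other than the two Jena/SLF4J
;; coordinates are placeholders)
(defproject example/jena-poc "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]
                 [org.apache.jena/jena-arq "3.6.0"]
                 ;; no-op SLF4J binding, which silences Jena's logging
                 [org.slf4j/slf4j-nop "1.7.12"]])
```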