spaziocodice / SolRDF

An RDF plugin for Solr
Apache License 2.0
Solr-specific query optimizations #96

Open agazzarini opened 9 years ago

agazzarini commented 9 years ago

The first implementation step of the Solr-Jena bridge has actually been completed: as suggested by the Jena devs, it is basically a Solr-specific implementation of the Jena graph and dataset domain model.

Now it's time to move on to non-functional requirements, efficiency first of all: the default behaviour of Op and related classes (and, I think, of many of the things in charge of managing the query algebra and its execution) needs to be adapted / specialized in order to provide Solr-specific optimizations.

As I am almost ignorant about those topics, I'm trying to study them, but I believe it will take me a bit of time. If there's someone who is more expert than me (very easy) or who simply wants to join this adventure, feel free to give me a shout ;)

agazzarini commented 9 years ago

The first step is a Solr-specific implementation of OpBGP and of the corresponding execution plan.

An idea (that I'm testing) is:

In this way the total number of operations needed should be smaller than in the current (default) implementation.

agazzarini commented 9 years ago

A great step ahead: I created a first working implementation of the Jena StageGenerator, which is in charge of executing and resolving Basic Graph Patterns (BGPs), the SPARQL building blocks.

It leverages low-level Solr / Lucene features in order to speed up and optimize pattern execution. At first glance I see good results, so it seems the idea could work. However, I need

agazzarini commented 9 years ago

The work above has been committed to a dedicated branch (issue_89), so it's not in master.

agazzarini commented 9 years ago

There are still a lot of things to do. I'm trying to build a bridge between the Jena Op / OpExecutor framework and the Solr world. The general iterator-based behaviour of the Jena classes (i.e. QueryIterator) sometimes doesn't fit very well with the Solr logic, especially when a lot of members participate in the query execution plan. Something, for example, like this:

(project (?first ?last ?workTel)
  (conditional
    (filter (> ?amount 10000)
      (bgp
        (triple ?s <http://learningsparql.com/ns/addressbook#firstName> ?first)
        (triple ?s <http://learningsparql.com/ns/addressbook#lastName> ?last)
        (triple ?s <http://learningsparql.com/ns/addressbook#portfolio> ?amount)
      ))
    (bgp (triple ?s <http://learningsparql.com/ns/addressbook#workTel> ?workTel))))

(project (?first ?last ?workTel)
  (filter (> ?amount 10000)
    (leftjoin
      (bgp
        (triple ?s <http://learningsparql.com/ns/addressbook#firstName> ?first)
        (triple ?s <http://learningsparql.com/ns/addressbook#lastName> ?last)
        (triple ?s <http://learningsparql.com/ns/addressbook#portfolio> ?amount)
      )
      (bgp (triple ?s <http://learningsparql.com/ns/addressbook#workTel> ?workTel)))))

So what I'm trying to build is a new set of classes that act as reducers from a given algebra expression to a Solr DocSet. These classes also need to implement the Jena QueryIterator interface in a lazy way, that is: when Jena asks for Bindings or QuerySolutions, they will produce them on demand. Before that, they will work only with the Solr / Lucene data model, optimizing and compacting the operations according to the capabilities of the corresponding query parser.
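To make the "reduce first, produce bindings on demand" idea concrete, here is a minimal sketch in plain Java. It is not SolRDF or Jena code: `java.util.BitSet` stands in for a Solr DocSet, `java.util.Iterator` stands in for Jena's QueryIterator, and the names `LazyDocSetIterator`, `intersect` and `loader` are hypothetical, chosen only for illustration.

```java
import java.util.BitSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Hypothetical sketch: iterates a set of doc ids and builds a binding only when asked. */
final class LazyDocSetIterator<B> implements Iterator<B> {
    private final BitSet docIds;          // stand-in for a Solr DocSet
    private final IntFunction<B> loader;  // turns a doc id into a binding, on demand
    private int nextId;                   // next doc id to emit, or -1 when exhausted

    LazyDocSetIterator(BitSet docIds, IntFunction<B> loader) {
        this.docIds = docIds;
        this.loader = loader;
        this.nextId = docIds.nextSetBit(0);
    }

    /** Reduction step: combines two doc sets without touching any document. */
    static BitSet intersect(BitSet left, BitSet right) {
        BitSet result = (BitSet) left.clone();  // leave the inputs unmodified
        result.and(right);
        return result;
    }

    @Override
    public boolean hasNext() {
        return nextId >= 0;
    }

    @Override
    public B next() {
        if (nextId < 0) {
            throw new NoSuchElementException();
        }
        B binding = loader.apply(nextId);       // the only place a document is loaded
        nextId = docIds.nextSetBit(nextId + 1);
        return binding;
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(1); a.set(3); a.set(5);
        BitSet b = new BitSet();
        b.set(3); b.set(5); b.set(7);
        Iterator<String> it = new LazyDocSetIterator<>(intersect(a, b), id -> "doc" + id);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

In this sketch all the set algebra happens on doc ids, and the loader runs once per result that the caller actually consumes, so a caller that stops early never pays for the remaining documents.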

agazzarini commented 9 years ago

A first implementation of Basic Graph Pattern execution seems to be working. It works directly at the low Lucene level, executing successive joins between the docsets that result from each triple pattern in the graph.
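As a rough illustration of joining the docsets produced by each triple pattern, here is a plain-Java sketch. It makes several assumptions that the comments above do not confirm: one indexed document per triple, a star-shaped BGP whose patterns all share the subject variable, and a join computed by intersecting subject values. `BgpJoinSketch`, `TripleDoc`, `docSetFor`, `subjectsOf` and `joinOnSubject` are hypothetical names, and an in-memory list stands in for the Lucene index.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a BGP resolved as successive joins between doc sets. */
final class BgpJoinSketch {

    /** Assumption: every indexed document holds exactly one triple. */
    static final class TripleDoc {
        final String s, p, o;
        TripleDoc(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    /** Doc ids matching the triple pattern (?s predicate ?o). */
    static BitSet docSetFor(List<TripleDoc> index, String predicate) {
        BitSet docs = new BitSet(index.size());
        for (int id = 0; id < index.size(); id++) {
            if (index.get(id).p.equals(predicate)) {
                docs.set(id);
            }
        }
        return docs;
    }

    /** Subject values of the documents in a doc set. */
    static Set<String> subjectsOf(List<TripleDoc> index, BitSet docs) {
        Set<String> subjects = new TreeSet<>();
        for (int id = docs.nextSetBit(0); id >= 0; id = docs.nextSetBit(id + 1)) {
            subjects.add(index.get(id).s);
        }
        return subjects;
    }

    /** Joins the doc sets of all patterns on the shared ?s variable, one pattern at a time. */
    static Set<String> joinOnSubject(List<TripleDoc> index, List<String> predicates) {
        Set<String> joined = null;
        for (String predicate : predicates) {
            Set<String> subjects = subjectsOf(index, docSetFor(index, predicate));
            if (joined == null) {
                joined = subjects;
            } else {
                joined.retainAll(subjects);
            }
            if (joined.isEmpty()) {
                break;  // no subject can satisfy the remaining patterns
            }
        }
        return joined == null ? new TreeSet<>() : joined;
    }

    public static void main(String[] args) {
        List<TripleDoc> index = Arrays.asList(
            new TripleDoc("alice", "firstName", "Alice"),
            new TripleDoc("alice", "lastName", "Smith"),
            new TripleDoc("alice", "portfolio", "20000"),
            new TripleDoc("bob", "firstName", "Bob"),
            new TripleDoc("bob", "lastName", "Jones"));
        System.out.println(joinOnSubject(index, Arrays.asList("firstName", "lastName", "portfolio")));
    }
}
```

The early exit when the running result becomes empty is one place where a join-based plan can do less work than evaluating every pattern independently.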

Again, the underlying idea seems to work but needs some more time: I tried running the integration suite and there are some expected failures (but also a lot of green tests), so the issue_89 branch is definitely unstable.

agazzarini commented 9 years ago

The issue_89 branch contains a rough implementation of

There are still 14 failures and 8 errors in the SELECT tests. They are mainly