quadstorejs / quadstore

A LevelDB-backed graph database for JS runtimes (Node.js, Deno, browsers, ...) supporting SPARQL queries and the RDF/JS interface.
https://github.com/quadstorejs/quadstore
MIT License
203 stars 14 forks source link

PERFORMANCE: push filtering and sorting operations from comunica down to quadstore (eventually through the RDF/JS Expression spec) #115

Closed jacoscaz closed 7 months ago

jacoscaz commented 3 years ago

As I am writing this, the SPARQL query engine in quadstore-comunica does all of its sorting and filtering in-memory. This leads to a variety of performance issues as even very simple queries can make Comunica go through unnecessarily large numbers of quads (often entire indexes).

As an example, any query with an ORDER BY clause on a variable bound to a BGP operation forces Comunica to accumulate all quads coming in via the iterator returned by Source#match() in memory, sort them in memory and then proceed to the next operation. This is likely what is happening in #135, where the presence of the LIMIT clause has basically no effect on performance as Comunica still has to go through the entire set of quads matching the pattern in order to sort it by the required variable before it is able to apply the limit.

Similarly, even the most basic FILTER() expressions are always applied in memory, with Comunica having to go through the entire set of quads matching the relevant pattern.

These issues can be addressed by pushing sorting and filtering operations, where possible, down to the persistence layer. Quadstore already supports range-based queries which can be used to optimize FILTER()s. ORDER BY clauses can be optimized by augmenting how the final index is selected based on a requested ordering.

Long-term, the goal is to add support for the upcoming RDF/JS Expression to both Quadstore and Comunica. However, starting with full-blown support for RDF/JS Expression might prove to be too much too soon and we've opted for an incremental approach in two stages:

  1. implement quadstore-specific actors in quadstore-comunica
  2. generalize the actors developed in stage 1. into RDF/JS Expression actors

Stage 1: quadstore-specific actors

The general idea is to add two new actors:

The plan looks roughly as follows:

1. Scaffolding

2. Sorting

3. Separate iterator instantiation from metadata

4. Data Model

5. Extracting operations

Stage 2: generic RDF/JS Expression actors

TBD

jacoscaz commented 3 years ago

@gsvarovsky @allforabit @rubensworks tagging you all as I know you're interested in this.

jacoscaz commented 3 years ago

Work ongoing in https://github.com/beautifulinteractions/node-quadstore-comunica/tree/sparql-optimization

jacoscaz commented 3 years ago

Some notes for posterity and to keep anyone who might be following this in the loop:

I had begun by thinking about optimization scenarios that could gracefully fallback to their non-optimized / raw alternatives. The idea was to modify the tree of Algebra.Operations produced by Comunica's SPARQL parser actor through a new optimizer actor that would extract optimizable operations into constructs to be passed to a quadstore-specific resolver actor via the context side-channel.

However, I fear this approach might not be a good fit for Comunica. The main issue I see is that there does not seem to be a planning phase (whereby an Algebra.Operation tree can be optimized based on whether calls to .test() methods succeed or not) that is separate from the query execution phase (whereby the entire tree is converted into chained iterators via .run() invocations). This makes cross-operation optimizations fairly complex as there is no way to ascertain whether all operations of an "optimized" tree can actually be carried out by the underlying sources without initiating the instantiation of all intermediary iterators, something which I think is too computationally heavy for the purpose of trying to figure out what can be pushed down to sources.

This led me to a different approach, which I've named best-effort optimizations with a pessimistic outlook (I can hear the groans...). The idea here would be to leave the tree as-is but enable Comunica to detect whether an operation is required or not based on metadata returned from sources and internally propagated by its actors.

Let's use ordering as an example. With this approach, the optimizer actor would parse the tree without modifying it, looking for ordering operations that might be supported. It would then pass these operations to the quadstore-specific resolver and the latter would, in turn, push them down to sources. At this point, regardless its capacity to match the requested ordering, quadstore would still use AsyncIterator's metadata features to inform Comunica about the ordering in which it returns quads. Ordering information would then be relayed throughout Comunica's internal chain up to the ordering actors, who would then be able to figure out whether to sort in memory or simply act in a pass-through manner due to quads being already ordered.

For filtering, the approach would be similar but I would simply keep the filtering actors as they are and pay the performance penalty of redundant filtering happening in-memory rather than find a way to represent all the possible ways in which quads might be filtered at the source.

@rubensworks summarized it nicely:

[...] you would basically pass down filters and ordering operations down to the sources (e.g. via the context), without modifying the actual algebra tree. These sources could then do the filtering and ordering they can, and Comunica could skip some operations if the result metadata mentions that certain things are already sorted/filtered. I think something like this could work.

This approach seems better suited to Comunica while also being, at least IMHO, much easier to think about in terms of generalizing it to the RDF/JS spec.

jacoscaz commented 3 years ago

For posterity, some work needed to be done in Comunica to allow optimizer actors to modify the context and have the modified version passed forward to actors further down the chain. PR: https://github.com/comunica/comunica/pull/808 - MERGED!

jacoscaz commented 3 years ago

Finally managed to add support for sorting. Another small step towards being able to optimize SPARQL queries by pushing some of Comunica's filtering and ordering down to Quadstore.

jacoscaz commented 7 months ago

Closing as tracked elsewhere in https://github.com/jacoscaz/quadstore-comunica/issues/4 .