quadstorejs / quadstore

A LevelDB-backed graph database for JS runtimes (Node.js, Deno, browsers, ...) supporting SPARQL queries and the RDF/JS interface.
https://github.com/quadstorejs/quadstore
MIT License
202 stars 14 forks source link

SPARQL join tests & fixes #127

Closed gsvarovsky closed 3 years ago

gsvarovsky commented 3 years ago

Trying to track down some issues with SPARQL joins. There are two problems; I have only fixed one. See test/sparql/sparql.select.join.js:

  1. Doing a basic join with two quad patterns (should join quad patterns) passes its test only in MemDOWN. I have scratched my head about why this might be but to no avail. @jacoscaz it would be great if you could have a quick look.
  2. Doing a join where some of the candidate matches considered during the query have a literal in the Subject position was causing an Error to be thrown inside quadstore. I have fixed this by checking for Term Types in unexpected positions. In doing so I noted that the match API in quadstore is more restrictive than that in the RDFJS Store API. It appears that Comunica relies upon the looser definition taking Terms for all positions.
jacoscaz commented 3 years ago

This is very weird! When run separately, the two SELECT queries with ?s <http://ex.com/knows> ?s2 and ?s2 <http://ex.com/greeting> "hi" seem to work all right on all backends. Furthermore, Comunica doesn't seem to be doing a nested join the way I had expected. I'll look into this and let you know.

jacoscaz commented 3 years ago

In the meantime, the RDF 1.1 spec doesn't seem to allow Literal nodes as subjects: https://www.w3.org/TR/rdf11-concepts/#section-triples

gsvarovsky commented 3 years ago

RDF 1.1 spec doesn't seem to allow Literal nodes as subjects

Indeed not. However, it seems that when the variables candidate bindings are being explored, Comunica nevertheless can end up calling match with a literal as the first argument. Hence, my fix is to make quadstore allow that, but simply return an empty iterator. According to the strict Store API spec, any Term can be used, so I think my fix is correct. Maybe @rubensworks can confirm.

jacoscaz commented 3 years ago

@rubensworks a couple of questions for you:

1) When using patterns like this one

?s2 <http://ex.com/greeting> "hi" .
?s ?p ?s2 .

Comunica seems to invoke the match() method of RDF/JS sources even when s2 is bound to a Literal, which should make it unqualified to be used as a subject if I understand correctly. Leaving aside that quadstore is making this harder to debug because there's something preventing the resulting errors from bubbling up to (or through) Comunica, how should I address cases like this? Should I just return an empty set/iterator of quads?

2) We have a test case in which the following query results in an empty set of quads even though it shouldn't. This only happens with disk-based storage backends and doesn't happen with the in-memory backend.

      SELECT * {
          ?s <http://ex.com/knows> ?s2 .
          ?s2 <http://ex.com/greeting> "hi"
        }

What's strange is that if I run two separate queries, one for each of those patterns, I get the expected results and I am able to join them by hand. As all the backends are abstracted via the AbstractLevelDOWN interface, could there be a timing issue somewhere? Have you ever encountered anything like this?

gsvarovsky commented 3 years ago

Thanks for having a look @jacoscaz. With regard to the Leveldown behaviour being different to Memdown, I've been experimenting with a hypothesis that there is a race condition – in Memdown the race is always won because of the very fast query response, on the microtask queue. In particular, I thought it was suspicious that quadstore's match implementation has to wait for a promise to return before setting the source on the return iterator here: https://github.com/beautifulinteractions/node-quadstore/blob/560b28204150ff1dc047788c37cf8d8e8c58dbd2/lib/quadstore.ts#L178

TransformIterator is able to take a promise as a first argument, so I tried using the promise from the getStream call. All other tests pass, but it doesn't fix the problem, sadly. Investigations continue!

jacoscaz commented 3 years ago

Quick update on this - I've tried refactoring getStream() into a synchronous method so that match() immediately gets the actual iterator rather than the pass-through transform iterator but doing so did not make any difference WRT to the join issue. I wonder if there's something in the lib/get/leveliterator.ts wrapper that is causing this?

rubensworks commented 3 years ago

Comunica seems to invoke the match() method of RDF/JS sources even when s2 is bound to a Literal, which should make it unqualified to be used as a subject if I understand correctly.

Indeed, strictly speaking, RDF does not allow literals in subject position. Comunica deliberately does not strictly check this, as some users may want to use it for querying under generalized RDF mode.

Furthermore, always checking term types before binding may introduce some performance overhead, so Comunica expects that the store simply returns empty results in those cases. This should be in line with the semantics of RDF/JS, where generalized RDF is also possible by default.

We have a test case in which the following query results in an empty set of quads even though it shouldn't.

In case you are manually creating asynciterators, you could try enabling the autoStart flag. This may have something to do with it. If you're not creating such iterators, then I don't really have an idea what could be causing this.

jacoscaz commented 3 years ago

so Comunica expects that the store simply returns empty results in those cases

Sounds good! I was not aware of the generalized RDF mode.

you could try enabling the autoStart flag

Good idea, I'll test this shortly. Thanks!

jacoscaz commented 3 years ago

I think I've found the culprit. There are cases in which store.db.approximateSize() rounds the actual non-zero size to 0. This leads Quadstore.prototype.getApproximateSize() to return 0, which in turn causes Quadstore.prototype.countQuads() to return 0, which in turn tricks Comunica into thinking there are no quads available.

jacoscaz commented 3 years ago

@gsvarovsky I've just pushed the fix for the broken JOIN queries to master, including your tests. Can you test it?

jacoscaz commented 3 years ago

@gsvarovsky just pushed the fix for breaking queries when in generalized rdf mode with a literal subject. I'll release tomorrow or whenever you get a chance to validate these fixes. Just FYI, a more definitive fix (and potentially support for generalized rdf mode, not sure yet) will come in v8.0, for which I'm planning a significant upheaval of the (de)serialization mechanism that will save a ton of storage space and should get us much closer to the raw performance of the AbstractLevelDOWN backend.

gsvarovsky commented 3 years ago

Super! I've tested locally using the quadstore unit tests and m-ld's unit test suite with Memdown. I'm buried in a branch which is not ready for integration tests (where persistence is tested), so if you don't mind, I'll take it on faith that Leveldown works too, for now.