kjetilk / p5-atteanx-query-cache

Experimental prefetching SPARQL query cacher, take 2

Cartesian joins caused by cached triples #2

Closed kjetilk closed 8 years ago

kjetilk commented 8 years ago

In #1, we argued that cartesian joins should not be evaluated by the remote endpoint. In the test "3-triple BGP where cache breaks the join to cartesian", the situation is that there is a chain-shaped query, and the middle TP is cached:

- Hash Join { s } (distinct)
-   Quad { ?a, <c>, ?s, <http://test.invalid/graph> }
-   Hash Join { o }
-     Table (?s, ?o)
-       {o=<http://example.org/baz>, s=<http://example.com/foo>}
-       {o=<http://example.org/foobar>, s=<http://example.com/foo>}
-       {o=<http://example.org/bar>, s=<http://example.org/foo>}
-     SPARQLBGP
-       Quad { ?o, <b>, "2", <http://test.invalid/graph> }

In this case, the reason why the cartesian arises is the presence of a cached TP result. This could be a really bad thing for the remote endpoint, and it would possibly be better to evaluate the entire query remotely.

Presently, we assume that if a cached result is present for a TP, that part of the query is always evaluated locally.

kjetilk commented 8 years ago

The current code is able to generate plans that have both a full SPARQL query and a broken-up query. Thus, it is now entirely up to the cost model in #4 to address this problem.

kasei commented 8 years ago

The plan shown here doesn't have a cartesian product...(?) Is the issue that it has two different Quads that can't be joined in a single BGP?

kjetilk commented 8 years ago

Yeah, but wouldn't

SPARQLBGP
-   Quad { ?o, <b>, "2", <http://test.invalid/graph> }
-   Quad { ?a, <c>, ?s, <http://test.invalid/graph> }

mean that the remote endpoint has to evaluate a cartesian?
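To make the concern concrete, here is a minimal Python sketch (not Attean code) that groups quad patterns by shared variables. More than one group means the patterns can only be combined with a cartesian product. The function names are hypothetical, and so is the predicate `<p>` of the cached middle pattern, which is not shown in the plan above; only its variables `?s` and `?o` are known from the Table.

```python
from itertools import combinations


def variables(pattern):
    """Return the set of variable terms (those starting with '?') in a pattern."""
    return {term for term in pattern if term.startswith("?")}


def join_components(patterns):
    """Group patterns into components connected by shared variables.

    More than one component means that joining the patterns
    requires a cartesian product between the components.
    """
    # Simple union-find over pattern indices.
    parent = list(range(len(patterns)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Merge any two patterns that share at least one variable.
    for i, j in combinations(range(len(patterns)), 2):
        if variables(patterns[i]) & variables(patterns[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, pattern in enumerate(patterns):
        groups.setdefault(find(i), []).append(pattern)
    return list(groups.values())


GRAPH = "<http://test.invalid/graph>"
first = ("?a", "<c>", "?s", GRAPH)
middle = ("?s", "<p>", "?o", GRAPH)  # hypothetical predicate; this is the cached TP
last = ("?o", "<b>", '"2"', GRAPH)

full_chain = join_components([first, middle, last])
without_cached = join_components([first, last])

print(len(full_chain), len(without_cached))  # prints "1 2"
```

With all three patterns the chain is connected through `?s` and `?o`, so there is one component. Once the cached middle pattern is taken out, the two remaining quads share no variable and fall into two components, which is the cartesian situation asked about here.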

kjetilk commented 8 years ago

Given that the SPARQLES survey found that it is unlikely that we will get complete results for many single-quad BGPs, and that we're not committing to a very elaborate cost model, I think we can close this. The current code will pass such a BGP to a remote endpoint, even though that means it doesn't use a cached result. Also, a better solution to this problem would probably build on Maribel's SHEPHERD work, which hasn't been published in full yet, so this is reasonable future work.