tdt / core

Transform any dataset into an HTTP API with The DataTank
http://thedatatank.com

Incorrect pagination in SPARQL CONSTRUCT dataset #413

Open erikap opened 7 years ago

erikap commented 7 years ago

If limit/offset are not specified in the SPARQL query of the dataset definition, the SPARQLController calculates the pagination itself. This seems to be done incorrectly for CONSTRUCT queries.

If the query has a structure like:

CONSTRUCT { <construct_definition> } WHERE { <where_clause> }

the number of results is calculated as follows:

SELECT COUNT(*) as ?number WHERE { <where_clause> }

This doesn't yield a correct result in case of a CONSTRUCT query.
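To illustrate why the count is off: the COUNT query counts the solutions of the WHERE clause, while a CONSTRUCT returns the set of triples obtained by filling in the template for each solution. A minimal sketch in plain Python, with made-up data and a made-up template (none of these names come from TDT), shows the two numbers diverging:

```python
# Hypothetical data: each dict is one solution of the WHERE clause.
solutions = [
    {"s": "ex:alice", "name": "Alice", "city": "ex:ghent"},
    {"s": "ex:bob", "name": "Bob", "city": "ex:ghent"},
]

# Hypothetical CONSTRUCT template with three triple patterns.
# Terms starting with "?" are variables; everything else is a constant.
template = [
    ("?s", "ex:name", "?name"),
    ("?s", "ex:livesIn", "?city"),
    ("?city", "rdf:type", "ex:City"),
]


def instantiate(term, solution):
    """Replace a variable with its binding; leave constants untouched."""
    return solution[term[1:]] if term.startswith("?") else term


def construct(template, solutions):
    """Build the result graph as a set, so duplicate triples collapse."""
    return {
        tuple(instantiate(term, sol) for term in pattern)
        for sol in solutions
        for pattern in template
    }


solution_count = len(solutions)  # what the COUNT query reports
triple_count = len(construct(template, solutions))  # what the client receives
print(solution_count, triple_count)
```

Here the WHERE clause has 2 solutions, but the constructed graph has 5 triples: each solution yields 3 triples, and the shared `ex:ghent rdf:type ex:City` triple is only kept once.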

Thanks to @bertvannuffelen for the catch.

coreation commented 7 years ago

Indeed. Isn't the problem here how to implement counting for a construct(ed) result, i.e. what exactly to count?

bertvannuffelen commented 7 years ago

Well, the only way to count is to execute the construct and then implement paging based on that result. The supplying SPARQL endpoint should/could provide pagination in this case, but not all do (Virtuoso, for instance, does not).

So far, the only solution is to rely on the finiteness of the response of the construct (that users do not request stupid things). Pagination can only be implemented by collecting the complete response in a temporary structure.

coreation commented 7 years ago

@bertvannuffelen how would you handle paging in a temporary structure? Am I correct in assuming that we cannot rely on a SPARQL endpoint returning the same order of triples for a construct query in consecutive calls? If that's the case, are you suggesting caching the result for a query, performing in-memory paging and returning the result, and, when a different page of the query is requested, getting the cached object, paging it in memory and returning it to the client instead of performing the SPARQL query again?

bertvannuffelen commented 7 years ago

SPARQL construct queries always return the complete answer. However, it is up to the SPARQL endpoint implementation to handle the possible need for pagination. And that is where the problem lies: most do not support pagination for construct queries.

So CONSTRUCT { ... } WHERE { ... } will return all information at once.

This can be the whole database, e.g. with this query: CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }

Now for small volumes, there is no problem. For larger volumes, clients might stumble on it. For very large volumes, the supplying SPARQL endpoint will apply a strategy to reduce the chance of dying. Virtuoso does that by implementing a cut-off in the response (the magic 10000 number, part of the Virtuoso configuration). If you get 10K triples/response rows, you do not know whether there were exactly 10K triples/response rows or more.
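The ambiguity around the cut-off can be made explicit in client code. A minimal sketch, assuming the cut-off value is known to the client (the 10000 default mirrors the number mentioned above; the function name is hypothetical):

```python
def may_be_truncated(row_count, cutoff=10000):
    """Return True when a response is as large as the endpoint's cut-off.

    A response that hits the cut-off may or may not be complete;
    only a smaller response is known to be the full answer.
    """
    return row_count >= cutoff


print(may_be_truncated(9999))   # False: the response is complete
print(may_be_truncated(10000))  # True: completeness cannot be determined
```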

I am indeed suggesting that for construct queries (for selects the current approach works fine) "caching the result for a query, perform in memory paging, returning the result and when a different page of the query is requested, get the cached object, page it in memory and return it to the client instead of performing the SPARQL query" is the approach.
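The approach quoted above could look roughly like the following. This is only a sketch under stated assumptions, not TDT's implementation: `run_construct` is a hypothetical callable that executes the query and returns (s, p, o) tuples, and a plain dict stands in for a real cache backend. The triples are sorted before caching so that every page request sees the same order.

```python
import hashlib

_cache = {}  # stand-in for a real cache backend (file, memcached, ...)


def cache_key(query):
    """Derive a stable cache key from the query text."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def get_page(query, offset, limit, run_construct):
    """Return one page of a CONSTRUCT result plus the total triple count."""
    key = cache_key(query)
    if key not in _cache:
        # Execute the query once; deduplicate and sort for a stable order.
        _cache[key] = sorted(set(run_construct(query)))
    triples = _cache[key]
    return triples[offset:offset + limit], len(triples)


# Usage with a fake endpoint that records how often it is called.
calls = []


def fake_endpoint(query):
    calls.append(query)
    return [("ex:b", "ex:p", "2"), ("ex:a", "ex:p", "1"), ("ex:c", "ex:p", "3")]


q = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"
page1, total = get_page(q, 0, 2, fake_endpoint)
page2, _ = get_page(q, 2, 2, fake_endpoint)
print(total, len(calls))
```

The second page is served from the cached object, so the fake endpoint is queried only once. A real setup would also need cache expiry, which is left out here.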

I see no other alternative for the moment (unless selecting a SPARQL endpoint that implements pagination on all requests).

Constructs are actually used in the TDT setting for 2 cases:

a) subject pages (provide all info about a subject)
b) complete datasets

For a), the caching object will be mostly empty if the limit is not set too low. For b), it depends on the volume of the dataset, which varies widely in that case. For b) with large volumes, one could imagine creating a temporary file with the output on disk (compressed) and returning that.
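For the large-volume case, writing the output to a compressed temporary file might be sketched as follows. The function name and the choice of one serialized triple per line are assumptions for illustration only:

```python
import gzip
import tempfile


def dump_compressed(lines):
    """Write serialized triples, one per line, to a gzip temp file.

    Returns the path of the file; the caller is responsible for
    serving it and deleting it afterwards.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=".nt.gz", delete=False)
    tmp.close()  # only the path is needed; gzip reopens the file
    with gzip.open(tmp.name, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return tmp.name


path = dump_compressed(
    ['<http://example.org/a> <http://example.org/p> "1" .']
)
print(path)
```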

coreation commented 7 years ago

L4.2 also has a swappable caching mechanism; file, memcached and a few others are supported out of the box, so I don't think storing the object in a file ourselves would be necessary.