openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

sparql: counts don't seem to be reliable #112

Open joernhees opened 10 years ago

joernhees commented 10 years ago

I'm trying to get a top type count for DBpedia (Virtuoso version 07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50

returns (among other rows) this row: http://dbpedia.org/ontology/Place 89498

Out of curiosity i checked this again with the query below:

select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }

tells me it's 754450.

There's an order of magnitude difference in these 2 counts. Please tell me I'm doing it wrong.

PS: i tried the first query without the group by, order by and limit clause, doesn't make a difference.

kidehen commented 10 years ago

Extending the timeout parameter increases the time allotted to producing the query solution. This feature is critical to letting the whole world use DBpedia, rather than letting specific queries monopolize processing time.

See the different results produced when I doubled up the processing time: http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+%3Ftype+count%28distinct+%3Fs%29+as+%3Fc+where+%7B%0D%0A++%3Fs+a+%3Ftype.%0D%0A%7D%0D%0Agroup+by+%3Ftype%0D%0Aorder+by+desc%28%3Fc%29%0D%0Alimit+50&format=text%2Fhtml&timeout=60000&debug=on

Note, there are hard limits also configured on the server that override what may come in from an HTTP client.

indeyets commented 10 years ago

@kidehen timeouts are understandable. But giving a wrong result because of a timeout is a whole different story.

shouldn't it report failure instead? that's what fuseki does, for example.

both outcomes are not helpful, but fuseki doesn't provide false results

kidehen commented 10 years ago

This isn't a false result.

This is a solution to the query within the constraints of a timeout. The server should indicate via HTTP response metadata the nature of the solution i.e., partial or complete.

This is a feature of Virtuoso.

joernhees commented 10 years ago

Thanks, the time limit explains a bit, but this "feature" is highly confusing if not dangerous because the user (in this case me and i'm not exactly a novice) might not be aware that all the counts might be terribly wrong.

Is there any way to distinguish a "cut-off" result from one which is accurate?

I had assumed that a query which hits a timeout limit would return with an error (something like a 408, even though i'm not sure it's actually the right one) instead of silently returning wrong results.

kidehen commented 10 years ago

On 12/3/13 9:18 AM, Jörn Hees wrote:

Thanks, the time limit explains a bit, but this "feature" is highly confusing if not dangerous because the user (in this case me and i'm not exactly a novice) might not be aware that all the counts might be terribly wrong.

There is a DBpedia fair-use document [1],[2] about this matter. You won't have this issue if you are running your own Virtuoso instance with the DBpedia dataset. Please remember, on the World Wide Web we have to cater for everyone. The Web presents unique challenges to DBMS technology that we address in Virtuoso, specifically.

[1] http://dbpedia.org/OnlineAccess -- search on "Fair Use Policy" [2] http://lists.w3.org/Archives/Public/public-lod/2011Aug/0028.html

Is there any way to distinguish a "cut-off" result from one which is accurate?

This should be part of the response headers. Note: X-SQL-State: S1TAT

I had assumed that a query which hits a timeout limit would return with an error (something like a 408, even though i'm not sure it's actually the right one) instead of silently returning wrong results.

Yes, there has to be HTTP response metadata indicating the state of affairs, to the degree possible. The problem right now is that we don't have any standardization here. 408 doesn't cut it because it implies the request couldn't be completed. In our case, we are completing a task within a set time that has been reached.

The closest analogy here is a quiz contest where you have X seconds to answer a question, this is the very model to which Virtuoso's query engine has been developed.
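A client can test for the X-SQL-State indicator mentioned above in a few lines. A minimal sketch (the helper names and the endpoint/query handling are illustrative; only the `S1TAT` state value comes from this thread):

```python
# Sketch: build a SPARQL Protocol GET URL and check Virtuoso's
# partial-result indicator. Helper names are illustrative; the
# X-SQL-State value S1TAT is the one quoted in this thread.
import urllib.parse

VIRTUOSO_PARTIAL_STATE = "S1TAT"  # state for "query interrupted by result timeout"

def sparql_url(endpoint, query, timeout_ms=30000, fmt="text/csv"):
    """Build a SPARQL Protocol GET URL with an explicit timeout parameter."""
    params = urllib.parse.urlencode({
        "query": query,
        "format": fmt,
        "timeout": timeout_ms,
    })
    return f"{endpoint}?{params}"

def is_partial(headers):
    """Given a dict of response headers, report whether the server
    flagged the solution as cut off by the anytime-query timeout."""
    return headers.get("X-SQL-State") == VIRTUOSO_PARTIAL_STATE
```

An HTTP client would fetch `sparql_url(...)` and pass the response headers to `is_partial` before trusting any aggregate values.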

joernhees commented 10 years ago

@kidehen neither of the links you provide describe/warn of the reported problem: that counts can be wrong if a timeout is hit.

I don't seem to have gotten my point across, let me try again:

I'm not arguing with fair use, timeouts or limits to be able to satisfy more users. I'm a fan!

I'm arguing with the way you're treating a timeout. If a query takes too long there are two ways of dealing with this:

  1. Return an error, not a result. This makes a developer, user or scientist (with a quick one-off SPARQL query in your web interface) look into it again. They will definitely not run the risk of using a wrong result, as there is none!
  2. Return a result with a BIG WARNING. This probably leads to the warning being lost somewhere in the process, never shown to the user, and the numbers being taken for granted in the end. This is what happened here. Not even your own HTML result page of your SPARQL Web Interface shows the tiniest hint to the user that he should be careful. Even if you're arguing that this is not the prime "end user"... can you name a widely used SPARQL client / lib which handles this correctly?

I can see your point of view, trying to answer the query as well as you can in the given time, but as this report demonstrates it is more dangerous than just returning an error.

joernhees commented 10 years ago

one addendum, sorry: it should be optional to get partial results, not an implicit default that you then have to check for

kidehen commented 10 years ago

On 12/4/13 6:50 AM, Jörn Hees wrote:

one addendum, sorry: it should be optional to get partial results, not an implicit default that you then have to check for

No, for the public DBpedia instance. We are deliberately not giving anyone the ability to hog the instance. The instance has to be accessible to the whole world; that's the basic requirement. Again, for those that want to make specific use of DBpedia, there is a range of options for making your own instance across:

  1. local setup
  2. cloud setup -- e.g., Amazon AMI.

There are also other instances of DBpedia data across:

  1. http://lod.openlinksw.com/sparql -- LOD Cloud cache (this setup has more computing power behind it)
  2. http://dbpedia-live.openlinksw.com/sparql
  3. http://live.openlinksw.com/sparql

kidehen commented 10 years ago

On 12/4/13 6:49 AM, Jörn Hees wrote:

@kidehen https://github.com/kidehen neither of the links you provide describe/warn of the reported problem: that counts can be wrong if a timeout is hit.

I don't seem to have gotten my point across, let me try again:

I'm /not/ arguing with fair use, timeouts or limits to be able to satisfy more users. I'm a fan!

I'm arguing with the way you're treating a timeout. If a query takes too long there are two ways of dealing with this:

What you are not getting from my comment is the fact that there isn't a notion of "query taking too long"; the notion is "what solution can be produced in X amount of seconds, for a given query". There's a world of difference here. The technical challenge is old; SQL DBMS engines never even got to tackling this issue, since their usage context (closed world) doesn't expose the problem.

With DBpedia and the Web, everything is unpredictable. Data is fundamentally time-variant.

  1. return an error, not a result
  2. return a result with a BIG WARNING

We return an indicator via HTTP response (which you can test for) re. partial results.

  1. makes a developer, user or scientist (with a quick one-off sparql query in your web interface) look into it again. They will definitely not run in danger of using a wrong results as there is none!

DBpedia isn't a gospel of any kind. That isn't the purpose here. Please think about your request a little. You want a full Transitive Closure intermingled with entailments for all the entity relationship semantics in the data space? I am sure (as you digest that last sentence and its implications) you get the point re., the nature of the pursuit and its fundamental impracticalities, at Web-scale.

  1. probably leads to the warning being lost somewhere in the process, never be shown to the user and the numbers taken for granted in the end. This is what happened here. Not even your own HTML result page of your SPARQL Web Interface shows the tiniest hint /to the user/ that he should be careful. Even if you're arguing that this is not the prime "end user"... can you name a widely used sparql client / lib which handles this correctly?

I can see your point of view, trying to answer the query as well as you can in the given time, but as this report demonstrates it is more dangerous than just returning an error.

It is dangerous to attempt the opposite i.e., have no restrictions and let clients deliberately or inadvertently deprive others of use. As I said, there are other options for special use of DBpedia. It isn't right to assume DBpedia is there to produce complete solutions for any kind of query, issued by any kind of client, at any given point in time.

We have made a choice to make DBpedia available to the world, backed up with usage restrictions that defend the goal :-)

joernhees commented 10 years ago

On 4 Dec 2013, at 14:09, Kingsley Idehen notifications@github.com wrote:

It isn't right to assume DBpedia is there to produce complete solutions for any kind of query, issued by any kind of client, at any given point in time.

Are you sure that this is your statement? It's a marketing disaster. And it's not what i'm asking for / reporting here as a problem.

I just wanted the correct counts for types used on the DBpedia endpoint. There is no open world assumption in my query: i'm neither asking the SPARQL endpoint to resolve redirects, nor is this a federated query.

All i'm asking for are its counts at the current point in time. Nothing fancy and i could happily live with an error due to time exceeded.

Partial results are cool when you ask for them (explicitly), i didn't and most people don't.

If not explicitly asked for a partial result, it's more dangerous to report them in a very similar fashion to a complete result than reporting an error.

Ask a couple of developers what they would expect to happen… Rather get an error or a result that looks quite right but isn't?

Cheers, Jörn

kidehen commented 10 years ago

On 12/9/13 12:37 PM, Jörn Hees wrote:

On 4 Dec 2013, at 14:09, Kingsley Idehen notifications@github.com wrote:

It isn't right to assume DBpedia is there to produce complete solutions for any kind of query, issued by any kind of client, at any given point in time.

Are you sure that this is your statement?

My statement is this:

DBpedia is going to produce solutions to SPARQL queries subject to timeout limits and other constraints that have been deliberately configured to ensure global access, in line with its fair use policy. This is how DBpedia's SPARQL endpoint has been configured to run since inception.

It's a marketing disaster.

DBpedia isn't about marketing. I am making a statement about the technical infrastructure behind the DBpedia SPARQL endpoint.

And it's not what i'm asking for / reporting here as a problem.

You are reporting the fact that you are executing a specific query that (in the form you are seeking) exceeds some of the fair use constraints. There are other instances of the DBpedia dataset, associated with different infrastructure, that will give you more computing power per timeout restriction. Examples include:

[1] http://lod.openlinksw.com/sparql -- all you have to do is simply change the host part of your SPARQL Protocol URL to see what I mean re. this cluster edition of Virtuoso which also has more computing power behind it.

[2] http://dbpedia-live.openlinksw.com -- which doesn't match LOD for capacity but can have less concurrent traffic than the main dbpedia.org SPARQL endpoint.

I just wanted the correct counts for types used on the DBpedia endpoint.

The DBpedia Endpoint is one point of access for the DBpedia dataset. An Endpoint != a Dataset. It is a service that provides access to a dataset. There are other services providing access to the same dataset that are configured for more intensive use of the data. The main endpoint is for the whole world, and that's the focus of its configuration i.e., every one (human or machine) has fair use of the endpoint.

There is no open world assumption in my query: i'm neither asking the SPARQL endpoint to resolve redirects, nor is this a federated query.

That doesn't eradicate entailment, transitive closures, and related matters. Even if you aren't actually de-referencing HTTP URIs, does that apply to all other agents (human or machine) ?

All i'm asking for are its counts at the current point in time.

You are one of many.

Nothing fancy and i could happily live with an error due to time exceeded.

There are other endpoints, change the hostname part of the URL, as I've already told you.

Partial results are cool when you ask for them (explicitly), i didn't and most people don't.

I guess Google gives you complete results?

If not explicitly asked for a partial result it's more dangerous to report them in a very similar fashion to a complete result than reporting an error.

For you and your use case. You are but one agent.

Ask a couple of developers what they would expect to happen… Rather get an error or a result that looks quite right but isn't?

It has been so since 2007, so I don't understand what you are making a fuss about, especially when the lod.openlinksw.com/sparql instance will more than likely get you a complete answer, based on the nature of its configuration.

Cheers, Jörn

— Reply to this email directly or view it on GitHub https://github.com/openlink/virtuoso-opensource/issues/112#issuecomment-30153366.


joernhees commented 10 years ago

On 9 Dec 2013, at 20:18, Kingsley Idehen notifications@github.com wrote:

You are reporting the fact that you are executing a specific query that (in the form you are seeking) exceeds some of the fair use constraints.

wrong, read again

i'm fine with that and always have been

i report that how you treat a timeout is bad, not that there is a timeout

end of my feedback

thanks for your time

j

kidehen commented 10 years ago

On 12/9/13 2:56 PM, Jörn Hees wrote:

On 9 Dec 2013, at 20:18, Kingsley Idehen notifications@github.com wrote:

You are reporting the fact that you are executing a specific query that (in the form you are seeking) exceeds some of the fair use constraints.

wrong, read again

i'm fine with that and always have been

i report that how you treat a timeout is bad, not that there is a timeout

end of my feedback

thanks for your time

j


I am not claiming that the timeout treatment is perfect. I've told you repeatedly that we are using a custom HTTP header due to the lack of a standard header for this situation.

This isn't a 408 or 500 condition.

joernhees commented 10 years ago

@iv-an-ru any update on this?

sebastianthelen commented 10 years ago

http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSScalableInference contains a paragraph about partial query answering.

Apparently you get a hint that a query result is incomplete when executing it in isql (haven't tested it though).

jindrichmynarz commented 9 years ago

What is the custom HTTP header that is returned for partial results? Where is it documented?

jindrichmynarz commented 9 years ago

@kidehen: The only headers I see in responses from Virtuoso are Accept-Ranges, Cache-Control, Expires, Server, Connection, Content-Length, Content-Type and Date. I don't see any custom header, which would indicate partial results. This is what I get when running SELECT * WHERE { ?s ?p ?o . }, which is trimmed by ResultSetMaxRows set to 10000 in virtuoso.ini, on the latest develop version of Virtuoso.

kidehen commented 9 years ago

On 12/11/14 3:25 AM, Jindřich Mynarz wrote:

@kidehen : The only headers I see in responses from Virtuoso are Accept-Ranges, Cache-Control, Expires, Server, Connection, Content-Length, Content-Type and Date. I don't see any custom header, which would indicate partial results. This is what I get when running SELECT * WHERE { ?s ?p ?o . }, which is trimmed by ResultSetMaxRows set to 10000 in virtuoso.ini, on the latest develop version of Virtuoso.

We do provide a number of response headers of which X-SQL-State: S1TAT is our fundamental partial results indicator.

Example:

curl -I "http://lod.openlinksw.com/sparql?default-graph-uri=&query=select+distinct+*+where+%7B%5B%5D+a+%3Fo%7D+limit+50&format=text%2Fhtml&CXML_redir_for_subjs=121&CXML_redir_for_hrefs=&timeout=30000&debug=on"

HTTP/1.1 200 OK
Date: Thu, 11 Dec 2014 12:26:30 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 72
Connection: keep-alive
Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu  VDB
Accept-Ranges: bytes
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout.  Activity: 17 rnd 120K seq 0 same seg 0 same pg 0 same par 0 disk 0 spec disk 856.6KB / 72 mes
X-Exec-Milliseconds: 31315
X-Exec-DB-Activity: 17 rnd 120K seq 0 same seg 0 same pg 0 same par 0 disk 0 spec disk 856.6KB / 72 messages 11 fork

Links:

[1] http://docs.openlinksw.com/virtuoso/anytimequeries.html [2] http://lists.w3.org/Archives/Public/public-lod/2013Jun/0004.html

jindrichmynarz commented 9 years ago

Thanks for the explanation. However, I wasn't able to reproduce it on any other Virtuoso endpoint. For example, using the public DBpedia endpoint to execute SELECT * WHERE { ?s ?p ?o . }:

curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt
wc -l results.csv
# => 10001, i.e. trimmed results
cat headers.txt
# HTTP/1.1 200 OK
# Date: Thu, 11 Dec 2014 14:33:28 GMT
# Content-Type: text/csv; charset=UTF-8
# Content-Length: 1484509
# Connection: keep-alive
# Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu  VDB
# Expires: Thu, 18 Dec 2014 14:33:28 GMT
# Cache-Control: max-age=604800
# Accept-Ranges: bytes
# => i.e. no X-SQL-State header

Is the custom header only sent:

kidehen commented 9 years ago

Because the DBpedia instance has the following in its [SPARQL] INI section:

ResultSetMaxRows  = 10000

Meaning:

The maximum SPARQL solution size for this instance is 10,000 records (for SELECT) [1], 10,000 entity description triples (for DESCRIBE, which is the most taxing) [2], and 10,000 triples (for CONSTRUCT) [3]. This limit, combined with the query timeout, is what determines invocation of the "anytime query" feature, which is what leads to partial results while processing of the solution continues within the next timeout cycle.

Links:

[1] SELECT with LIMIT 1 -- http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+*+where+%7B%3Fs+a+%3Fo%7D+limit+1&format=text%2Fhtml&timeout=30000&debug=on

[2] DESCRIBE with LIMIT 1 -- http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=describe+%3Fs+where+%7B%3Fs+a+%3Fo%7D+limit+1&format=application%2Fx-nice-turtle&timeout=30000&debug=on

[3] LIMIT 1 -- http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=construct+%7B%3Fs+a+%3Fo%7D+where+%7B%3Fs+a+%3Fo%7D+limit+1&format=application%2Fx-nice-turtle&timeout=30000&debug=on

jindrichmynarz commented 9 years ago

Kingsley, it seems you haven't got my question. I'm well aware of the effect of the ResultSetMaxRows configuration. Let me try to clarify. My question was about the missing HTTP header indicating partial results. The problem is:

  1. Do curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt.
  2. Receive partial results (exactly because of the ResultSetMaxRows=10000).
  3. cat headers.txt => No header indicating partial results is there.

So, this indicates that receiving partial results is not sufficient condition for Virtuoso to provide the HTTP header informing that it indeed sent partial results. What are the necessary conditions of a SPARQL request in order for Virtuoso to send a response with the HTTP header indicating partial results?

kidehen commented 9 years ago

Arriving at the resultset size, for the solution, within the timeout. For your example, we already have a solution, and a 10K resultset, within 30000 msec. Thus, no partial-result response headers. Put differently, Virtuoso found 10K triples in less than 30,000 msec.

Action Item: A new custom header is being added for this scenario (it will be live by the time you read this mail), so as to provide additional information about this situation. Basically, X-MaxRows: {ini-hard-limit-value}, which in this case would be 10,000. Note the maximum is 2,000,000 for Virtuoso.

jindrichmynarz commented 9 years ago

OK, I see that if I run curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt, I find X-SPARQL-MaxRows: 10000 in the response headers. This is useful, but it doesn't tell if I have received partial results, because it may be the case that the total number of results is the same as ResultSetMaxRows.

In order to tell if I received partial results I need to execute an additional query, which is my original query wrapped in SELECT (COUNT(*) AS ?count) WHERE { { ... } }:

curl "http://dbpedia.org/sparql?query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%7B+SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D+%7D+%7D&format=text%2Fcsv"

The response for this query tells me there are 943138267 results for my query in total. Given that I know this number I can compare it with the number from the X-SPARQL-MaxRows header and conclude that I have indeed received partial results.

As you can see, executing twice as many queries just to be sure one's not receiving partial results is hardly optimal from the developer's perspective. I think a more developer-friendly solution might be to have a HTTP response header serving as a boolean flag indicating if results are partial or not, irrespective of the cause of incompleteness (e.g., ResultSetMaxRows or timeout query parameter).
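The double-query workaround described above can be sketched as follows (a sketch only; `wrap_in_count` and `received_all_results` are hypothetical helpers, and `X-SPARQL-MaxRows` is the header name reported earlier in the thread):

```python
# Sketch of the double-query completeness check described above.
# The caller is assumed to have issued the original query and captured
# the response headers and the number of rows received.

def wrap_in_count(query):
    """Wrap a SELECT query so the endpoint reports its total solution size."""
    return f"SELECT (COUNT(*) AS ?count) WHERE {{ {{ {query} }} }}"

def received_all_results(headers, rows_received, total_count):
    """Compare the rows we got against X-SPARQL-MaxRows and the COUNT result.
    total_count is the value returned by the wrapped COUNT query."""
    max_rows = headers.get("X-SPARQL-MaxRows")
    if max_rows is not None and rows_received >= int(max_rows):
        # We hit the hard limit; only the COUNT query can tell us whether
        # the full solution was actually larger.
        return total_count <= int(max_rows)
    return True
```

Two round trips per query, as noted, which is exactly the developer burden being discussed.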

joernhees commented 9 years ago

The whole point of this issue was that the current treatment with its 200 status code and additional headers is too implicit for end users, as well as most developers and libraries.

I'm begging you: can we please not serve timeouts / cut-off result sets with a 200 HTTP status code? Rather serve them with a 206 status code or some other self-invented 555 (server reached some limits, partial result only). Then add the headers on top of that, so one can find out what happened?

kidehen commented 9 years ago

On 12/13/14 9:17 AM, Jindřich Mynarz wrote:

OK, I see that if I run curl "http://dbpedia.org/sparql?query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fcsv" -o results.csv -D headers.txt, I find X-SPARQL-MaxRows: 10000 in the response headers. This is useful, but it doesn't tell if I have received partial results, because it may be the case that the total number of results is the same as ResultSetMaxRows.

It is indicating to you that you have a resultset size of 10000.

SELECT * FROM {Some-Table} in the "Closed World" SQL RDBMS world, or SELECT * WHERE {?s ?p ?o} in the "Open World" RDF RDBMS realm, both involve:

  1. query parsing
  2. solution preparation
  3. result set retrieval.

Steps 1-2 are the scope of the query timeout, while LIMIT is the resultset size, in regards to Virtuoso.

Basically, LIMIT indicates the maximum size of the resultset for retrieval. In a SQL RDBMS setup, you scroll through the resultset using "scrollable cursors" (which had modalities such as snapshot, static, keyset, dynamic, and mixed [keyset and dynamic]).

In order to tell if I received partial results I need to execute an additional query, which is my original query wrapped in SELECT (COUNT(*) AS ?count) WHERE { { ... } }:

curl"http://dbpedia.org/sparql?query=SELECT+%28COUNT%28*%29+AS+%3Fcount%29+WHERE+%7B+%7B+SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D+%7D+%7D&format=text%2Fcsv"

The response for this query tells me there are 943138267 results for my query in total. Given that I know this number I can compare it with the number from the X-SPARQL-MaxRows header and conclude that I have indeed received partial results.

As you can see, executing twice as many queries just to be sure one's not receiving partial results is hardly optimal from the developer's perspective.

Err... it is, in the context of what you are trying to emulate, i.e., a scrollable cursor. Even when doing this on the SQL side of things, the DBMS will create one of the following, each of which has costs:

  1. a keyset from all the keys in the tables of a query, built in advance
  2. a keyset created dynamically per cursor scroll
  3. a partial keyset that is replenished during scrolling.

I think a more developer-friendly solution might be to have a HTTP response header serving as a boolean flag indicating if results are partial or not, irrespective of the cause of incompleteness (e.g., ResultSetMaxRows or timeout query parameter).

This so-called developer cost burden isn't for Virtuoso to bear; it is for the developer, until SPARQL has some cursor-like mechanism specified. Right now, we could just always return false, since an "open world" query doesn't (theoretically) have a known complete solution, let alone a solution size.

Performance optimizations in Virtuoso enable you (the developer) to get your count returned quickly. Basically, to each client their heuristic for paging through data.

Kingsley
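One such client-side heuristic is plain LIMIT/OFFSET paging, keeping each page below the server's hard limit. A sketch (`run_query` is a hypothetical callable that executes a SPARQL string and returns a list of rows; note that without an ORDER BY, SPARQL guarantees no stable ordering between pages):

```python
def paged(query, run_query, page_size=5000):
    """Page through a SELECT query's solution with LIMIT/OFFSET,
    keeping each request under the server's hard result limit.
    run_query is a hypothetical callable: SPARQL string -> list of rows.
    Caveat: without a deterministic ORDER BY, pages may overlap or skip
    rows, since SPARQL does not guarantee a stable result order."""
    offset = 0
    while True:
        page = run_query(f"{query} LIMIT {page_size} OFFSET {offset}")
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # short page: we've reached the end of the solution
        offset += page_size
```

Each page still needs its own partial-result check (e.g., the X-SQL-State header), since any individual page can also hit the anytime-query timeout.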

IvanMikhailov commented 9 years ago

I'm sorry I've failed to push the idea of warnings into the SPARQL Protocol spec. "Anytime queries" did not exist at that moment, but I was sure that "OK" and "error" is too black-and-white for the real world. Now I don't know any good solution. X-this and X-that in headers are informative, but the method has a fatal flaw: one looks at those hidden texts no earlier than when a rude error appears.

jindrichmynarz commented 9 years ago

@kidehen: So your recommended solution to determine if a query has partial results is to execute an additional COUNT query?

I don't believe reporting that a response to a query has partial results has a significant cost for Virtuoso. This can be added just in the case that Virtuoso trims the result size (e.g., due to ResultSetMaxRows or timeout). No additional computation is needed.

BTW, SPARQL is usually (e.g., 1, 2) said to have semantics based on closed-world assumption.

@joernhees and @iv-an-ru: I agree that a non-HTTP-200 code would be nicer, but having some way to tell partial results (e.g., a custom header) is better than no way.

kidehen commented 9 years ago

On 12/15/14 6:15 AM, Jörn Hees wrote:

The whole point of this issue was that the current treatment with its 200 status code and additional headers is too implicit for end users, as well as most developers and libraries.

I'm begging you: can we please not serve timeouts / cut off result sets with a 200 http status code?

For a query solution that has a fixed resultset size, based on a hard limit, a 200 OK status is accurate. Now what we could consider is some override modality under which an instance owner sets alternative response codes for timeout being exceeded. What we can't do is just move away from 200 OK when:

  1. we have a resource for the "open world" query in question
  2. there are many HTTP clients that treat anything other than 200 OK as a fault.

Rather serve them with a 206 status code or some other self invented 555 (server reached some limits, partial result only).

We can make these configurable by the instance owner, should they not want to work with our defaults.

Then add the headers on top of that, so one can find out what happened?

Conditionally (by way of instance config), as indicated above.

kidehen commented 9 years ago

On 12/15/14 8:33 AM, Jindřich Mynarz wrote:

@kidehen https://github.com/kidehen: So your recommended solution to determine if a query has partial results is to execute an additional COUNT query?

I am saying to you that the issue of cursors is common, not new. It is solved by a spec having a notion of cursors or a developer implementing that client-side. What you don't do, as the server provider is implement that in a way that simply introduces performance overhead that isn't understood by clients.

I don't believe reporting that response to a query has partial results has a significant cost for Virtuoso.

You want us to tell you that the hard LIMIT is X out of a total of Y. And I am saying you can figure that out, as you alluded to, on the client side, in your code. That isn't a cost for the server to bear for your specific use-case scenario.

If you were running your own Virtuoso instance, you can opt to not have a hard limit in the INI, or set it to the max of 2 million.

This can be added just in case Virtuoso trims results size (e.g., due ResultSetMaxRows or timeout). No additional computation is needed.

"Anytime Query" is about query solution preparation and resultset retrieval within configurable time limits, and max resultset sizes.

BTW, SPARQL is usually (e.g., 1 http://web.ing.puc.cl/%7Emarenas/publications/pods11b.pdf, 2 http://ceur-ws.org/Vol-1272/paper_50.pdf) said to have semantics based on closed-world assumption.

Even if it did, you are seeking something that isn't offered by other DBMS engines, i.e., an ability to provide a complete response to `SELECT * WHERE {?s ?p ?o}`.

@joernhees https://github.com/joernhees and @iv-an-ru https://github.com/iv-an-ru: I agree that a non-HTTP 200 code would be nicer, but there having /some/ way how to tell partial results (e.g., custom header) is better than no way.

See my response to the HTTP 200 matter.

jindrichmynarz commented 9 years ago

I get that if SPARQL had cursors this would be solved differently.

What you don't do, as the server provider is implement that in a way that simply introduces performance overhead that isn't understood by clients.

Sorry, I have trouble parsing this sentence.

You want us to tell you that the hard LIMIT is X out of a total of Y.

No. This is already provided by the X-SPARQL-MaxRows header that you have recently introduced into Virtuoso. What I would like to know instead is if this hard limit was applied. If X-SPARQL-MaxRows = 10000 and I receive 10000 results, there's no way of telling if it's complete or partial result set without executing additional COUNT query. Is there another way I'm missing?

joernhees commented 9 years ago

I sense some anger in this discussion again. I think this is coming from different points of view rather than anyone attacking Virtuoso. You guys are doing an awesome job. So awesome that we developers come to you, as the de facto lead in public SPARQL endpoints, to give feedback and ask for things which would make our lives easier / reduce misunderstandings in development.

I guess all our feedback in this issue boils down to that we as developers want to be able to handle partial results better when communicating with virtuoso endpoints.

I think there are several dimensions to this, which are entangled in our discussion:

Thinking about all three i was reminded of a tiny but powerful rule from the python zen: "explicit is better than implicit".

What could this mean for the dimensions above (just as thoughts):

Chicken & egg problem:

You can read the first paragraph of this post another way: Because you're the defacto lead for public SPARQL endpoints, your defaults are pretty close to becoming the standard. If your default treatment of partial results is not informative for the closed world case, then it can never be for federated queries.

kidehen commented 9 years ago

On 12/15/14 9:42 AM, Jindřich Mynarz wrote:

I get that if SPARQL had cursors this would be solved differently.

What you don't do, as the server provider is implement
that in a way that simply introduces performance overhead that isn't
understood by clients.

Sorry, I have trouble parsing this sentence.

You want us to tell you that the hard LIMIT is X out of a total of Y.

I am telling you that we have the following distinct items:

  1. query solution
  2. query resultset retrieval.

They are not the same thing.

We set a limit from which you fetch the solution in batches. This is why we have the following INI excerpt, re. DBpedia:

[SPARQL]
ResultSetMaxRows           = 10000  ; Resultset size (the maximum number of
                                    ; items allowed per resultset retrieval,
                                    ; associated with a query solution; use
                                    ; this to page through the solution)
MaxQueryCostEstimationTime = 120    ; in seconds
MaxQueryExecutionTime      = 30     ; in seconds; time allowed to retrieve
                                    ; 10,000 items (for SELECT, CONSTRUCT,
                                    ; or DESCRIBE queries)
DefaultQuery               = select distinct * where {?s ?p ?o} limit 50
                                    ; default query presented by endpoint page

No. This is already provided by the X-SPARQL-MaxRows header that you have recently introduced into Virtuoso. What I would like to know instead is if this hard limit was applied. If X-SPARQL-MaxRows = 10000 and I receive 10000 results, there's no way of telling if it's complete or partial result set without executing additional COUNT query. Is there another way I'm missing?

You have to perform the additional count query because these heuristics are yours, not the DBMS's. Basically, Virtuoso will not do that for you, as it is an expensive operation that totally skews what it is doing. Can you point me to a DBMS that does that, available on the Web anywhere? Do you for one second think even Google's results pages contain N matches out of a total solution size?

In SQL, Scrollable Cursors are a feature. Net effect: they are distinct from basic operations, i.e., you don't conflate `select * from table` without cursors with the same query with cursors. APIs like ODBC enable you to fetch data with or without scrollable cursors.

Recap:

Partial condition arises when Virtuoso can't produce a complete solution within the timeouts outlined in the [SPARQL] INI section (stanza) outlined above, for a resultset of 10,000.

Query Solution Size != Query Results Retrieval Max Items Size, at least not in the case of Virtuoso.

Do you have an example of a DBMS product that offers what you are seeking? Maybe we can make more progress based on such an example.

jindrichmynarz commented 9 years ago

I am telling you that we have the following distinct items:

  1. query solution
  2. query resultset retrieval.

They are not the same thing.

I don't think I ever confused these two.

Basically, Virtuoso will not do that for you as it is an expensive operation that totally skews what it is doing.

I don't think we understand each other. Let me try to clarify. When Virtuoso trims the results set size to ResultSetMaxRows, it can as well add an additional header indicating the results set is trimmed. No additional computation is needed. You can hook this into the existing logic, which decides whether to trim results sets or not.

kidehen commented 9 years ago

On 12/15/14 11:18 AM, Jörn Hees wrote:

I sense some anger in this discussion again. I think this is coming from different points of view rather than anyone attacking Virtuoso. You guys are doing an awesome job. So awesome that we developers come to you as the defacto lead in public SPARQL endpoints to give feedback and ask for things which would make our lives easier / reduce misunderstandings in development.

I guess all our feedback in this issue boils down to that we as developers want to be able to handle partial results better when communicating with virtuoso endpoints.

I think there are several dimensions to this, which are entangled in our discussion:

  • scope (open / closed world?)
  • halting problem: can the server tell that a query was hitting its configured boundaries (i'm avoiding the word |LIMIT| so it's not confused with the SPARQL clause)? (timeout / max result size / ...?) Can it tell if any of the other endpoints it asks hit them?
  • form of presentation (visibility to the developer?)

Thinking about all three i was reminded of a tiny but powerful rule from the python zen: "explicit is better than implicit" https://www.python.org/dev/peps/pep-0020/.

Yes, explicit is better than implicit for sure. But we also have to understand the boundaries.

In the SQL realm, you would do one of the following:

  1. Use scrollable cursors -- the APIs differ per SQL RDBMS
  2. Use a generic API like ODBC or JDBC -- scrollable cursor implementations vary per driver (re. types supported and actual performance)
  3. Make your own cursoring -- this is how it was done pre-ODBC and pre-JDBC.

What could this mean for the dimensions above (just as thoughts):

  • scope: isn't it OK to treat a query as closed world assumption unless it is a federated query / asks for sponging? (So closed world until explicitly stated otherwise?)

Yes, but even if it's "closed world" you have the issue of data volume and access frequency to deal with, at Web scale.

  • halting problem:
    • if the scope was closed world, wouldn't the server know if it hits some configured boundary and could just tell us?

Yes, which is what it is doing. It tells you when it wasn't able to complete results retrieval based on the combination of the following factors:

  1. query cost estimation
  2. query solution production.

Thus, given:

[SPARQL]
ResultSetMaxRows           = 10000
MaxQueryCostEstimationTime = 120        ; in seconds
MaxQueryExecutionTime      = 30 ; in seconds

It will indicate a partial resultset return via HTTP if it couldn't prepare a resultset of 10,000 items within 30,000 msecs. What it isn't doing is first making a count of the solution (or solution set, for possible additional clarity) and then concluding that, because its retrieval threshold per resultset is 10,000 items, the result is a partial solution when it isn't.


  • (I'm not sure how complex this would be internally, but i guess whatever detects if a boundary is exceeded and stops execution could probably also add that information to the result.)

It is doing that.


  • Obviously an open world assumption is a different story, but shouldn't the server still be able to inform us when it hits its own boundaries / is waiting for a third party for too long / a third party maybe exceeded its boundaries? (chicken & egg problem)

Re. SPARQL-FED we should have the same thing re. timeouts which can affect all sorts of things e.g., unions of SERVICE based query patterns in a query. Ditto unions of SQL queries of SQL Tables attached to Virtuoso.

  • form of presentation:
    • If the client side query explicitly states the limit which is exceeded i guess a 200 status code with a partial result is ok.
    • If the server runs into limits the client query didn't explicitly state (e.g., some defaults, fairness of use, etc. limits) then the result should rather not be a 200 as it doesn't force developers to deal with them correctly.

Which is why we can improve things here by making 20X configurable by the instance owner. I say that because there are HTTP clients that could fault on 20X because they are coded for 200 OK only.


  • Still the partial content could be delivered, either as 206 or in the 5xx range...

Yes, if you configure your instance that way, when we add this feature to the [SPARQL] INI section.


  • In both cases (as @jindrichmynarz https://github.com/jindrichmynarz seems to suggests): Headers which explain and potentially reduce follow up queries "just to find out if a result was partial" / which boundaries were hit would be great.

Yes, but he isn't distinguishing the solution size from the resultset retrieval size, as implemented in Virtuoso. He would like delta existence and size to be determined by Virtuoso and then used as the basis for the notion of a "partial result" re., this "anytime query" feature.

Chicken & egg problem:

You can read the first paragraph of this post another way: Because you're the defacto lead for public SPARQL endpoints, your defaults are pretty close to becoming the standard. If your default treatment of partial results is not informative for the closed world case, then it can never be for federated queries.

Our short-term option is for these 20X responses to be configurable. In addition, we need folks to accept the fact that Virtuoso distinguishes:

  1. query solution
  2. query solution result set retrieval size -- i.e., you can retrieve all the items associated with a solution in batches (each batch has a max results retrieval size), not one go.

Another possibility, when we have the time, is publish a guide for emulating scrollable cursors via SPARQL i.e., provide the SPARQL client heuristic for dealing with massive data, using SPARQL, at Web Scale.
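In the meantime, the heuristic such a guide would describe can be sketched in a few lines. This is an illustrative sketch only, not Virtuoso client code: `run_query` is a placeholder for whatever HTTP client actually executes the query, and the page size should match the endpoint's ResultSetMaxRows.

```python
# Emulated scrollable cursor over SPARQL: page through a solution with
# LIMIT/OFFSET, stopping when a page comes back smaller than the page size.

def page_queries(base_query: str, page_size: int = 10000):
    """Yield successive SPARQL queries that page through `base_query`."""
    offset = 0
    while True:
        yield f"{base_query} LIMIT {page_size} OFFSET {offset}"
        offset += page_size

def fetch_all(base_query, run_query, page_size=10000):
    """Collect every row by paging. Assumes `base_query` carries a stable
    ORDER BY; without it, OFFSET-based paging can skip or repeat rows."""
    rows = []
    for query in page_queries(base_query, page_size):
        page = run_query(query)
        rows.extend(page)
        if len(page) < page_size:  # short page => solution exhausted
            break
    return rows
```

Note this is the crude OFFSET variant; the keyset approach described above (filtering on the last key seen instead of using OFFSET) scales much better on large solutions, since deep OFFSETs force the engine to re-walk skipped rows.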

kidehen commented 9 years ago

On 12/15/14 12:02 PM, Jindřich Mynarz wrote:

I am telling you that we have the following distinct items:

 1. query solution
 2. query resultset retrieval.

They are not the same thing.

I don't think I ever confused these two.

Basically, Virtuoso will not do that for you as it is an expensive
operation that totally skews what it is doing.

I don't think we understand each other. Let me try to clarify. When Virtuoso trims the results set size to |ResultSetMaxRows|, it can as well add an additional header indicating the results set is trimmed.

It doesn't TRIM the result set. It stops fetching data from addresses (internal to the engine) associated with the solution. The fact that you use the term TRIM indeed reveals the confusion. You trim from a physical whole. That isn't what's happening here.

No additional computation is needed. You can hook this into the existing logic, which decides whether to trim results sets or not.

I politely disagree with your assumptions.

I don't know if you have an experience with scrollable cursors in the realm of SQL. If not, it would help with this conversation. I know what you want, but you don't seem to be accepting the paradoxical nature of what you seek, from a DBMS perspective.

There is a reason why there are no live ad-hoc SQL RDBMS engines on the Web (bar ours [1]), for any client to query.

[1] http://demo.openlinksw.com/XMLAexplorer/XMLAexplorer.html -- example of an ad-hoc query service for SPARQL and SQL that's live on the Web.

Kingsley

jindrichmynarz commented 9 years ago

OK, I see I may have used confusing terms (e.g., "trimming"). @kidehen, thank you for pointing that out.

I never meant to imply that Virtuoso first counts the size of a query result set and then trims its size to the ResultSetMaxRows. What I asked about is that when Virtuoso reaches ResultSetMaxRows or timeout and it stops the query execution, it can make this explicit by e.g., adding an HTTP header indicating partial results set. This is what I meant when I said that no additional computation is needed.

kidehen commented 9 years ago

On 12/15/14 12:36 PM, Jindřich Mynarz wrote:

OK, I see I may have used confusing terms (e.g., "trimming"). @kidehen https://github.com/kidehen, thank you for pointing that out.

I never meant to imply that Virtuoso first counts the size of a query result set and then trims its size to the |ResultSetMaxRows|. What I asked about is that when Virtuoso reaches |ResultSetMaxRows| or |timeout| and it stops the query execution, it can make this explicit by e.g., adding an HTTP header indicating partial results set. This is what I meant when I said that no additional computation is needed.

Virtuoso doesn't stop query execution. The preparation of a solution takes seconds. It's the retrieval of the items associated with the solution that poses challenges re. transportation from DBMS to client.

We have to move the items associated with the solution from virtuoso's internal space to that of a virtuoso client. Our timeout condition arises when we haven't prepared the solution items for transportation (so to speak) via a conveyor that holds <= ResultSetMaxRows capacity.

In ODBC/JDBC (where these matters are handled with better clarity), query resultset fetching is distinct from query solution preparation. A client fetches resultset items from the DB server until there's nothing left. If using cursors, you build keysets of different kinds: all keys in the query's tables prepared prior to fetching, keys prepared dynamically prior to each fetch, or a partial keyset of fixed size that's only extended when exceeded during a fetch.

The bottom-line issue here is that we are paging (cursoring) through the items that constitute a query solution. This matter isn't as trivial as it might appear at first blush. Ultimately, we can make time to provide an example that outlines a heuristic for clients trying to work with this level of granularity.

The SPARQL Query Protocol, which is for all intents and purposes the ODBC/JDBC equivalent for SPARQL queries, is what's lacking here.

We are going to need Link: headers on both the client and the server to make this really work right, in a generic way, at Web scale. A client has to indicate to the server that it wants to work with a cursor, and the type of cursor should be negotiated b/w client and server; once negotiated, the keyset mechanism and size will be known, and retrieval of results can be much smarter.

If we are going to do scrollable cursors, it should be done right, even if this is via HTTP headers without enhancing the SPARQL Protocol directly. How about that?

joernhees commented 9 years ago

@kidehen i think all that @jindrichmynarz suggests is that virtuoso could add a header if (speaking in your terminology): no timeout condition arises but virtuoso has prepared the solution items for transportation (so to speak) via a conveyor that holds > ResultSetMaxRows capacity.

In that case the server knows without additional work: the client won't get all that's on the conveyor belt, so a somehow partial/truncated/limited/whateverword result.

The thing is that if i write a SPARQL Query with LIMIT 100 and 100 results are returned i know i probably should try to continue... (but i'm not sure)... With that header i could be sure that i need to if it's present... bad thing is that i can't be sure i don't need to if it's not present.

But the header would be even more meaningful in other cases: what if i don't specify a limit in my query? I don't see the ResultSetMaxRows set in the virtuoso.ini as a client (or do i?).

With that "ResultSetLimitHit" header i could know that there is maybe more.

Why maybe? (please correct me if this is wrong): I think you pointed this out before: the conveyor belt could be empty by coincidence when the next chunk isn't prepared yet, but the result set size limited by an explicit LIMIT clause or by ResultSetMaxRows is reached.

If that's how it works one could even think of two headers: "ResultSetLimitHit" / "ResultSetLimitExceeded" or with one header and two values: "ResultSetLimit: Hit" "ResultSetLimit: Exceeded".
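As a thought experiment, client handling of that proposal might look like the following. Everything here is hypothetical: the `ResultSetLimit` header with values `Hit`/`Exceeded` does not exist in Virtuoso today; the sketch only shows what an adapted client could do with it.

```python
# Sketch for the proposed "ResultSetLimit: Hit" / "ResultSetLimit: Exceeded"
# header. Maps the header value onto the client's next action.

def classify_result(headers: dict) -> str:
    """Classify a response under the proposed (hypothetical) header."""
    value = headers.get("ResultSetLimit")
    if value == "Exceeded":
        return "partial"        # more rows definitely remain: keep paging
    if value == "Hit":
        return "maybe-partial"  # limit reached exactly: a follow-up probe is needed
    return "complete"           # no header: the solution fit in one result set
```

The value of the two-valued variant is visible in the middle branch: "Hit" covers the coincidence case above, where the limit was reached but the server cannot (cheaply) know whether anything remains.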

kidehen commented 9 years ago

We can do two things here:

  1. add more headers
  2. add service parameters that indicate to Virtuoso the need to perform a count as part of the workload -- using this additional parameter prevents a costly heuristic from skewing the query solution and retrieval times.

Re. #1, this is closer to your message above.

I'll pick these items up with my development team.

rnavarropiris commented 7 years ago

@kidehen: I recently stumbled upon this issue when sending a query over the JDBC interface. However, according to the Virtuoso documentation this should only apply to the SPARQL web service:

[SPARQL]
The SPARQL section sets parameters and limits for SPARQL query protocol web service service.
This section should stay commented out as long as SPARQL is not in use.
Section RDF Data Access and Data Management contains detailed description of this functionality.

Is this the intended behaviour? Is there a way of bypassing this limit (e.g. ResultSetMaxRows=0 as in 'no limit')?

kanihal commented 6 years ago

any update on this? has the new header indicator for partial results been implemented?

VladimirAlexiev commented 5 years ago

For what it's worth: today, the original query returns just 3 results

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)

If you remove the distinct which you don't need:

select ?type (count(*) as ?c) where {
  ?s a ?type.
}
group by ?type
order by desc(?c)

you get a lot more results, and the count for dbo:Place is 881597.

If you ask only for dbo:Place, you get the same count:

select (count(*) as ?c) where {
  ?s a dbo:Place
}

Note: you get a different count for schema:Place but that's a matter of dbpedia mapping quality and dbpedia ontology choices, not of the SPARQL server.

kidehen commented 2 years ago

I'm begging you: can we please not serve timeouts / cut off result sets with a 200 http status code? Rather serve them with a 206 status code or some other self invented 555 (server reached some limits, partial result only). Then add the headers on top of that, so one can find out what happened?

We haven't gone for 206 because too many HTTP apps code for 200 OK; i.e., anything other than that is considered a failure.

If you want to ensure there are no partial solutions, subject to the hard timeout configured for the target instance, simply set your query timeout to 0 or any value less than 1000 msec.

For instances like DBpedia (available to a cocktail of user agents and query profiles), the "Fair Use" policy will kick in as an upper boundary in line with its global timeout setting (enforced server side).

kidehen commented 2 years ago

We could consider letting an instance owner configure their preferred response codes in relation to modalities associated with hard-timeouts (as demonstrated by DBpedia's 'Fair Use Policy') and "Anytime Query", etc.

jmkeil commented 2 years ago

Adding my key points from our discussion on Twitter, I think it would be the best to respond the results with HTTP status 200 in case of complete results and with some fixed, custom 3xx status in case of partial results due to time (Anytime Query) or size (MaxRows) constraints. This way, adapted SPARQL clients will be able to exploit the partial result, but other clients will not treat the partial result as complete result and all clients get what they expect. Making the status code for partial results configurable would prevent the implementation of generic, adapted clients making use of the partial result features.
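A fixed code (rather than a configurable one) is what makes a generic adapted client possible; roughly, such a client would reduce to something like this. The sketch is illustrative: `399` is purely a stand-in for whatever fixed custom 3xx code would be agreed on, not an actual proposal.

```python
# Sketch of a generic client under the fixed-status-code proposal:
# 200 => complete result, a fixed custom 3xx => usable-but-partial result,
# anything else => failure.

PARTIAL_STATUS = 399  # stand-in for the fixed custom 3xx code; hypothetical

def handle_response(status: int, rows: list):
    """Return (rows, complete_flag), or raise on unexpected status codes."""
    if status == 200:
        return rows, True   # complete: safe to treat as the full answer
    if status == PARTIAL_STATUS:
        return rows, False  # partial: adapted clients can still exploit it
    raise RuntimeError(f"SPARQL request failed with HTTP {status}")
```

An unadapted client coded for 200 OK would treat the partial case as a failure, which is the intended behaviour: it never silently mistakes a partial result for a complete one.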

Setting timeout=0 in the request seems not to cause a failure (a status >= 400 response) in case of partial results (see the little experiment on Twitter). Maybe this is a bug you might want to address in another issue?

Leaving out the timeout parameter seems to completely disable timeouts, including upper limits set in the server configuration (see the little experiment on Twitter). This also looks like a bug, which you might want to handle in yet another issue?

Further, maybe consider proposing an optional extension of the SPARQL protocol for partial results. There is already a proposal for an optional extension of the SPARQL protocol. Additional HTTP header fields could provide further help to

kidehen commented 2 years ago

Do these requests apply to DBpedia or Virtuoso?

If they apply to Virtuoso, note that you can actually alter the SPARQL endpoints behavior to suit whatever you desire, by writing a custom stored procedure that alters HTTP response codes, etc.

If they apply to DBpedia, then we are back to a deliberate "Fair Use" behavior and the following alternatives:

jmkeil commented 2 years ago

Do these requests apply to DBpedia or Virtuoso?

It applies to Virtuoso. DBpedia was just an example. Same issues occur on https://demo.openlinksw.com/. For example, the execution of SELECT ?s ?p ?o WHERE {?s ?p ?o} (with timeout=0 and without timeout parameter) returns partial result with status 200 and header field X-SPARQL-MaxRows: 10000.

If they apply to Virtuoso, note that you can actually alter the SPARQL endpoints behavior to suit whatever you desire, by writing a custom stored procedure that alters HTTP response codes, etc.

It does not matter here, if an instance administrator can do that. Clients can not do that, even if they would be aware of the issue. Virtuoso should by default conform to the HTTP and SPARQL protocol: If it does not exactly return what has been requested, it must not (as in RFC 2119) response with status 200.

imitko commented 2 years ago

@jmkeil

This behaviour is related to the config parameter ResultSetMaxRows, and since 206 (Partial Content) can only be used when a Range was requested, the 200 response is returned with X-SPARQL-MaxRows.

HTH

jmkeil commented 2 years ago

As I wrote on May 6 (above)

I think it would be the best to respond the results with HTTP status 200 in case of complete results and with some fixed, custom 3xx status in case of partial results due to time (Anytime Query) or size (MaxRows) constraints. This way, adapted SPARQL clients will be able to exploit the partial result, but other clients will not treat the partial result as complete result and all clients get what they expect. Making the status code for partial results configurable would prevent the implementation of generic, adapted clients making use of the partial result features.

kidehen commented 2 years ago

Here's an idea regarding this matter:

When “Anytime Query” is deemed enabled, the following occurs:

  1. HTTP Requests (comprising Range Header) sent to server
  2. HTTP 206 Response returned from server

Thoughts and comments welcome.