sparql: counts don't seem to be reliable

joernhees commented 10 years ago

I'm trying to get a top type count for DBpedia (Virtuoso version 07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50

returns (apart from other rows) this row: http://dbpedia.org/ontology/Place 89498

Out of curiosity, I checked this again with the query below:

select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }

tells me it's 754450

There's an order of magnitude difference in these 2 counts. Please tell me I'm doing it wrong.

PS: I tried the first query without the group by, order by and limit clauses; it doesn't make a difference.
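To make the comparison repeatable, both results could be fetched as application/sparql-results+json and checked programmatically. The sketch below covers only the comparison step. The helper names and the two sample documents are hypothetical (they just carry the numbers reported above), and the column name `count` in the second document is a placeholder, since the direct query does not name its result column.

```python
PLACE = "http://dbpedia.org/ontology/Place"


def grouped_count(doc, type_iri):
    """Return the ?c value bound for type_iri in the grouped result, or None."""
    for row in doc["results"]["bindings"]:
        if row["type"]["value"] == type_iri:
            return int(row["c"]["value"])
    return None


def single_count(doc):
    """Return the value of a one-row, one-column count result."""
    var = doc["head"]["vars"][0]  # do not rely on the server's column name
    return int(doc["results"]["bindings"][0][var]["value"])


# Hypothetical sample documents holding the two numbers from this report.
grouped = {
    "head": {"vars": ["type", "c"]},
    "results": {"bindings": [
        {"type": {"type": "uri", "value": PLACE},
         "c": {"type": "literal", "value": "89498"}},
    ]},
}
direct = {
    "head": {"vars": ["count"]},  # placeholder column name
    "results": {"bindings": [
        {"count": {"type": "literal", "value": "754450"}},
    ]},
}

counts_agree = grouped_count(grouped, PLACE) == single_count(direct)
print("counts agree:", counts_agree)
```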

jmkeil commented 2 years ago

> Here's an idea regarding this matter:
>
> When “Anytime Query” is deemed enabled, the following occurs:
>
> HTTP Requests (comprising Range Header) sent to server
>
> HTTP 206 Response returned from server
>
> Thoughts and comments welcome.

This will not work out for the following reasons:

- Clients will not send requests with a Range header, as they have no reason to do so.
- There is no way to convert a range of rows/bindings into a range of bytes without knowing the content beforehand.

If clients only want specific rows, they would use the SPARQL keywords LIMIT and OFFSET. The Range request header and 206 Partial Content are meant for partitioning responses at the byte level. An example use case is the resumption of large downloads. Partitioning at the byte level in a SPARQL context would result in sending incomplete rows and invalid responses.
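A small sketch of that last point, using a made-up three-row result in the SPARQL JSON results format: taking the first 100 bytes (what `Range: bytes=0-99` would select) cuts through a row and leaves an unparseable document, while a row-level cut (what LIMIT 2 would select) stays valid. The example.org IRIs and the 100-byte range are assumptions chosen for illustration.

```python
import json

# Hypothetical three-row result in the SPARQL JSON results format.
doc = {
    "head": {"vars": ["s"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": f"http://example.org/r{i}"}}
        for i in range(3)
    ]},
}
body = json.dumps(doc).encode("utf-8")


def is_valid_json(data: bytes) -> bool:
    """Return True if data decodes and parses as a complete JSON document."""
    try:
        json.loads(data.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False


# Byte-level partitioning: the first 100 bytes of the serialized response.
byte_slice = body[:100]

# Row-level partitioning: the first two complete rows, re-serialized.
limited = dict(doc, results={"bindings": doc["results"]["bindings"][:2]})
row_slice = json.dumps(limited).encode("utf-8")

print("byte slice parses:", is_valid_json(byte_slice))
print("row slice parses:", is_valid_json(row_slice))
```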

Once again:

I think it would be best to return results with HTTP status 200 when they are complete and with some fixed, custom 3xx status when they are partial due to time (Anytime Query) or size (MaxRows) constraints. This way, adapted SPARQL clients can exploit the partial result, other clients will not mistake a partial result for a complete one, and all clients get what they expect. Making the status code for partial results configurable would prevent the implementation of generic adapted clients that make use of the partial result features.
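A rough sketch of how clients could behave under this proposal. No concrete code has been named, so `PARTIAL_RESULT_STATUS = 399` is purely a placeholder, and `classify` and `Outcome` are hypothetical names. The sketch assumes a typical client treats any 2xx response as a usable, complete result and anything else as a failure.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    FAILED = "failed"


# Placeholder only: the proposal asks for "some fixed, custom 3xx status"
# and does not name a number.
PARTIAL_RESULT_STATUS = 399


def classify(status: int, partial_aware: bool) -> Outcome:
    """Map an HTTP status to what the client does with the response body."""
    if 200 <= status < 300:
        # Assumed generic behavior: every 2xx body is consumed as complete.
        return Outcome.COMPLETE
    if status == PARTIAL_RESULT_STATUS and partial_aware:
        # Only adapted clients opt in to using a partial result.
        return Outcome.PARTIAL
    return Outcome.FAILED


for status in (200, 206, PARTIAL_RESULT_STATUS, 500):
    print(status, classify(status, False).value, classify(status, True).value)
```

Under these assumptions a partial result sent with 206 is consumed as complete by every client, whereas one sent with the placeholder 3xx code fails in unadapted clients and is usable in adapted ones.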

Thoughts and comments welcome. I haven't seen any argument against this approach yet.

kidehen commented 2 years ago

There's a new note about changes we've made to the "Anytime Query" feature of Virtuoso.

Goal: To reduce confusion by offering more configuration options and use of HTTP response code.

Read: https://community.openlinksw.com/t/technology-update-virtuoso-anytime-query-functionality-for-query-scalability/3388

jmkeil commented 1 year ago

That is some progress, but it does not really solve the problems:

[ ] behavior is not standards-compliant with the default configuration
- This use of status 206 is not compliant with the standards:
  - the quoted phrase from RFC 9110 says that "a server might want to send [in response to a range request] only a subset of the data requested for reasons of its own"
  - partitioning implied by 206 is at the byte level, not at the row level
- please consider the use of some custom 3XX code or 413 (Content Too Large)
[ ] clients should not consume partial results due to anytime timeout without special handling
- SPARQL clients still consume partial results due to anytime timeout as complete results. For example, the following SERVICE request will not fail (checked with local Apache Fuseki 4.6.1 and demo.openlinksw.com/sparql (Virtuoso 08.03.3326)):
```
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?test1
WHERE {
SERVICE <https://dbpedia.org/sparql> {
  SELECT (SUM(xsd:integer(CONCAT(?v1,?v2,?v3,?v4,?v5,?v6,?v7,?v8,?v9))) AS ?test1)
  WHERE {
    VALUES ?v1 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v2 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v3 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v4 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v5 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v6 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v7 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v8 {0 1 2 3 4 5 6 7 8 9}
    VALUES ?v9 {0 1 2 3 4 5 6 7 8 9}
  }
}
}
```
  - configuring a more suitable status code for the anytime timeout will probably solve this, but this should be the default behavior
- check 1 (long-running query with timeout=0, as sent by the UI) on demo.openlinksw.com/sparql (Virtuoso 08.03.3326) or https://dbpedia.org/sparql (Virtuoso 08.03.3326) with Accept: application/sparql-results+json
  - status 400 and status 500
- check 2 (long-running query without timeout parameter, as sent by arbitrary clients) on demo.openlinksw.com/sparql (Virtuoso 08.03.3326) or https://dbpedia.org/sparql (Virtuoso 08.03.3326) with Accept: application/sparql-results+json
  - status 400 and status 206
[ ] clients should not consume partial results due to max rows without special handling
- behavior is unchanged: partial results due to max rows are still returned with status 200
- check 3 (query exceeding max rows with timeout=0, as sent by the UI; I don't expect the timeout to have any effect here) on demo.openlinksw.com/sparql (Virtuoso 08.03.3326) or https://dbpedia.org/sparql (Virtuoso 08.03.3326) with Accept: application/sparql-results+json
  - still partial results with status 200
- check 4 (query exceeding max rows without timeout parameter, as sent by arbitrary clients; I don't expect the timeout to have any effect here) on demo.openlinksw.com/sparql (Virtuoso 08.03.3326) or https://dbpedia.org/sparql (Virtuoso 08.03.3326) with Accept: application/sparql-results+json
  - still partial results with status 200
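
For anyone who wants to repeat checks 1 and 2, a minimal sketch of how such a request could be assembled. It assumes the endpoint takes the query in the standard `query` parameter and the optional `timeout` parameter mentioned in the checks; `build_check_url` and `fetch_status` are hypothetical helper names, and the query is the nested SELECT from the SERVICE example above. The queries used for checks 3 and 4 are not covered here.

```python
from urllib.error import HTTPError
from urllib.parse import parse_qs, quote, urlencode, urlsplit
from urllib.request import Request, urlopen

# Rebuild the long-running query from the SERVICE example above.
values = "\n".join(
    f"  VALUES ?v{i} {{0 1 2 3 4 5 6 7 8 9}}" for i in range(1, 10)
)
concat_args = ",".join(f"?v{i}" for i in range(1, 10))
QUERY = (
    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
    f"SELECT (SUM(xsd:integer(CONCAT({concat_args}))) AS ?test1)\n"
    "WHERE {\n" + values + "\n}"
)


def build_check_url(endpoint, query, timeout=None):
    """Return a GET URL for the query; timeout is added only when given."""
    params = {"query": query}
    if timeout is not None:
        params["timeout"] = str(timeout)
    return endpoint + "?" + urlencode(params, quote_via=quote)


def fetch_status(url):
    """Send the request and return the HTTP status code (needs network)."""
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    try:
        with urlopen(request) as response:
            return response.status
    except HTTPError as error:  # urllib raises for 4xx and 5xx responses
        return error.code


check1 = build_check_url("https://dbpedia.org/sparql", QUERY, timeout=0)
check2 = build_check_url("https://dbpedia.org/sparql", QUERY)
print(check1)
```

`fetch_status` is defined but deliberately not called here, since the status codes returned depend on the server version and its configuration.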

TallTed commented 1 year ago

Thank you for the detailed update, @jmkeil. I'll leave a detailed response to others.

Please do be aware that when providing Virtuoso version info, it's important to include the git_head value which pins down the exact code from which a running binary was built. Without this value, the binary may have been built from any codepoint over weeks or even months. You can get the git_head with the SPARQL query on this page or from the footer of recent versions of /sparql, /fct, or various other hosted applications, as can today be seen on demo.openlinksw.com/sparql (9566ba38b4) and dbpedia.org/sparql (5ca4dd4f09).

openlink / virtuoso-opensource

sparql: counts don't seem to be reliable #112