openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

sparql: counts don't seem to be reliable #112

Open joernhees opened 10 years ago

joernhees commented 10 years ago

I'm trying to get a top type count for DBpedia (Virtuoso version 07.00.3207 on Linux (x86_64-redhat-linux-gnu), Single Server Edition):

select ?type count(distinct ?s) as ?c where {
  ?s a ?type.
}
group by ?type
order by desc(?c)
limit 50

returns (among other rows) this row: http://dbpedia.org/ontology/Place 89498

Out of curiosity I checked this again with the query below:

select count(distinct ?s) where { ?s a <http://dbpedia.org/ontology/Place> }

tells me it's 754450.

There's an order of magnitude difference in these 2 counts. Please tell me I'm doing it wrong.

PS: I tried the first query without the group by, order by, and limit clauses; it doesn't make a difference.

jmkeil commented 2 years ago

Here's an idea regarding this matter:

When “Anytime Query” is deemed enabled, the following occurs:

  1. HTTP Requests (comprising Range Header) sent to server
  2. HTTP 206 Response returned from server

Thoughts and comments welcome.

This will not work out for the following reasons:

  1. Clients will not send requests with a Range header, as they have no reason to do so.
  2. There is no way to convert a range of rows/bindings into a range of bytes without knowing the content beforehand.

If clients want only specific rows, they would use the SPARQL keywords LIMIT and OFFSET. The Range request header and 206 Partial Content are meant for partitioning responses at the byte level; an example use case is resuming large downloads. Partitioning at the byte level in a SPARQL context would result in sending incomplete rows and invalid responses.
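
To illustrate the difference, here is a minimal Python sketch. It is not Virtuoso code: the sample bindings and the helper names slice_rows, slice_bytes, and is_valid_json are made up for this example. It serializes a small result set in the SPARQL JSON results layout and compares a row-level slice (what LIMIT and OFFSET give you) with a byte-level slice (what a Range request would give you).

```python
import json

# Hypothetical result set in the SPARQL 1.1 Query Results JSON layout.
RESULT = {
    "head": {"vars": ["s"]},
    "results": {
        "bindings": [
            {"s": {"type": "uri", "value": f"http://example.org/r{i}"}}
            for i in range(5)
        ]
    },
}


def slice_rows(result, offset, limit):
    """Row-level slice, analogous to OFFSET/LIMIT: rows are kept whole."""
    rows = result["results"]["bindings"][offset:offset + limit]
    return json.dumps({"head": result["head"], "results": {"bindings": rows}})


def slice_bytes(result, start, end):
    """Byte-level slice, analogous to 'Range: bytes=start-end' (end inclusive)."""
    return json.dumps(result).encode("utf-8")[start:end + 1]


def is_valid_json(payload):
    """Return True if the payload parses as a JSON document."""
    try:
        json.loads(payload)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    print(is_valid_json(slice_rows(RESULT, 1, 2)))    # whole rows only
    print(is_valid_json(slice_bytes(RESULT, 0, 99)))  # cut mid-document
```

The row-level slice is a self-contained result document, while the byte-level slice stops somewhere in the middle of the serialization and cannot be parsed on its own.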

Once again:

I think it would be best to return results with HTTP status 200 when they are complete and with some fixed, custom 3xx status when they are partial due to time (Anytime Query) or size (MaxRows) constraints. This way, adapted SPARQL clients will be able to exploit the partial result, other clients will not mistake a partial result for a complete one, and all clients get what they expect. Making the status code for partial results configurable would prevent the implementation of generic, adapted clients that make use of the partial-result features.
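
A rough Python sketch of how an adapted client could handle this, under loud assumptions: the proposal names no concrete number, so PARTIAL_RESULT_STATUS = 399 is purely a placeholder, and QueryOutcome and interpret_response are hypothetical names, not part of Virtuoso or any SPARQL library.

```python
from dataclasses import dataclass

# Placeholder only: the proposal says "some fixed, custom 3xx status"
# without naming a number. 399 is an assumption made for this sketch.
PARTIAL_RESULT_STATUS = 399


@dataclass
class QueryOutcome:
    rows: list
    complete: bool


def interpret_response(status, rows):
    """Map an HTTP status to a result that records whether it is complete."""
    if status == 200:
        return QueryOutcome(rows=rows, complete=True)
    if status == PARTIAL_RESULT_STATUS:
        # An adapted client keeps the rows but marks them as partial.
        return QueryOutcome(rows=rows, complete=False)
    # A client that does not know the partial status ends up here for it too,
    # so it never mistakes a partial result for a complete one.
    raise RuntimeError(f"unexpected HTTP status {status}")
```

Because the status is fixed, the same check works against any endpoint; with a configurable code, the client would need per-server configuration.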

Thoughts and comments welcome. I haven't seen any argument against this approach yet.

kidehen commented 2 years ago

There's a new note about changes we've made to the "Anytime Query" feature of Virtuoso.

Goal: To reduce confusion by offering more configuration options and use of HTTP response code.

Read: https://community.openlinksw.com/t/technology-update-virtuoso-anytime-query-functionality-for-query-scalability/3388

jmkeil commented 1 year ago

That is little progress, but it does not really solve the problems:

TallTed commented 1 year ago

Thank you for the detailed update, @jmkeil. I'll leave a detailed response to others.

Please do be aware that when providing Virtuoso version info, it's important to include the git_head value which pins down the exact code from which a running binary was built. Without this value, the binary may have been built from any codepoint over weeks or even months. You can get the git_head with the SPARQL query on this page or from the footer of recent versions of /sparql, /fct, or various other hosted applications, as can today be seen on demo.openlinksw.com/sparql (9566ba38b4) and dbpedia.org/sparql (5ca4dd4f09).