openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
858 stars 210 forks source link

Inconsistent anytime response headers behaviour on SPARQL endpoint #1034

Open serasset opened 2 years ago

serasset commented 2 years ago

We are trying to query a dataset while taking care of anytime feature (we indeed want to get partial results rather than errors in case we hit some limits).

Queries are executed on the same virtuoso instance at http://kaiko.getalp.org/sparql

Q1:

SELECT ?lemma, COUNT(?written_trans) AS ?nb_translations
WHERE {
    ?lexentry ontolex:canonicalForm [
        ontolex:writtenRep ?lemma 
    ] .
    FILTER (lang(?lemma) = 'en')
    ?trans dbnary:isTranslationOf ?lexentry ;
        dbnary:writtenForm ?written_trans .
} LIMIT 1000

Q2:

SELECT distinct ?A ?p ?B 
WHERE {
    ?a a/rdfs:subClassOf* ?A. 
    ?a ?p ?b. 
    OPTIONAL { ?b rdfs:subClassOf ?B }}

Without timeout value specified

curl -I 0.00s user 0.01s system 0% cpu 22.525 total

- Q2 times out after 60 seconds

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=select+distinct+%3FA+%3Fp+%3FB+where+%7B%3Fa+a%2Frdfs%3AsubClassOf*+%3FA.+%3Fa+%3Fp+%3Fb.+OPTIONAL+%7B+%3Fb+rdfs%3AsubClassOf+%3FB+%7D%7D&should-sponge=&format=text%2Fhtml&timeout=&debug=on&run=+Run+Query+' HTTP/1.1 500 SPARQL Request Failed Date: Wed, 27 Apr 2022 11:44:09 GMT Server: Virtuoso/07.20.3230 (Linux) x86_64-pc-linux-gnu Accept-Ranges: bytes Content-Type: text/plain; charset=UTF-8 Content-Length: 168 Connection: close

curl -I 0.00s user 0.01s system 0% cpu 1:01.65 total


So far so good.

**With timeout value specified**

- Q1 indeeds hits the timeout of 1 second and give back the expected `X-SQL-State: S1TAT` with additional headers.

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=SELECT+%3Flemma%2C+COUNT%28%3Fwritten_trans%29+AS+%3Fnb_translations%0D%0AWHERE+%7B%0D%0A++++%3Flexentry+ontolex%3AcanonicalForm+%5B%0D%0A++++++++ontolex%3AwrittenRep+%3Flemma+%0D%0A++++%5D+.%0D%0A++++FILTER+%28lang%28%3Flemma%29+%3D+%27en%27%29%0D%0A++++%3Ftrans+dbnary%3AisTranslationOf+%3Flexentry+%3B%0D%0A++++++++dbnary%3AwrittenForm+%3Fwritten_trans+.%0D%0A%7D+LIMIT+1000%0D%0A&should-sponge=&format=text%2Fhtml&timeout=1001&debug=on&run=+Run+Query+' HTTP/1.1 200 OK Date: Wed, 27 Apr 2022 10:23:53 GMT Server: Virtuoso/07.20.3230 (Linux) x86_64-pc-linux-gnu Accept-Ranges: bytes X-SQL-State: S1TAT X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 4.649M rnd 2.042M seq 2.227M same seg 1.086M same pg 119.5K same par 0 disk 0 spec disk 0B / 0 X-Exec-Milliseconds: 3142 X-Exec-DB-Activity: 4.649M rnd 2.042M seq 2.227M same seg 1.086M same pg 119.5K same par 0 disk 0 spec disk 0B / 0 messages 0 fork Content-Type: text/html; charset=UTF-8 Content-Length: 81684

- However, Q2 also hits the timeout and tries to return partial data (as expected by anytime query behaviour), **but it does not give any response header stating so.**

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=select+distinct+%3FA+%3Fp+%3FB+where+%7B%3Fa+a%2Frdfs%3AsubClassOf*+%3FA.+%3Fa+%3Fp+%3Fb.+OPTIONAL+%7B+%3Fb+rdfs%3AsubClassOf+%3FB+%7D%7D&should-sponge=&format=text%2Fhtml&timeout=1001&debug=on&run=+Run+Query+' HTTP/1.1 200 OK Date: Wed, 27 Apr 2022 10:37:19 GMT Server: Virtuoso/07.20.3230 (Linux) x86_64-pc-linux-gnu Accept-Ranges: bytes Content-Type: text/html; charset=UTF-8 Content-Length: 192742

curl -I 0.00s user 0.01s system 0% cpu 3.217 total



**Expected behaviour**: both query hit timeout in anytime query settings and should both return `X-SQL-State: S1TAT` with additional headers.
TallTed commented 2 years ago

First thing, I see that your endpoint is running version 07.20.3230 (a/k/a 7.2.6.1), built Sep 27, 2020 from git_head 840b468fc, committed on Sep 1, 2020, probably from the stable/7 branch.

I would suggest you test against a current build from the develop/7 branch (which we will be shipping as/to stable/7 in the next few weeks). If we can quickly confirm this behavior in the latest develop/7 branch, we may be able to get a patch in before that stable/7 push.

serasset commented 2 years ago

I recompiled virtuoso on my debian machine with :

./configure --with-layout=debian --with-readline --enable-dbpedia-vad --enable-fct-vad --enable-rdfmappers-vad --with-port=2222
make
sudo make install

Then an access to http://kaiko.getalp.org/sparql led to Virtuoso 42001 Error SR185: Undefined procedure WS.WS.SPARQL_ENDPOINT_GENERATE_FORM

I regenarated the whole db from scratch using the newly compiled virtuoso, same problem. Any idea ?

pkleef commented 2 years ago

When switching from stable/7 to develop/7 some of the build environment needs to be regenerated.

Please run the following command:

$ bash autogen.sh

Followed by the following commands to rebuild and install VOS:

$ make

$ sudo make install
serasset commented 2 years ago

Yes, I did that, but the distclean seams to keep some files. So I fetched a clean git clone and the compilation went well. I'll try the queries now.

serasset commented 2 years ago

With Virtuoso version 07.20.3233 on Linux (x86_64-pc-linux-gnu) Single Server Edition (142 GB total memory, 17 GB memory in use) :

Theoretical results (obtained through isql with no timeout) : Q1:

Through SPARQL endpoint without timeout Q1:

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=SELECT+%3Flemma%2C+COUNT%28%3Fwritten_trans%29+AS+%3Fnb_translations%0D%0AWHERE+%7B%0D%0A++++%3Flexentry+ontolex%3AcanonicalForm+%5B%0D%0A++++++++ontolex%3AwrittenRep+%3Flemma+%0D%0A++++%5D+.%0D%0A++++FILTER+%28lang%28%3Flemma%29+%3D+%27en%27%29%0D%0A++++%3Ftrans+dbnary%3AisTranslationOf+%3Flexentry+%3B%0D%0A++++++++dbnary%3AwrittenForm+%3Fwritten_trans+.%0D%0A%7D+LIMIT+1000%0D%0A&should-sponge=&format=text%2Fhtml&timeout=0&debug=on&run=+Run+Query+'

HTTP/1.1 200 OK
Date: Thu, 28 Apr 2022 06:11:47 GMT
Server: Virtuoso/07.20.3233 (Linux) x86_64-pc-linux-gnu
Accept-Ranges: bytes
Content-disposition: filename=sparql_2022-04-28_06-11-47Z.html
Content-Type: text/html; charset=UTF-8
Content-Length: 82941

curl -I   0.00s user 0.02s system 0% cpu 27.210 total

This is analoguous to the previous result and expected.

Q2:

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=select+distinct+%3FA+%3Fp+%3FB+where+%7B%3Fa+a%2Frdfs%3AsubClassOf*+%3FA.+%3Fa+%3Fp+%3Fb.+OPTIONAL+%7B+%3Fb+rdfs%3AsubClassOf+%3FB+%7D%7D&should-sponge=&format=text%2Fhtml&timeout=&debug=on&run=+Run+Query+'

HTTP/1.1 200 OK
Date: Thu, 28 Apr 2022 06:13:14 GMT
Server: Virtuoso/07.20.3233 (Linux) x86_64-pc-linux-gnu
Accept-Ranges: bytes
Content-disposition: filename=sparql_2022-04-28_06-13-14Z.html
Content-Type: text/html; charset=UTF-8
Content-Length: 314378

curl -I   0.00s user 0.01s system 0% cpu 1:00.34 total

Here the default server timeout is hit, but now the request does not fail anymore but returns partial results (~850 rows) without any adequate response headers. This is unexpected in this version while it was failing (expectedly) in previous version.

Through SPARQL endpoint with timeout Q1:

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=SELECT+%3Flemma%2C+COUNT%28%3Fwritten_trans%29+AS+%3Fnb_translations%0D%0AWHERE+%7B%0D%0A++++%3Flexentry+ontolex%3AcanonicalForm+%5B%0D%0A++++++++ontolex%3AwrittenRep+%3Flemma+%0D%0A++++%5D+.%0D%0A++++FILTER+%28lang%28%3Flemma%29+%3D+%27en%27%29%0D%0A++++%3Ftrans+dbnary%3AisTranslationOf+%3Flexentry+%3B%0D%0A++++++++dbnary%3AwrittenForm+%3Fwritten_trans+.%0D%0A%7D+LIMIT+1000%0D%0A&should-sponge=&format=text%2Fhtml&timeout=1001&debug=on&run=+Run+Query+'

HTTP/1.1 200 OK
Date: Thu, 28 Apr 2022 06:22:59 GMT
Server: Virtuoso/07.20.3233 (Linux) x86_64-pc-linux-gnu
Accept-Ranges: bytes
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout.  Activity:  4.156M rnd  1.426M seq  1.616M same seg   1.183M same pg  110.8K same par      0 disk      0 spec disk      0B /      0
X-Exec-Milliseconds: 2981
X-Exec-DB-Activity: 4.156M rnd  1.426M seq  1.616M same seg   1.183M same pg  110.8K same par      0 disk      0 spec disk      0B /      0 messages      0 fork
Content-disposition: filename=sparql_2022-04-28_06-22-59Z.html
Content-Type: text/html; charset=UTF-8
Content-Length: 82566

curl -I   0.00s user 0.01s system 0% cpu 3.080 total

This is analoguous to the previous result and expected. Partial results returned and response headers given back stating so.

Q2:

time curl -I 'http://kaiko.getalp.org/sparql?default-graph-uri=&query=select+distinct+%3FA+%3Fp+%3FB+where+%7B%3Fa+a%2Frdfs%3AsubClassOf*+%3FA.+%3Fa+%3Fp+%3Fb.+OPTIONAL+%7B+%3Fb+rdfs%3AsubClassOf+%3FB+%7D%7D&should-sponge=&format=text%2Fhtml&timeout=1001&debug=on&run=+Run+Query+'

HTTP/1.1 200 OK
Date: Thu, 28 Apr 2022 06:23:25 GMT
Server: Virtuoso/07.20.3233 (Linux) x86_64-pc-linux-gnu
Accept-Ranges: bytes
Content-disposition: filename=sparql_2022-04-28_06-23-25Z.html
Content-Type: text/html; charset=UTF-8
Content-Length: 72651

curl -I   0.00s user 0.01s system 1% cpu 1.709 total

This is unexpected as the results are indeed partial after hitting the timeout (~217 rows) but no headers is given to state so.

pkleef commented 2 years ago

We will attempt to recreate this issue in house and see what is going on.

serasset commented 2 years ago

If you need to recreate the exact dbnary server settings, all dataset is available at http://kaiko.getalp.org/static/ontolex/latest/ and the database is created using script available at https://gitlab.com/gilles.serasset/dbnary/-/blob/master/kaiko/virtuoso/rotate-virtuoso-db.sh

(Sorry, I do not have a docker setting available (yet).)