openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com
Other
854 stars 211 forks source link

Simple sparql federated/service queries are exponentially slower with larger result size #828

Open tlomb opened 5 years ago

tlomb commented 5 years ago

Dear virtuoso team,

I observed some slow execution time for SPARQL Federated queries using virtuoso-opensource (last tag v7.2.5.1).

I isolated the fact that even for a federated query without local patterns in the query, the execution time increases exponentially with the number of results obtained from the remote sparql endpoint.

I would expect the execution time to grow linearly with the number of received results, which is the case with e.g. blazegraph. See below:

Here is my simple test query:

SELECT * WHERE {
    SERVICE <https://sparql.uniprot.org/sparql> {
      SELECT * where {
        ?s ?p ?o .
      }
      LIMIT 1000
    }
}

...and in this query, I am changing LIMIT from 1000 to 80000 to have a subquery that returns more or less results and I measure the execution time.


Here are the timings on your demo site (virtuoso, with a higher timeout than your default 20000ms):

[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+1000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso1000.rdf

real    0m5.621s
user    0m0.004s
sys 0m0.012s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+10000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso10000.rdf

real    0m19.185s
user    0m0.012s
sys 0m0.034s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+20000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso20000.rdf

real    0m39.662s
user    0m0.009s
sys 0m0.027s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+30000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso30000.rdf

real    1m11.885s
user    0m0.007s
sys 0m0.037s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+40000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso40000.rdf

real    2m37.914s
user    0m0.007s
sys 0m0.038s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+50000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso50000.rdf

real    4m15.367s
user    0m0.015s
sys 0m0.040s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+60000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso60000.rdf

real    7m0.020s
user    0m0.015s
sys 0m0.039s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+70000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso70000.rdf

real    9m51.610s
user    0m0.012s
sys 0m0.063s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+80000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso80000.rdf

real    13m33.210s
user    0m0.018s
sys 0m0.060s
[user@server tmp]$ 

And here are the timings for the same queries on a BlazeGraph v2.1.4 triple store:

[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+1000%0D%0A++%7D%0D%0A%7D" > out_blazegraph1000.rdf

real    0m2.716s
user    0m0.003s
sys 0m0.010s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+10000%0D%0A++%7D%0D%0A%7D" > out_blazegraph10000.rdf

real    0m5.225s
user    0m0.004s
sys 0m0.027s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+20000%0D%0A++%7D%0D%0A%7D" > out_blazegraph20000.rdf

real    0m7.300s
user    0m0.006s
sys 0m0.044s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+30000%0D%0A++%7D%0D%0A%7D" > out_blazegraph30000.rdf

real    0m8.141s
user    0m0.006s
sys 0m0.060s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+40000%0D%0A++%7D%0D%0A%7D" > out_blazegraph40000.rdf

real    0m11.507s
user    0m0.022s
sys 0m0.057s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+50000%0D%0A++%7D%0D%0A%7D" > out_blazegraph50000.rdf

real    0m11.548s
user    0m0.010s
sys 0m0.084s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+60000%0D%0A++%7D%0D%0A%7D" > out_blazegraph60000.rdf

real    0m14.368s
user    0m0.019s
sys 0m0.099s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+70000%0D%0A++%7D%0D%0A%7D" > out_blazegraph70000.rdf

real    0m14.043s
user    0m0.015s
sys 0m0.112s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+80000%0D%0A++%7D%0D%0A%7D" > out_blazegraph80000.rdf

real    0m15.367s
user    0m0.027s
sys 0m0.138s
[user@server tmp]$ 

As you can see, getting 10000 remote results in virtuoso takes 19 seconds, but getting 8 times more results (80000) takes around 42 times longer (13 minutes and 33 seconds). For blazegraph 2.1.4, the timings are 5.2 and 15.3 seconds, respectively.

The above examples are on distant sparql endpoints, but trying the same query between two sparql endpoint running virtuoso (virtuoso-opensource v7.2.5.1) in the same data center gives similar timings, so it's not a network limitation issue it seems.

Can you please have a look at this?

tlomb commented 5 years ago

@HughWilliams @TallTed @pkleef : Did one of you had time to look at the above issue?