Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
I observed some slow execution time for SPARQL Federated queries using virtuoso-opensource (last tag v7.2.5.1).
I isolated the fact that even for a federated query without local patterns in the query, the execution time increases exponentially with the number of results obtained from the remote sparql endpoint.
I would expect the execution time to grow linearly with the number of received results, which is the case with e.g. blazegraph. See below:
Here is my simple test query:
SELECT * WHERE {
SERVICE <https://sparql.uniprot.org/sparql> {
SELECT * where {
?s ?p ?o .
}
LIMIT 1000
}
}
...and in this query, I am changing LIMIT from 1000 to 80000 to have a subquery that returns more or less results and I measure the execution time.
Here are the timings on your demo site (virtuoso, with a higher timeout than your default 20000ms):
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+1000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso1000.rdf
real 0m5.621s
user 0m0.004s
sys 0m0.012s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+10000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso10000.rdf
real 0m19.185s
user 0m0.012s
sys 0m0.034s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+20000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso20000.rdf
real 0m39.662s
user 0m0.009s
sys 0m0.027s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+30000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso30000.rdf
real 1m11.885s
user 0m0.007s
sys 0m0.037s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+40000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso40000.rdf
real 2m37.914s
user 0m0.007s
sys 0m0.038s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+50000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso50000.rdf
real 4m15.367s
user 0m0.015s
sys 0m0.040s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+60000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso60000.rdf
real 7m0.020s
user 0m0.015s
sys 0m0.039s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+70000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso70000.rdf
real 9m51.610s
user 0m0.012s
sys 0m0.063s
[user@server tmp]$ time curl -s "http://demo.openlinksw.com/sparql?default-graph-uri=&query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+80000%0D%0A++%7D%0D%0A%7D&should-sponge=&format=text%2Fhtml&timeout=1000000" > out_virtuoso80000.rdf
real 13m33.210s
user 0m0.018s
sys 0m0.060s
[user@server tmp]$
And here are the timings for the same queries on a BlazeGraph v2.1.4 triple store:
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+1000%0D%0A++%7D%0D%0A%7D" > out_blazegraph1000.rdf
real 0m2.716s
user 0m0.003s
sys 0m0.010s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+10000%0D%0A++%7D%0D%0A%7D" > out_blazegraph10000.rdf
real 0m5.225s
user 0m0.004s
sys 0m0.027s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+20000%0D%0A++%7D%0D%0A%7D" > out_blazegraph20000.rdf
real 0m7.300s
user 0m0.006s
sys 0m0.044s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+30000%0D%0A++%7D%0D%0A%7D" > out_blazegraph30000.rdf
real 0m8.141s
user 0m0.006s
sys 0m0.060s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+40000%0D%0A++%7D%0D%0A%7D" > out_blazegraph40000.rdf
real 0m11.507s
user 0m0.022s
sys 0m0.057s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+50000%0D%0A++%7D%0D%0A%7D" > out_blazegraph50000.rdf
real 0m11.548s
user 0m0.010s
sys 0m0.084s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+60000%0D%0A++%7D%0D%0A%7D" > out_blazegraph60000.rdf
real 0m14.368s
user 0m0.019s
sys 0m0.099s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+70000%0D%0A++%7D%0D%0A%7D" > out_blazegraph70000.rdf
real 0m14.043s
user 0m0.015s
sys 0m0.112s
[user@server tmp]$ time curl -s "http://rdf.geneontology.org/blazegraph/namespace/kb/sparql?query=select+*+where+%7B%0D%0A++SERVICE+%3Chttps%3A%2F%2Fsparql.uniprot.org%2Fsparql%3E+%7B%0D%0A++++select+*+where+%7B%0D%0A++++++%3Fs+%3Fp+%3Fo+.%0D%0A++++%7D%0D%0A++++LIMIT+80000%0D%0A++%7D%0D%0A%7D" > out_blazegraph80000.rdf
real 0m15.367s
user 0m0.027s
sys 0m0.138s
[user@server tmp]$
As you can see, getting 10000 remote results in virtuoso takes 19 seconds, but getting 8 times more results (80000) takes around 42 times longer (13 minutes and 33 seconds). For blazegraph 2.1.4, the timings are 5.2 and 15.3 seconds, respectively.
The above examples are on distant sparql endpoints, but trying the same query between two sparql endpoint running virtuoso (virtuoso-opensource v7.2.5.1) in the same data center gives similar timings, so it's not a network limitation issue it seems.
Dear virtuoso team,
I observed some slow execution time for SPARQL Federated queries using virtuoso-opensource (last tag v7.2.5.1).
I isolated the fact that even for a federated query without local patterns in the query, the execution time increases exponentially with the number of results obtained from the remote sparql endpoint.
I would expect the execution time to grow linearly with the number of received results, which is the case with e.g. blazegraph. See below:
Here is my simple test query:
...and in this query, I am changing
LIMIT
from1000
to80000
to have a subquery that returns more or less results and I measure the execution time.Here are the timings on your demo site (virtuoso, with a higher timeout than your default
20000ms
):And here are the timings for the same queries on a BlazeGraph v2.1.4 triple store:
As you can see, getting 10000 remote results in virtuoso takes 19 seconds, but getting 8 times more results (80000) takes around 42 times longer (13 minutes and 33 seconds). For blazegraph 2.1.4, the timings are 5.2 and 15.3 seconds, respectively.
The above examples are on distant sparql endpoints, but trying the same query between two sparql endpoint running virtuoso (virtuoso-opensource v7.2.5.1) in the same data center gives similar timings, so it's not a network limitation issue it seems.
Can you please have a look at this?