openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

Virtuoso imposes a limit of 2^20 = 1048576 results on HTTP response #700

Open saleem-muhammad opened 6 years ago

saleem-muhammad commented 6 years ago

This issue might not be new. I am running queries with large result sets against the Virtuoso SPARQL endpoint from my Java application. I have noticed that the Virtuoso endpoint does not return more than 1048576 results in an HTTP response, no matter how large ResultSetMaxRows is set in the configuration file. Is there a way to remove this limit? I have also noticed that ISQL does not have such a limit.

HughWilliams commented 6 years ago

1048576 is a known limit on the size of a Virtuoso result set, thus using limit and offset is the way to go if you really need this many results.
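A minimal sketch of the LIMIT/OFFSET approach suggested above. It assumes the caller supplies a `run_query` function (hypothetical here) that sends one SPARQL query to the endpoint and returns its rows as a list; the names `paged_query`, `fetch_all`, and `page_size` are illustrative and not part of Virtuoso. Paging is only reliable if the base query has a deterministic ORDER BY and does not already carry its own LIMIT or OFFSET.

```python
from typing import Callable, Iterator, List


def paged_query(base_query: str, page_size: int, offset: int) -> str:
    """Append LIMIT/OFFSET to a SPARQL query that has neither clause yet."""
    return f"{base_query.rstrip()}\nLIMIT {page_size} OFFSET {offset}"


def fetch_all(
    run_query: Callable[[str], List],  # hypothetical: executes one query, returns rows
    base_query: str,
    page_size: int = 100_000,
) -> Iterator:
    """Yield every row by requesting successive pages below the result-set limit."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        rows = run_query(paged_query(base_query, page_size, offset))
        yield from rows
        # A short (or empty) page means the result set is exhausted.
        if len(rows) < page_size:
            break
        offset += page_size
```

Keeping `page_size` below 1048576 means each individual response stays within the limit discussed here, at the cost of one request per page.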

saleem-muhammad commented 6 years ago

Thanks for the reply. The problem is that I want to benchmark engines using queries with large result sets (more than 1 million results). A single query therefore cannot be split by adding LIMIT and OFFSET. The queries come from a standard benchmark, and adding OFFSET and LIMIT would defeat the purpose of the benchmarking.

kidehen commented 6 years ago

@saleem-muhammad,

You are dealing with an HTTP limitation. You can use ODBC or JDBC connections to Virtuoso that execute SPARQL queries too. The problem is that a document comprising a solution of 1 million+ tuples over HTTP is not the norm for any benchmark.

You can perform a variety of relational operations over databases of various sizes, but the solutions themselves do not amount to a dump of 1 million plus tuples (be it records in a table or statement graphs). Anyway, if you want to retrieve data progressively over HTTP, which is what we offer, then you have OFFSET and LIMIT; otherwise, you can push this all through alternative protocols like ODBC or JDBC.

I hope this helps.

saleem-muhammad commented 6 years ago

@kidehen Thanks indeed, it worked over JDBC. However, I think this is not an HTTP limitation, as I am able to retrieve more than one million results from other triple stores over HTTP. The examples given at http://vos.openlinksw.com/owiki/wiki/VOS/VOSDownload are very helpful. The only problem I can see now is that other triple stores would also have to support JDBC or ODBC connections; otherwise, the comparison would not be fair. In addition, this limitation makes experiments with large-result queries (millions of results) over live Virtuoso SPARQL endpoints, or with SPARQL federation engines, tricky.

pkleef commented 6 years ago

As a temporary workaround, edit the file

libsrc/Wi/sparql_io.sql

around line 3236 you will find the following

 maxrows := 1024*1024; -- More than enough for web-interface.

Change this to

maxrows := 10*1024*1024; -- More than enough for web-interface.

and recompile.

I am working on a permanent fix which will be committed to VOS probably tomorrow.

kidehen commented 6 years ago

@saleem-muhammad,

To be clear, I should have stated this was a Virtuoso HTTP interface limitation. Anyway, as per comment by @pkleef, the arbitrary limit can be increased.

iv-an-ru commented 6 years ago

Patrick,

I'm afraid that maxrows := 10*1024*1024; -- More than enough for web-interface. is not safe, as the box length must be encoded in 3 bytes, so the limit is 16*1024*1024 bytes. This gives a safe margin of maxrows := ((16*1024*1024)/4)-2; on 32-bit builds (4 bytes per pointer), and 32-bit builds are possible only for VOS versions as old as 5.x. For 64-bit builds it is even smaller: maxrows := ((16*1024*1024)/8)-2;

This is reflected in Dk/Dkbox.h, lines ~56--57, depending on version,

#define MAX_BOX_LENGTH ((size_t)0xFFFFFF)

#define MAX_BOX_ELEMENTS (MAX_BOX_LENGTH/sizeof(void *))

Best Regards, Ivan
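The arithmetic in the comment above can be checked with a short sketch. It only reproduces the two macros quoted from Dk/Dkbox.h and the suggested margin formula; the function names are illustrative, and the pointer sizes (4 and 8 bytes) are the ones the comment assumes for 32-bit and 64-bit builds.

```python
MAX_BOX_LENGTH = 0xFFFFFF  # box length must fit in 3 bytes, as quoted from Dk/Dkbox.h


def max_box_elements(pointer_size: int) -> int:
    """Mirror MAX_BOX_LENGTH / sizeof(void *) using integer division."""
    return MAX_BOX_LENGTH // pointer_size


def safe_maxrows(pointer_size: int) -> int:
    """The margin suggested in the comment: ((16*1024*1024)/pointer_size) - 2."""
    return (16 * 1024 * 1024) // pointer_size - 2


proposed = 10 * 1024 * 1024  # the value from the temporary workaround

for bits, pointer_size in ((32, 4), (64, 8)):
    print(
        f"{bits}-bit: MAX_BOX_ELEMENTS={max_box_elements(pointer_size)}, "
        f"safe maxrows={safe_maxrows(pointer_size)}, "
        f"workaround value fits: {proposed <= max_box_elements(pointer_size)}"
    )
```

Under these assumptions the workaround value of 10*1024*1024 (10485760) exceeds MAX_BOX_ELEMENTS on both build types, which is the concern raised in the comment.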

saleem-muhammad commented 6 years ago

Thanks. I have changed the maxrows limit to 64*1024*1024-2 in libsrc/Wi/sparql_io.sql, and now I am able to get up to 20 million results. Beyond that, I get the following error:

Exception in thread "main" HttpException: 500
        at com.hp.hpl.jena.sparql.engine.http.HttpQuery.rewrap(HttpQuery.java:414)
        at com.hp.hpl.jena.sparql.engine.http.HttpQuery.execGet(HttpQuery.java:358)
        at com.hp.hpl.jena.sparql.engine.http.HttpQuery.exec(HttpQuery.java:295)
        at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execResultSetInner(QueryEngineHTTP.java:346)
        at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execSelect(QueryEngineHTTP.java:338)

I think it would be cool if this parameter were somehow tied to ResultSetMaxRows in the configuration file, i.e., virtuoso.ini.

pfps commented 6 years ago

I don't see how you are getting 20 million results, as that is bigger than the MAX_BOX_ELEMENTS limit mentioned by iv-an-ru.