jakubklimek opened this issue 10 years ago
Do you have a minimal sample dataset that can be used to recreate this issue?
Unfortunately not. However, it now seems to me that sometimes after this error, Virtuoso crashes with nothing in the log (its process simply ends).
I ran the construct query above against the v6 & v7 VOS builds with the sample dataset you provided in issue #118, and both return 10001 rows:
SQL> SPARQL PREFIX s: <http://schema.org/> CONSTRUCT {?address ?p ?o} WHERE { ?address a s:PostalAddress ; ?p ?o . };
S P O
VARCHAR VARCHAR VARCHAR
---
http://linked.opendata.cz/resource/business-entity/CZ00250091/hq-address http://schema.org/addressRegion Sepekov
http://linked.opendata.cz/resource/business-entity/CZ01709453/hq-address http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/PostalAddress
http://linked.opendata.cz/resource/business-entity/CZ00869996/hq-address http://schema.org/streetAddress Husova 58, č.p. 741
.
.
.
http://linked.opendata.cz/resource/business-entity/CZ00109975/hq-address http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/PostalAddress
http://linked.opendata.cz/resource/domain/ares/person/1975-06-23/%25C4%258Derm%25C3%25A1k-martin/address http://schema.org/postalCode 27345
http://linked.opendata.cz/resource/business-entity/CZ00109916/hq-address http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/PostalAddress
10001 Rows. -- 8986 msec.
SQL>
I ran into this problem again on another Virtuoso instance (query).
I admit it queries a rather large dataset (600M triples), but that is no reason for Virtuoso to be unable to answer this query.
When I run the query in the link above it returns:
# Empty TURTLE
What is the expected result?
Sorry, I changed the URIs in the endpoint a bit. Try it now.
Can you upgrade to the latest version on develop/7 and try again, please?
Still happening with 788bb9da315b19605856091726079338320e6212
Please add the following parameter to your virtuoso.ini file, restart the database and try one more time:
[Parameters]
…
VectorSize = 1000
...
Added, no change.
virtuoso.ini
Parameters section:
[Parameters]
ServerPort = 1111
LiteMode = 0
DisableUnixSocket = 1
DisableTcpSocket = 0
;SSLServerPort = 2111
;SSLCertificate = cert.pem
;SSLPrivateKey = pk.pem
;X509ClientVerify = 0
;X509ClientVerifyDepth = 0
;X509ClientVerifyCAFile = ca.pem
ServerThreads = 20
CheckpointInterval = 60
O_DIRECT = 0
CaseMode = 2
MaxStaticCursorRows = 5000
CheckpointAuditTrail = 0
AllowOSCalls = 0
SchedulerInterval = 10
DirsAllowed = ., /data/virtuoso/upload, /usr/local/share/virtuoso/vad
ThreadCleanupInterval = 0
ThreadThreshold = 10
ResourcesCleanupInterval = 0
FreeTextBatchSize = 100000
SingleCPU = 0
VADInstallDir = /usr/local/share/virtuoso/vad/
PrefixResultNames = 0
RdfFreeTextRulesSize = 100
IndexTreeMaps = 256
MaxMemPoolSize = 200000000
PrefixResultNames = 0
MacSpotlight = 0
IndexTreeMaps = 64
DefaultIsolation = 2
MaxQueryMem = 1G
MaxVectorSize = 4000000
VectorSize = 1000
NumberOfBuffers=1360000
MaxDirtyBuffers=1000000
Still happening with the newest version 148e7c910eee8fd62c92647a6bd98c21dcece9da
Instance: http://ruian.linked.opendata.cz:8890/sparql
Query:
PREFIX s: <http://schema.org/>
PREFIX gml: <http://www.opengis.net/ont/gml#>
PREFIX ruian: <http://ruian.linked.opendata.cz/ontology/>
CONSTRUCT {?point ?p ?o}
FROM <http://ruian.linked.opendata.cz/resource/dataset>
FROM <http://ruian.linked.opendata.cz/resource/dataset/geocoding/krovak2WGS84>
WHERE
{
?point a gml:Point ;
^ruian:adresniBod ?misto;
?p ?o .
?misto a ruian:AdresniMisto ;
ruian:stavebniObjekt/ruian:castObce/ruian:obec/ruian:pou/ruian:orp/ruian:vusc/ruian:regionSoudrznosti ?rs.
VALUES ?rs {<http://ruian.linked.opendata.cz/resource/regiony-soudrznosti/51> }
}
If construct is replaced with select, does the query then run, i.e., return results?
Actually, yes: when I change it to select, it returns results.
Hi Jakub,
Are you still unable to provide a dataset enabling this to be reproduced? If we have a test case, it is easiest for development to fix.
BTW, does the error occur when running the explain() function on the query?
I am able to reproduce it with: http://internal.opendata.cz/db04.zip (145 MB) Query:
construct {?s ?p ?o} from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o}
explain('sparql construct {?s ?p ?o} from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o}');
returns
REPORT
VARCHAR
{
Precode:
0: vector := Call vector ( 1 , 0 , 1 , 1 , 1 , 2 )
5: vector := Call vector (vector)
10: vector := Call vector ()
15: BReturn 0
{ fork
RDF_QUAD_GS 20 rows(s_1_4_t1.S)
inlined G = #/rozvaha
RDF_QUAD_SP 10 rows(s_1_4_t1.P)
inlined S = s_1_4_t1.S
RDF_QUAD 0.83 rows(s_1_4_t1.O)
inlined P = k_s_1_4_t1.P , S = k_s_1_4_t1.S G = #/rozvaha
After code:
0: __ro2lo := Call __ro2lo (s_1_4_t1.O)
5: vector := Call vector (s_1_4_t1.S, s_1_4_t1.P, __ro2lo)
10: if ($52 "user_aggr_notfirst" = 1 ) then 25 else 14 unkn 14
14: $52 "user_aggr_notfirst" := := artm 1
18: user_aggr_ret := Call DB.DBA.SPARQL_CONSTRUCT_INIT ($53 "user_aggr_env")
25: user_aggr_ret := Call DB.DBA.SPARQL_CONSTRUCT_ACC ($53 "user_aggr_env", vector, vector, vector, 1 )
32: BReturn 0
}
skip node 1 set_ctr
After code:
0: callret-0 := Call DB.DBA.SPARQL_CONSTRUCT_FIN ($53 "user_aggr_env")
7: BReturn 0
Select (callret-0)
}
I have set up your test database and have been able to recreate the error when executing the construct query against the SPARQL endpoint:
Virtuoso 22023 Error SR...: The result vector is too large
SPARQL query: construct {?s ?p ?o} from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o}
However, via the isql command-line tool the query does return results, which is basically the complete set of triples in the QUAD store, i.e. 24+ million:
SQL> sparql select count(*) from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o};
callret-0
INTEGER
---
24010658
1 Rows. -- 6874 msec.
SQL>
So, is this a reasonable test case? Returning a 24+ million row result set does not seem reasonable from a SPARQL endpoint. On a reasonably managed endpoint, like the DBpedia endpoint we host, the size of the result set that can be returned is limited to about 10K rows, to protect against abusive use ...
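For what it's worth, one common way to pull a large graph through an endpoint that caps result sizes is to page the CONSTRUCT with ORDER BY/LIMIT/OFFSET. A sketch against the test graph above (the page size of 100000 is an arbitrary choice, and sorting 24M solutions has its own cost):

```sparql
# Fetch page 0, then repeat with OFFSET 100000, 200000, ...
# until a page comes back with fewer than 100000 triples.
# ORDER BY makes the paging deterministic across requests.
CONSTRUCT { ?s ?p ?o }
FROM <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha>
WHERE { ?s ?p ?o }
ORDER BY ?s ?p ?o
LIMIT 100000
OFFSET 0
```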
How many triples is the following query, which you indicated gave the error, expected to return?
PREFIX s: <http://schema.org/>
PREFIX gml: <http://www.opengis.net/ont/gml#>
PREFIX ruian: <http://ruian.linked.opendata.cz/ontology/>
CONSTRUCT {?point ?p ?o}
FROM <http://ruian.linked.opendata.cz/resource/dataset>
FROM <http://ruian.linked.opendata.cz/resource/dataset/geocoding/krovak2WGS84>
WHERE
{
?point a gml:Point ;
^ruian:adresniBod ?misto;
?p ?o .
?misto a ruian:AdresniMisto ;
ruian:stavebniObjekt/ruian:castObce/ruian:obec/ruian:pou/ruian:orp/ruian:vusc/ruian:regionSoudrznosti ?rs.
VALUES ?rs {<http://ruian.linked.opendata.cz/resource/regiony-soudrznosti/51> }
}
We use Virtuoso in our tools UnifiedViews (part of the LOD2 stack) and Payola as a data source. These are ETL and analytic tools, which usually download and transform large datasets using SPARQL. This operation does not have to be fast, but it has to return complete results eventually. As Virtuoso claims to be an RDF database, it seems strange to me that there would be a limit on the reasonable number of triples to be returned. We work with datasets of nearly 1B triples, and we need to do linking and transformations on them using SPARQL, which sometimes includes downloading large parts of them for processing elsewhere. Imagine a regular SQL database in a data warehouse: would it also be unreasonable to return 24M rows there? I think that there should not be such a limit for a database, and that includes an RDF database. A 10K triple limit is OK when you use the endpoint for browsing via a faceted browser or something similar, but it is not OK for transformations of whole datasets.
On the other hand, if there really is some official number of triples that would be considered unreasonable for Virtuoso, please let me know so that I can look for another solution.
Also acceptable would be a corresponding setting in virtuoso.ini, if that would help. There are VectorSize and MaxVectorSize, but these do not affect this issue.
To answer your question, the other query is expected to return approx. 50-100M triples.
This operation does not have to be fast, but it has to return complete results eventually.
huge +1 on this
The result of CONSTRUCT is a serialized text, not a large set of short rows, so it becomes a single BLOB stream in memory when queried via the SPARQL endpoint, and this can consume all system memory for large result sets. That is why there is a limit on its size via the HTTP endpoint, but not via isql, which will automatically use a cursor to split the result into pages.
Question from Orri: "what other RDF stores handle such large construct queries via their SPARQL endpoint?"
So, Orri suggests that if you really want to use construct, you do this via one of the SQL client CLIs (i.e., ODBC, JDBC, Jena, Sesame, etc.), which can handle such large results. Or create an insert query that writes the required results to a graph in Virtuoso, which can either be dumped to data sets and loaded into your application (UnifiedViews etc.) or queried and fed into the processing pipeline.
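The insert route suggested here might look like the following when run via isql (a sketch; the target graph IRI is made up for illustration):

```sql
SQL> SPARQL
     INSERT { GRAPH <http://example.org/rozvaha-copy> { ?s ?p ?o } }
     WHERE  { GRAPH <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha>
              { ?s ?p ?o } };
```

The copied graph can then be dumped to files server-side instead of being serialized through the HTTP endpoint.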
Thanks for the explanation, that finally brings some light into this. The question is of course whether the construct query could not be implemented in another way... like for the N-Quads format, which could surely be stored as a table and not a BLOB.
As to Orri's question, Apache Jena Fuseki running on Jena TDB works with these queries... specifically, for this one it returns a 4GB Turtle file in 700 seconds.
On 8/14/14 6:08 AM, jakubklimek wrote:
We use Virtuoso in our tools UnifiedViews http://unifiedviews.eu (part of LOD2 stack) and Payola http://live.payola.cz as a data source. These are ETL and analytic tools, which usually download and transform large datasets using SPARQL. This operation does not have to be fast, but it has to return complete results eventually. As Virtuoso claims to be an RDF database, it seems strange to me that there would be a limit on the reasonable number of triples to be returned.
Virtuoso doesn't have a hard limit on the size of a SPARQL query solution (result set). It has configurable parameters that enable you to control solution size in relation to query timeouts. Put differently, SPARQL query processing is like a quiz contest where Virtuoso is a contestant with a configurable timeout for answering a question. If the query solution's size and complexity exceed the allotted time, it can simply provide a partial solution, which is indicated to you via HTTP response headers.
We work with datasets with nearly 1B triples and we need to do linking and transformations on them using SPARQL, which sometimes includes downloading large parts of them for processing elsewhere. Imagine a regular SQL database in a data warehouse... would it also be unreasonable to return 24M rows there?
In any DBMS 24 million records has to come from somewhere and end up somewhere, even if this is all within the operation of said DBMS. Virtuoso is no different i.e., you can import loads of data, transform it, and dispatch it somewhere else if you choose.
I think that there should not be such a limit for a database, which includes an RDF database. A 10K triple limit is OK when you use the endpoint for browsing via a faceted browser or something similar, but it is not OK for transformations of whole datasets.
Why are you assuming that this is the case? Virtuoso has many interfaces (SPARQL and/or SQL).
On the other hand, if there really is some official number of triples that would be considered unreasonable for Virtuoso, please let me know so that I can look for another solution.
I don't know how you've arrived at these conclusions. Virtuoso is actually a sophisticated DBMS that happens to support SPARQL and SQL query languages, amongst many other capabilities. ETL isn't alien territory to Virtuoso which is also a very sophisticated Virtual DBMS and/or warehouse.
Also acceptable would be a corresponding setting in virtuoso.ini, if that would help something. There is VectorSize and MaxVectorSize but this does not affect this issue.
What is the issue?
To answer your question, the other query is supposed to return approx. 50-100M of triples.
You can import whatever you want, subject to configuration and host operating system and computing resources. You can transform what's inside Virtuoso, and export it wherever you want. The key thing is to understand what you are trying to achieve and how to achieve it reasonably.
On 8/15/14 6:53 AM, jakubklimek wrote:
Thanks for the explanation, that finally brings some light into this. The question is of course whether the construct query could not be implemented in another way... like for n-quads format, which could surely be stored as a table and not a BLOB.
As to Orri's question, Apache Jena Fuseki running on Jena TDB works with these queries... specifically for this one returns a 4GB turtle file in 700 seconds.
You can dump the same data from Virtuoso to a Named Graph or OS file (using the RDF dump feature).
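The dump feature referred to here is the dump_one_graph stored procedure from the Virtuoso documentation (it also comes up later in this thread). Once the procedure is loaded, a dump might look like this (a sketch; the output file prefix and the 1 GB per-file size limit are arbitrary choices):

```sql
SQL> dump_one_graph ('http://linked.opendata.cz/resource/dataset/mfcr/rozvaha',
                     './rozvaha_', 1000000000);
```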
I don't believe you simply execute a CONSTRUCT against any SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
Outline your Jena TDB sequence (bearing in mind its Java and JDBC centricity) workflow, and we can respond accordingly, assuming the earlier response from Hugh/Orri is still unclear.
What is the issue?
@kidehen did you look at github page? the issue is this error: Virtuoso 22023 Error SR...: The result vector is too large
I don't believe you simply execute a CONSTRUCT against any SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
The question is not about overcoming limitations; sure, there are other ways. The question is about executing a standards-compliant query (a SPARQL query via the SPARQL HTTP protocol) and getting the result. Right now the engine chokes with an error; it could stream the data instead.
Virtuoso doesn't have a hard limit on the size of a SPARQL query solution (result set). It has configurable parameters that enable you to control solution size in relation to query timeouts. Put differently, SPARQL query processing is like a quiz contest where Virtuoso is a contestant with a configurable timeout for answering a question. If the query solution's size and complexity exceed the allotted time, it can simply provide a partial solution, which is indicated to you via HTTP response headers.
@kidehen That is exactly the issue. Here we hit some unspecified limit on "vector size" that cannot be set anywhere.
In any DBMS 24 million records has to come from somewhere and end up somewhere, even if this is all within the operation of said DBMS. Virtuoso is no different i.e., you can import loads of data, transform it, and dispatch it somewhere else if you choose.
Yes, but here we are talking about doing this using SPARQL and it seems that Virtuoso currently has issues with that. I know I can dump the data, query using SQL etc. but we are still talking about SPARQL, because these transformations should be independent of any particular triplestore implementation and hence done using SPARQL - that is what it is here for.
What is the issue?
See the original GitHub issue. The issue is that query evaluation crashes on some limit that cannot be configured anywhere, and it is not clear what exactly that limit is.
I don't believe you simply execute a CONSTRUCT against and SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
Nevertheless, it is the case. I execute the query using wget, accessing Jena's SPARQL HTTP endpoint (Fuseki) and the solution is streamed and stored to a file.
On 8/15/14 10:29 AM, Alexey Zakhlestin wrote:
What is the issue?
@kidehen https://github.com/kidehen did you look at github page? the issue is this error:
Virtuoso 22023 Error SR...: The result vector is too large
It is too large for the default output target. Ideally, SPARQL would include an option to output to a variety of targets, which would open up the door for streams, etc., but that option doesn't exist, so we all implement workarounds.
I don't believe you simply execute a CONSTRUCT against and SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
the question is not about overcoming limitations — sure, there are some other ways. the question is about executing a standards-compliant query (sparql-query via sparql/http protocol and getting result).
If you have a SPARQL Protocol URL comprised of a CONSTRUCT that results in output of the magnitude requested, and that works, then show me and it will be implemented.
Simple example:
curl -O {SPARQL-Protocol-URL-for-Construct-returning-what-is-sought-here} .
right now engine chokes with an error. it could stream the data instead
Hugh/Orri/Ivan/Mitko:
I should be able to chunk the data via HTTP in regards to the example above. It can even gzip the chunked output. That's just about an internal TCN rule scoped to this kind of request.
If you have a SPARQL Protocol URL comprised of a CONSTRUCT that results in output of the magnitude requested, and that works, then show me and it will be implemented.
this gets you a 4GB ttl file from Jena TDB + Fuseki:
wget http://v7.xrg.cz:3030/mfcr/sparql?query=CONSTRUCT+%0A++%7B+%3Fs+%3Fp+%3Fo+.%7D%0AFROM+%3Chttp%3A%2F%2Flinked.opendata.cz%2Fresource%2Fdataset%2Fmfcr%2Frozvaha%3E%0AWHERE%0A++%7B+%3Fs+%3Fp+%3Fo+%7D%0A
And the same query on the same data using Virtuoso gives you the error:
http://internal.opendata.cz:8901/sparql?query=CONSTRUCT+%0A++%7B+%3Fs+%3Fp+%3Fo+.%7D%0AFROM+%3Chttp%3A%2F%2Flinked.opendata.cz%2Fresource%2Fdataset%2Fmfcr%2Frozvaha%3E%0AWHERE%0A++%7B+%3Fs+%3Fp+%3Fo+%7D%0A
On 8/15/14 11:36 AM, jakubklimek wrote:
Nevertheless, it is the case. I execute the query using wget, accessing Jena's SPARQL HTTP endpoint (Fuseki), and the solution is streamed and stored to a file.
curl -O {sparql-protocol-url} or wget {sparql-protocol-url} should work.
Virtuoso supports gzipped content and chunking over HTTP, so this might just be a case of creating a default re-write rule based on a transparent content negotiation (TCN) algorithm.
On 8/15/14 11:48 AM, jakubklimek wrote:
If you have a SPARQL Protocol URL comprised of a CONSTRUCT that results in output, of the magnitude requested, that works, then show me and it will be implemented.
wget http://v7.xrg.cz:3030/mfcr/sparql?query=CONSTRUCT+%0A++%7B+%3Fs+%3Fp+%3Fo+.%7D%0AFROM+%3Chttp%3A%2F%2Flinked.opendata.cz%2Fresource%2Fdataset%2Fmfcr%2Frozvaha%3E%0AWHERE%0A++%7B+%3Fs+%3Fp+%3Fo+%7D%0A
As per my last comments, we need to make a TCN based re-write rule for a SPARQL URL that includes a new parameter (serving as the data streaming hint) for dispatching the query solution to the requestor (ideally compressed and delivered in chunks).
[1] chunking -- http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
[2] see section 14.3 for compressed data exchange headers -- http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Virtuoso's limit for vectors is ~2500000 items for 32-bit builds and 1250000 items for 64-bit builds. Thus, a really long CONSTRUCT can exceed the limit for sure. SPARQL INSERT and SPARQL DELETE will not hit that limit, because they don't have to build a sequence of things. CONSTRUCT will also bypass the limit if ODBC/UDBC/iODBC is used to execute the SPARQL CONSTRUCT statement rather than a web service endpoint: all these protocols deliver the result of the CONSTRUCT as a result set of three columns S, P, O (or four columns if it is a SPARQL 1.1 CONSTRUCT with GRAPH ctor templates).
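So, as a sketch, running the same test query from earlier in the thread through isql rather than the /sparql endpoint should come back as a pageable three-column result set instead of one serialized BLOB:

```sql
SQL> SPARQL CONSTRUCT { ?s ?p ?o }
     FROM <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha>
     WHERE { ?s ?p ?o };
```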
The bug to fix is the crash after the error; that clearly should not happen.
Virtuoso's limit for vectors is ~2500000 items for 32-bit builds and 1250000 items for 64-bit builds. Thus, a really long CONSTRUCT can exceed the limit for sure.
Does this mean that there is no way around this error, and that Virtuoso is unable to return more items through a SPARQL endpoint while other triplestores can?
There is an internal ticket for development to provide a fix enabling such large results to be created via the /sparql endpoint.
Is there any progress with this? I'm facing strange behavior that seems related to this issue. I have a named graph with 691461 triples. Attempts to export the complete graph always return only 10001 triples.
I've tried two ways:
construct {?s ?p ?o} from <http://data.cubiss.nl/muzieklijstjes/> where {?s ?p ?o}
P.S. Using Virtuoso version 07.20.3212
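Note that a cut-off at exactly 10001 triples looks like the endpoint's configured row cap rather than the vector-size error discussed above. In a stock virtuoso.ini that cap is ResultSetMaxRows; the section and value shown here are the usual shipped defaults, so treat them as an assumption to verify against your own ini:

```ini
[SPARQL]
ResultSetMaxRows = 10000   ; rows returned per request via the /sparql endpoint
```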
+1 to the status request on this topic.
Also running into this issue. Would be nice to get an update!
Dumping large graphs is supposed to be done using the instructions in: http://docs.openlinksw.com/virtuoso/rdfperfdumpandreloadgraphs/
Unfortunately, the procedure seems not to exist:
*** Error 42001: VD [Virtuoso Server]SR185: Undefined procedure DB.DBA.dump_one_graph.
I tried creating it using the source in http://docs.openlinksw.com/virtuoso/rdfperfdumpandreloadgraphs/, however I then get the error:
*** Error 22023: VD [Virtuoso Server]SR601: Argument 0 of http_ttl_triple() should be an array of special format at line 110 of Top-Level: dump_one_graph('http://diogenes/proteomics_ms', './diogenes_proteomicsms', 1000000000)
Any further insights ?
Please change references like this:
env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
to:
env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0, 0);
i.e., the env vector has to have ten elements.
Is there any progress here?
I get the following error now
Virtuoso RDFXX Error Content length 21561445 is over the limit 20971520
SPARQL query: define sql:big-data-const 0
output-format:text/html
define sql:signal-void-variables 1 load <http://id.vlaanderen.be/vodap_validator/results/cache/Thursday/2e321f85fc0d0eaf28de8201ebb8a9c9.rdf> into <http://data.vlaanderen.be/id/dataset/2017-09-28T22:34:05Z>
I assume this is the same issue?
Bert
Please ignore my previous message; I found the parameter in the docs: http://docs.openlinksw.com/virtuoso/dbadm/
MaxDataSourceSize = 20971520. It controls the max size of content that can be sponged; the default is 20 MB.
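For reference, the parameter goes in the [SPARQL] section of virtuoso.ini (the 100 MB value below is an arbitrary example, not a recommendation):

```ini
[SPARQL]
MaxDataSourceSize = 104857600   ; bytes; shipped default is 20971520 (20 MB)
```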
@HughWilliams OK, 6 years later, and I am still running into this issue. Any progress?
Checking with development whether they have completed a fix for this issue, as it was planned to be fixed ...
@HughWilliams any news regarding this? Now it is 7 years...