jakubklimek opened this issue 10 years ago
Do you have a minimal sample dataset that can be used to recreate this issue?
Unfortunately not. However, it now seems to me that sometimes after this error, Virtuoso crashes with nothing in the log (its process simply ends).
I ran the construct query above against the v6 & v7 VOS builds with the sample dataset you provided in issue #118, and both return 10001 rows:
SQL> SPARQL PREFIX s: <http://schema.org/> CONSTRUCT {?address ?p ?o} WHERE { ?address a s:PostalAddress ; ?p ?o . };
S P O
VARCHAR VARCHAR VARCHAR
---
http://linked.opendata.cz/resource/business-entity/CZ00250091/hq-address http://schema.org/addressRegion Sepekov
http://linked.opendata.cz/resource/business-entity/CZ01709453/hq-address http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/PostalAddress
http://linked.opendata.cz/resource/business-entity/CZ00869996/hq-address http://schema.org/streetAddress Husova 58, č.p. 741
.
.
.
http://linked.opendata.cz/resource/business-entity/CZ00109975/hq-address http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/PostalAddress
http://linked.opendata.cz/resource/domain/ares/person/1975-06-23/%25C4%258Derm%25C3%25A1k-martin/address http://schema.org/postalCode 27345
http://linked.opendata.cz/resource/business-entity/CZ00109916/hq-address http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://schema.org/PostalAddress
10001 Rows. -- 8986 msec.
SQL>
I ran into this problem again on another Virtuoso instance (query).
I admit it queries a rather large dataset (600M triples), but that is no reason for Virtuoso to be unable to answer this query.
When I run the query in the link above it returns:
# Empty TURTLE
What is the expected result?
Sorry, I changed the URIs in the endpoint a bit. Try it now.
Can you upgrade to the latest version on develop/7 and try again, please?
Still happening with 788bb9da315b19605856091726079338320e6212
Please add the following parameter to your virtuoso.ini file, restart the database and try one more time:
[Parameters]
…
VectorSize = 1000
...
Added, no change.
virtuoso.ini
Parameters section:
[Parameters]
ServerPort = 1111
LiteMode = 0
DisableUnixSocket = 1
DisableTcpSocket = 0
;SSLServerPort = 2111
;SSLCertificate = cert.pem
;SSLPrivateKey = pk.pem
;X509ClientVerify = 0
;X509ClientVerifyDepth = 0
;X509ClientVerifyCAFile = ca.pem
ServerThreads = 20
CheckpointInterval = 60
O_DIRECT = 0
CaseMode = 2
MaxStaticCursorRows = 5000
CheckpointAuditTrail = 0
AllowOSCalls = 0
SchedulerInterval = 10
DirsAllowed = ., /data/virtuoso/upload, /usr/local/share/virtuoso/vad
ThreadCleanupInterval = 0
ThreadThreshold = 10
ResourcesCleanupInterval = 0
FreeTextBatchSize = 100000
SingleCPU = 0
VADInstallDir = /usr/local/share/virtuoso/vad/
PrefixResultNames = 0
RdfFreeTextRulesSize = 100
IndexTreeMaps = 256
MaxMemPoolSize = 200000000
PrefixResultNames = 0
MacSpotlight = 0
IndexTreeMaps = 64
DefaultIsolation = 2
MaxQueryMem = 1G
MaxVectorSize = 4000000
VectorSize = 1000
NumberOfBuffers=1360000
MaxDirtyBuffers=1000000
Still happening with the newest version 148e7c910eee8fd62c92647a6bd98c21dcece9da
Instance: http://ruian.linked.opendata.cz:8890/sparql
Query:
PREFIX s: <http://schema.org/>
PREFIX gml: <http://www.opengis.net/ont/gml#>
PREFIX ruian: <http://ruian.linked.opendata.cz/ontology/>
CONSTRUCT {?point ?p ?o}
FROM <http://ruian.linked.opendata.cz/resource/dataset>
FROM <http://ruian.linked.opendata.cz/resource/dataset/geocoding/krovak2WGS84>
WHERE
{
?point a gml:Point ;
^ruian:adresniBod ?misto;
?p ?o .
?misto a ruian:AdresniMisto ;
ruian:stavebniObjekt/ruian:castObce/ruian:obec/ruian:pou/ruian:orp/ruian:vusc/ruian:regionSoudrznosti ?rs.
VALUES ?rs {<http://ruian.linked.opendata.cz/resource/regiony-soudrznosti/51> }
}
If construct is replaced with select, does the query then run, i.e., return results?
Actually, yes: when I change it to select, it returns results.
Hi Jakub,
Are you still unable to provide a dataset enabling this to be reproduced? If we have a test case, it is easiest for development to fix.
BTW, does the error occur when running the explain() function on the query?
I am able to reproduce it with: http://internal.opendata.cz/db04.zip (145 MB) Query:
construct {?s ?p ?o} from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o}
explain('sparql construct {?s ?p ?o} from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o}');
returns
REPORT
VARCHAR
{
Precode:
0: vector := Call vector ( 1 , 0 , 1 , 1 , 1 , 2 )
5: vector := Call vector (vector)
10: vector := Call vector ()
15: BReturn 0
{ fork
RDF_QUAD_GS 20 rows(s_1_4_t1.S)
inlined G = #/rozvaha
RDF_QUAD_SP 10 rows(s_1_4_t1.P)
inlined S = s_1_4_t1.S
RDF_QUAD 0.83 rows(s_1_4_t1.O)
inlined P = k_s_1_4_t1.P , S = k_s_1_4_t1.S G = #/rozvaha
After code:
0: __ro2lo := Call __ro2lo (s_1_4_t1.O)
5: vector := Call vector (s_1_4_t1.S, s_1_4_t1.P, __ro2lo)
10: if ($52 "user_aggr_notfirst" = 1 ) then 25 else 14 unkn 14
14: $52 "user_aggr_notfirst" := := artm 1
18: user_aggr_ret := Call DB.DBA.SPARQL_CONSTRUCT_INIT ($53 "user_aggr_env")
25: user_aggr_ret := Call DB.DBA.SPARQL_CONSTRUCT_ACC ($53 "user_aggr_env", vector, vector, vector, 1 )
32: BReturn 0
}
skip node 1 set_ctr
After code:
0: callret-0 := Call DB.DBA.SPARQL_CONSTRUCT_FIN ($53 "user_aggr_env")
7: BReturn 0
Select (callret-0)
}
I have set up your test database and have been able to recreate the error when executing the construct query against the SPARQL endpoint:
Virtuoso 22023 Error SR...: The result vector is too large
SPARQL query: construct {?s ?p ?o} from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o}
However, via the isql command-line tool the query does return results, which is basically the complete set of triples in the QUAD store, i.e. 24+ million:
SQL> sparql select count(*) from <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha> where {?s ?p ?o};
callret-0
INTEGER
---
24010658
1 Rows. -- 6874 msec.
SQL>
So, is this a reasonable test case? Returning a 24+ million row result set does not seem reasonable from a SPARQL endpoint. On a reasonably managed endpoint, like the DBpedia endpoint we host, the size of the result set that can be returned is limited to about 10K rows, to protect against abusive use ...
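For what it's worth, one common way to pull a large graph through an endpoint that caps result sizes is to page the CONSTRUCT with ORDER BY/LIMIT/OFFSET. A sketch against the test graph above (the page size of 100000 is an arbitrary choice, and sorting 24M solutions has its own cost):

```sparql
# Fetch page 0, then repeat with OFFSET 100000, 200000, ...
# until a page comes back with fewer than 100000 triples.
# ORDER BY makes the paging deterministic across requests.
CONSTRUCT { ?s ?p ?o }
FROM <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha>
WHERE { ?s ?p ?o }
ORDER BY ?s ?p ?o
LIMIT 100000
OFFSET 0
```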
How many triples is the following query, which you indicated gave the error, expected to return?
PREFIX s: <http://schema.org/>
PREFIX gml: <http://www.opengis.net/ont/gml#>
PREFIX ruian: <http://ruian.linked.opendata.cz/ontology/>
CONSTRUCT {?point ?p ?o}
FROM <http://ruian.linked.opendata.cz/resource/dataset>
FROM <http://ruian.linked.opendata.cz/resource/dataset/geocoding/krovak2WGS84>
WHERE
{
?point a gml:Point ;
^ruian:adresniBod ?misto;
?p ?o .
?misto a ruian:AdresniMisto ;
ruian:stavebniObjekt/ruian:castObce/ruian:obec/ruian:pou/ruian:orp/ruian:vusc/ruian:regionSoudrznosti ?rs.
VALUES ?rs {<http://ruian.linked.opendata.cz/resource/regiony-soudrznosti/51> }
}
We use Virtuoso in our tools UnifiedViews (part of the LOD2 stack) and Payola as a data source. These are ETL and analytic tools, which usually download and transform large datasets using SPARQL. This operation does not have to be fast, but it has to return complete results eventually. As Virtuoso claims to be an RDF database, it seems strange to me that there would be a limit on the reasonable number of triples to be returned. We work with datasets of nearly 1B triples, and we need to do linking and transformations on them using SPARQL, which sometimes includes downloading large parts of them for processing elsewhere. Imagine a regular SQL database in a data warehouse: would it also be unreasonable to return 24M rows there? I think that there should not be such a limit for a database, and that includes an RDF database. A 10K triple limit is OK when you use the endpoint for browsing via a faceted browser or something similar, but it is not OK for transformations of whole datasets.
On the other hand, if there really is some official number of triples that would be considered unreasonable for Virtuoso, please let me know so that I can look for another solution.
Also acceptable would be a corresponding setting in virtuoso.ini, if that would help. There are VectorSize and MaxVectorSize, but these do not affect this issue.
To answer your question, the other query is expected to return approx. 50-100M triples.
This operation does not have to be fast, but it has to return complete results eventually.
huge +1 on this
The result of CONSTRUCT is a serialized text, not a large set of short rows, so it becomes a single BLOB stream in memory when queried via the SPARQL endpoint, and this can consume all system memory for large result sets. That is why there is a limit on its size via the HTTP endpoint, but not via isql, which will automatically use a cursor to split the result into pages.
Question from Orri: "what other RDF stores handle such large construct queries via their SPARQL endpoint?"
So, Orri suggests that if you really want to use construct, you do this via one of the SQL client CLIs (i.e., ODBC, JDBC, Jena, Sesame, etc.), which can handle such large results. Or create an insert query that writes the required results to a graph in Virtuoso, which can either be dumped to data sets and loaded into your application (UnifiedViews etc.) or queried and fed into the processing pipeline.
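The insert route suggested here might look like the following when run via isql (a sketch; the target graph IRI is made up for illustration):

```sql
SQL> SPARQL
     INSERT { GRAPH <http://example.org/rozvaha-copy> { ?s ?p ?o } }
     WHERE  { GRAPH <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha>
              { ?s ?p ?o } };
```

The copied graph can then be dumped to files server-side instead of being serialized through the HTTP endpoint.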
Thanks for the explanation, that finally brings some light into this. The question is of course whether the construct query could not be implemented in another way... like for the N-Quads format, which could surely be stored as a table and not a BLOB.
As to Orri's question, Apache Jena Fuseki running on Jena TDB works with these queries... specifically, for this one it returns a 4GB Turtle file in 700 seconds.
On 8/14/14 6:08 AM, jakubklimek wrote:
We use Virtuoso in our tools UnifiedViews http://unifiedviews.eu (part of LOD2 stack) and Payola http://live.payola.cz as a data source. These are ETL and analytic tools, which usually download and transform large datasets using SPARQL. This operation does not have to be fast, but it has to return complete results eventually. As Virtuoso claims to be an RDF database, it seems strange to me that there would be a limit on the reasonable number of triples to be returned.
Virtuoso doesn't have a hard limit on the size of a SPARQL query solution (result set). It has configurable parameters that enable you to control solution size in relation to query timeouts. Put differently, SPARQL query processing is like a quiz contest where Virtuoso is a contestant with a configurable timeout for answering a question. If the query solution's size and complexity exceed the allotted time, it can simply provide a partial solution, which is indicated to you via HTTP response headers.
We work with datasets with nearly 1B triples and we need to do linking and transformations on them using SPARQL, which sometimes includes downloading large parts of them for processing elsewhere. Imagine a regular SQL database in a data warehouse... would it also be unreasonable to return 24M rows there?
In any DBMS 24 million records has to come from somewhere and end up somewhere, even if this is all within the operation of said DBMS. Virtuoso is no different i.e., you can import loads of data, transform it, and dispatch it somewhere else if you choose.
I think that there should not be such a limit for a database, which includes an RDF database. A 10K triple limit is OK when you use the endpoint for browsing via a faceted browser or something similar, but it is not OK for transformations of whole datasets.
Why are you assuming that this is the case? Virtuoso has many interfaces (SPARQL and/or SQL).
On the other hand, if there really is some official number of triples that would be considered unreasonable for Virtuoso, please let me know so that I can look for another solution.
I don't know how you've arrived at these conclusions. Virtuoso is actually a sophisticated DBMS that happens to support SPARQL and SQL query languages, amongst many other capabilities. ETL isn't alien territory to Virtuoso which is also a very sophisticated Virtual DBMS and/or warehouse.
Also acceptable would be a corresponding setting in virtuoso.ini, if that would help something. There is VectorSize and MaxVectorSize but this does not affect this issue.
What is the issue?
To answer your question, the other query is supposed to return approx. 50-100M of triples.
You can import whatever you want, subject to configuration and host operating system and computing resources. You can transform what's inside Virtuoso, and export it wherever you want. The key thing is to understand what you are trying to achieve and how to achieve it reasonably.
On 8/15/14 6:53 AM, jakubklimek wrote:
Thanks for the explanation, that finally brings some light into this. The question is of course whether the construct query could not be implemented in another way... like for n-quads format, which could surely be stored as a table and not a BLOB.
As to Orri's question, Apache Jena Fuseki running on Jena TDB works with these queries... specifically for this one returns a 4GB turtle file in 700 seconds.
You can dump the same data from Virtuoso to a Named Graph or OS file (using the RDF dump feature).
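The dump feature referred to here is the dump_one_graph stored procedure from the Virtuoso documentation (it also comes up later in this thread). Once the procedure is loaded, a dump might look like this (a sketch; the output file prefix and the 1 GB per-file size limit are arbitrary choices):

```sql
SQL> dump_one_graph ('http://linked.opendata.cz/resource/dataset/mfcr/rozvaha',
                     './rozvaha_', 1000000000);
```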
I don't believe you simply execute a CONSTRUCT against any SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
Outline your Jena TDB sequence (bearing in mind its Java and JDBC centricity) workflow, and we can respond accordingly, assuming the earlier response from Hugh/Orri is still unclear.
What is the issue?
@kidehen did you look at github page? the issue is this error: Virtuoso 22023 Error SR...: The result vector is too large
I don't believe you simply execute a CONSTRUCT against any SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
The question is not about overcoming limitations; sure, there are other ways. The question is about executing a standards-compliant query (a SPARQL query via the SPARQL HTTP protocol) and getting the result. Right now the engine chokes with an error; it could stream the data instead.
Virtuoso doesn't have a hard limit on the size of a SPARQL query solution (result set). It has configurable parameters that enable you to control solution size in relation to query timeouts. Put differently, SPARQL query processing is like a quiz contest where Virtuoso is a contestant with a configurable timeout for answering a question. If the query solution's size and complexity exceed the allotted time, it can simply provide a partial solution, which is indicated to you via HTTP response headers.
@kidehen That is exactly the issue. Here we hit some unspecified limit on "vector size" that cannot be set anywhere.
In any DBMS 24 million records has to come from somewhere and end up somewhere, even if this is all within the operation of said DBMS. Virtuoso is no different i.e., you can import loads of data, transform it, and dispatch it somewhere else if you choose.
Yes, but here we are talking about doing this using SPARQL and it seems that Virtuoso currently has issues with that. I know I can dump the data, query using SQL etc. but we are still talking about SPARQL, because these transformations should be independent of any particular triplestore implementation and hence done using SPARQL - that is what it is here for.
What is the issue?
See the original GitHub issue. The issue is that query evaluation crashes on some limit that cannot be configured anywhere, and it is not clear what exactly that limit is.
I don't believe you simply execute a CONSTRUCT against and SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
Nevertheless, it is the case. I execute the query using wget, accessing Jena's SPARQL HTTP endpoint (Fuseki) and the solution is streamed and stored to a file.
On 8/15/14 10:29 AM, Alexey Zakhlestin wrote:
What is the issue?
@kidehen https://github.com/kidehen did you look at github page? the issue is this error:
Virtuoso 22023 Error SR...: The result vector is too large
It is too large for the default output target. Ideally, SPARQL would include an option to output to a variety of targets, which would open up the door for streams, etc., but that option doesn't exist, so we all implement workarounds.
I don't believe you simply execute a CONSTRUCT against and SPARQL compliant DB and get the output to your screen. It has to be routed to some storage location (modulo your screen).
the question is not about overcoming limitations — sure, there are some other ways. the question is about executing a standards-compliant query (sparql-query via sparql/http protocol and getting result).
If you have a SPARQL Protocol URL comprised of a CONSTRUCT that results in output of the magnitude requested, and that works, then show me and it will be implemented.
Simple example:
curl -O {SPARQL-Protocol-URL-for-Construct-returning-what-is-sought-here} .
right now engine chokes with an error. it could stream the data instead
Hugh/Orri/Ivan/Mitko:
I should be able to chunk the data via HTTP in regards to the example above. It can even gzip the chunked output. That's just about an internal TCN rule scoped to this kind of request.
If you have a SPARQL Protocol URL comprised of a CONSTRUCT that results in output of the magnitude requested, and that works, then show me and it will be implemented.
this gets you a 4GB ttl file from Jena TDB + Fuseki:
wget http://v7.xrg.cz:3030/mfcr/sparql?query=CONSTRUCT+%0A++%7B+%3Fs+%3Fp+%3Fo+.%7D%0AFROM+%3Chttp%3A%2F%2Flinked.opendata.cz%2Fresource%2Fdataset%2Fmfcr%2Frozvaha%3E%0AWHERE%0A++%7B+%3Fs+%3Fp+%3Fo+%7D%0A
And the same query on the same data using Virtuoso gives you the error:
http://internal.opendata.cz:8901/sparql?query=CONSTRUCT+%0A++%7B+%3Fs+%3Fp+%3Fo+.%7D%0AFROM+%3Chttp%3A%2F%2Flinked.opendata.cz%2Fresource%2Fdataset%2Fmfcr%2Frozvaha%3E%0AWHERE%0A++%7B+%3Fs+%3Fp+%3Fo+%7D%0A
On 8/15/14 11:36 AM, jakubklimek wrote:
Nevertheless, it is the case. I execute the query using wget, accessing Jena's SPARQL HTTP endpoint (Fuseki), and the solution is streamed and stored to a file.
curl -O {sparql-protocol-url} or wget {sparql-protocol-url} should work.
Virtuoso supports gzipped content and chunking over HTTP, so this might just be a case of creating a default re-write rule based on a transparent content negotiation (TCN) algorithm.
On 8/15/14 11:48 AM, jakubklimek wrote:
If you have a SPARQL Protocol URL comprised of a CONSTRUCT that results in output, of the magnitude requested, that works, then show me and it will be implemented.
wget http://v7.xrg.cz:3030/mfcr/sparql?query=CONSTRUCT+%0A++%7B+%3Fs+%3Fp+%3Fo+.%7D%0AFROM+%3Chttp%3A%2F%2Flinked.opendata.cz%2Fresource%2Fdataset%2Fmfcr%2Frozvaha%3E%0AWHERE%0A++%7B+%3Fs+%3Fp+%3Fo+%7D%0A
As per my last comments, we need to make a TCN based re-write rule for a SPARQL URL that includes a new parameter (serving as the data streaming hint) for dispatching the query solution to the requestor (ideally compressed and delivered in chunks).
[1] chunking -- http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html
[2] see section 14.3 for compressed data exchange headers -- http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Virtuoso's limit for vectors is ~2500000 items for 32-bit builds and 1250000 items for 64-bit builds. Thus, a really long CONSTRUCT can exceed the limit for sure. SPARQL INSERT and SPARQL DELETE will not hit that limit, because they don't have to build a sequence of things. CONSTRUCT will also bypass the limit if ODBC/UDBC/iODBC is used to execute the SPARQL CONSTRUCT statement rather than a web service endpoint: all these protocols deliver the result of the CONSTRUCT as a result set of three columns S, P, O (or four columns if it is a SPARQL 1.1 CONSTRUCT with GRAPH ctor templates).
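So, as a sketch, running the same test query from earlier in the thread through isql rather than the /sparql endpoint should come back as a pageable three-column result set instead of one serialized BLOB:

```sql
SQL> SPARQL CONSTRUCT { ?s ?p ?o }
     FROM <http://linked.opendata.cz/resource/dataset/mfcr/rozvaha>
     WHERE { ?s ?p ?o };
```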
The bug to fix is the crash after the error; that clearly should not happen.
Virtuoso's limit for vectors is ~2500000 items for 32-bit builds and 1250000 items for 64-bit builds. Thus, a really long CONSTRUCT can exceed the limit for sure.
Does this mean that there is no way around this error, and that Virtuoso is unable to return more items through a SPARQL endpoint while other triplestores can?
There is an internal ticket for development to provide a fix enabling such large results to be created via the /sparql endpoint.
Is there any progress with this? I'm facing strange behavior that seems related to this issue. I have a named graph with 691461 triples. Attempts to export the complete graph always return only 10001 triples.
I've tried two ways:
construct {?s ?p ?o} from <http://data.cubiss.nl/muzieklijstjes/> where {?s ?p ?o}
P.S. Using Virtuoso version 07.20.3212
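Note that a cut-off at exactly 10001 triples looks like the endpoint's configured row cap rather than the vector-size error discussed above. In a stock virtuoso.ini that cap is ResultSetMaxRows; the section and value shown here are the usual shipped defaults, so treat them as an assumption to verify against your own ini:

```ini
[SPARQL]
ResultSetMaxRows = 10000   ; rows returned per request via the /sparql endpoint
```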
+1 to the status request on this topic.
Also running into this issue. Would be nice to get an update!
Dumping large graphs is supposed to be done using the instructions in: http://docs.openlinksw.com/virtuoso/rdfperfdumpandreloadgraphs/
Unfortunately, the procedure seems not to exist:
*** Error 42001: VD [Virtuoso Server]SR185: Undefined procedure DB.DBA.dump_one_graph.
I tried creating it using the source in http://docs.openlinksw.com/virtuoso/rdfperfdumpandreloadgraphs/, however I then get the error:
*** Error 22023: VD [Virtuoso Server]SR601: Argument 0 of http_ttl_triple() should be an array of special format at line 110 of Top-Level: dump_one_graph('http://diogenes/proteomics_ms', './diogenes_proteomicsms', 1000000000)
Any further insights ?
Please change references like this:
env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0);
to:
env := vector (dict_new (16000), 0, '', '', '', 0, 0, 0, 0, 0);
i.e., the env vector has to have ten elements.
Is there any progress here?
I get the following error now
Virtuoso RDFXX Error Content length 21561445 is over the limit 20971520
SPARQL query: define sql:big-data-const 0
output-format:text/html
define sql:signal-void-variables 1 load <http://id.vlaanderen.be/vodap_validator/results/cache/Thursday/2e321f85fc0d0eaf28de8201ebb8a9c9.rdf> into <http://data.vlaanderen.be/id/dataset/2017-09-28T22:34:05Z>
I assume this is the same issue?
Bert
Please ignore my previous message; I found the parameter in the docs: http://docs.openlinksw.com/virtuoso/dbadm/
MaxDataSourceSize = 20971520. It controls the max size of content that can be sponged; the default is 20 MB.
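For reference, the parameter goes in the [SPARQL] section of virtuoso.ini (the 100 MB value below is an arbitrary example, not a recommendation):

```ini
[SPARQL]
MaxDataSourceSize = 104857600   ; bytes; shipped default is 20971520 (20 MB)
```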
@HughWilliams OK, 6 years later, and I am still running into this issue. Any progress?
Checking with development whether they have completed a fix for this issue, as it was planned to be fixed ...
@HughWilliams any news regarding this? Now it is 7 years...