netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register
https://datasetregister.netwerkdigitaalerfgoed.nl/api/
European Union Public License 1.2

OOM during validation #384

Open ddeboer opened 2 years ago

ddeboer commented 2 years ago

Validating https://archief.nl/doc/datacatalog consumes more than 2.5 GB of RAM, causing an out-of-memory (OOM) failure in production.

coret commented 2 years ago

I performed a capacity test to find out at what file size (or number of triples / datasets) the validation fails.

Note 1: all Turtle files are UTF-8 and were generated from what openarch.nl knows about these data catalogs.

Note 2: all files have been checked against the Dataset Register SHACL, both with Apache Jena's shacl validate and with https://github.com/zazuko/rdf-validate-shacl (which the Dataset Register uses).

Note 3: this test focused on the validation part; these datasets have not yet been added to the Dataset Register.

The table below shows the (reproducible) results. A data catalog with 7612 datasets (srt) appears to be too much: the validation API times out (504).

What still needs to be investigated is why gra results in a 406 (no dataset found, log below), and why rar and hga result in a 502 (Bad Gateway).

| org | triples | datasets | filesize (bytes) | shacl validate | result (HTTP status) |
|-----|---------|----------|------------------|----------------|----------------------|
| dom | 325 | 21 | 15057 | VALID | 200 |
| krd | 2066 | 137 | 97154 | VALID | 200 |
| snv | 2838 | 180 | 126467 | VALID | 200 |
| rmd | 2871 | 181 | 138844 | VALID | 200 |
| arg | 3139 | 196 | 150234 | VALID | 200 |
| swl | 4147 | 259 | 184978 | VALID | 200 |
| rht | 6684 | 419 | 310376 | VALID | 200 |
| wat | 6969 | 435 | 317683 | VALID | 200 |
| nle | 10366 | 648 | 469500 | VALID | 200 |
| vev | 17051 | 1093 | 813380 | VALID | 200 |
| svp | 19125 | 1228 | 875834 | VALID | 200 |
| gaz | 25964 | 1660 | 1212331 | VALID | 200 |
| eal | 38508 | 2473 | 1814065 | VALID | 200 |
| rad | 41907 | 2668 | 1933876 | VALID | 200 |
| gra | 42549 | 2829 | 1948773 | VALID | 406 |
| wfa | 63421 | 3972 | 2928936 | VALID | 200 |
| rar | 65550 | 4122 | 3023339 | VALID | 502 |
| rhe | 81747 | 5111 | 4040792 | VALID | 200 |
| hga | 90143 | 5649 | 4292578 | VALID | 502 |
| nha | 93216 | 5879 | 4332113 | VALID | 200 |
| srt | 114190 | 7612 | 5480117 | VALID | 504 |
| gld | 194406 | 12796 | 8841888 | VALID | |
| hua | 593085 | 37332 | 26518728 | VALID | |

The test column is the same command for every row, with <org> replaced by the value in the org column:

curl -i -X PUT 'https://datasetregister.netwerkdigitaalerfgoed.nl/api/datasets/validate' -H 'link: http://www.w3.org/ns/ldp#RDFSource; rel="type",http://www.w3.org/ns/ldp#Resource; rel="type"' -H 'content-type: application/ld+json' --data-binary '{"@id":"https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/<org>.ttl"}'
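
Since the commands in the test column differ only in the file name, the capacity test could also be scripted. The following is a minimal sketch, not the script that was actually used: it rebuilds the same PUT request with Python's standard library, and the function names `build_request` and `validation_status` are made up for illustration. The endpoint, headers and body are copied from the curl commands.

```python
import json
import urllib.error
import urllib.request

# Values copied from the curl commands in the test column.
VALIDATE_URL = "https://datasetregister.netwerkdigitaalerfgoed.nl/api/datasets/validate"
ENTRIES_BASE = (
    "https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/"
    "dataset-register-entries/main/ANL/"
)
LINK_HEADER = (
    'http://www.w3.org/ns/ldp#RDFSource; rel="type",'
    'http://www.w3.org/ns/ldp#Resource; rel="type"'
)


def build_request(org: str) -> urllib.request.Request:
    """Build the validation request for one org code (hypothetical helper)."""
    body = json.dumps({"@id": f"{ENTRIES_BASE}{org}.ttl"}).encode("utf-8")
    return urllib.request.Request(
        VALIDATE_URL,
        data=body,
        method="PUT",
        headers={"link": LINK_HEADER, "content-type": "application/ld+json"},
    )


def validation_status(org: str, timeout: float = 120.0) -> int:
    """Send the request and return the HTTP status code (hypothetical helper).

    4xx/5xx responses raise HTTPError in urllib, so the code is read from
    the exception. Network-level failures (DNS, socket timeout) are not
    handled here and will propagate to the caller.
    """
    try:
        with urllib.request.urlopen(build_request(org), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling `validation_status("dom")` for each org code in the table would then fill the result column. The 120-second timeout is an arbitrary choice for this sketch.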

Log entry for gra's 406:

{"level":30,"time":1639749427946,"pid":19,"hostname":"registry-api-6d9f5b9df6-mvbvw","reqId":"req-5","msg":"No dataset found at URL https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/gra.ttl: No dataset found at URL: https://linkedsoftwaredependencies.org/bundles/npm/rdf-dereference/^1.0.0/config/config-default.json#mediatorRdfParseHandle mediated over all rejecting actors:\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream"}
{"level":30,"time":1639749427948,"pid":19,"hostname":"registry-api-6d9f5b9df6-mvbvw","reqId":"req-5","res":{"statusCode":406},"responseTime":266.2652578353882,"msg":"request completed"}
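
For triaging entries like these, a small helper can pull out the status code and the media types the parser rejected. This is only a sketch that assumes the one-JSON-object-per-line format shown above; the function name `summarize` and the regular expression are illustrative and not part of the Dataset Register.

```python
import json
import re

# Matches the "Unrecognized media type: ..." fragments seen in the msg field.
MEDIA_TYPE_PATTERN = re.compile(r"Unrecognized media type: (\S+)")


def summarize(log_line: str) -> dict:
    """Extract request id, HTTP status and rejected media types from one log line."""
    entry = json.loads(log_line)
    message = entry.get("msg", "")
    return {
        "req_id": entry.get("reqId"),
        # Only "request completed" entries carry a res.statusCode field.
        "status": entry.get("res", {}).get("statusCode"),
        "rejected_media_types": sorted(set(MEDIA_TYPE_PATTERN.findall(message))),
    }


# The second log line from above, copied verbatim.
completed = '{"level":30,"time":1639749427948,"pid":19,"hostname":"registry-api-6d9f5b9df6-mvbvw","reqId":"req-5","res":{"statusCode":406},"responseTime":266.2652578353882,"msg":"request completed"}'
print(summarize(completed))
# prints {'req_id': 'req-5', 'status': 406, 'rejected_media_types': []}
```

Applied to the first log line, the same function would report `application/octet-stream` once as the rejected media type.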

ddeboer commented 2 years ago

What does VALID mean? I’m getting a 400 for a random registration URL such as https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/rad.ttl.

One way to limit validation memory usage is setting maxErrors. A downside would be that we can no longer guarantee that all validation errors are returned. As a result, a client may fix all reported errors, validate again, and then get new errors.
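
The trade-off can be illustrated with a toy validator. This is a simulation only: it does not use rdf-validate-shacl, and the `validate` function, its `max_errors` parameter and the two checks are invented for the example. It merely mimics what a cap on the number of reported errors does to the client's fix-and-revalidate loop.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Check = Callable[[Record], Optional[str]]


def require(field: str) -> Check:
    """Build a check that reports a record lacking the given field."""
    def check(record: Record) -> Optional[str]:
        if field in record:
            return None
        return f"{record['id']}: missing {field}"
    return check


def validate(
    records: List[Record],
    checks: List[Check],
    max_errors: Optional[int] = None,
) -> List[str]:
    """Collect violations, stopping as soon as max_errors have been found."""
    errors: List[str] = []
    for record in records:
        for check in checks:
            message = check(record)
            if message is None:
                continue
            errors.append(message)
            if max_errors is not None and len(errors) >= max_errors:
                return errors  # stop early: later violations stay unreported
    return errors


checks = [require("title"), require("license")]
catalog = [{"id": "d1"}, {"id": "d2", "title": "B"}]

full_report = validate(catalog, checks)                   # 3 errors
capped_report = validate(catalog, checks, max_errors=2)   # only the 2 for d1

# The client fixes everything in the capped report (both concern d1) ...
catalog[0].update(title="A", license="CC0")
# ... and the next round reports an error that was there all along.
second_round = validate(catalog, checks, max_errors=2)
print(second_round)  # prints ['d2: missing license']
```

The cap bounds how many results are held in memory at once, at the price of extra validation rounds for catalogs with many violations.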

coret commented 2 years ago

VALID just indicates the result of local validation (shacl validate and rdf-validate-shacl).

The following test (from the test column of entry rad) gives me an HTTP 200 (and a lot of JSON-LD):

curl -i -X PUT 'https://datasetregister.netwerkdigitaalerfgoed.nl/api/datasets/validate' \
  -H 'link: http://www.w3.org/ns/ldp#RDFSource; rel="type",http://www.w3.org/ns/ldp#Resource; rel="type"' \
  -H 'content-type: application/ld+json' \
  --data-binary '{"@id":"https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/rad.ttl"}'

I think this capacity test lets us settle on 1000 as a safe maximum number of datasets in a data catalog. Above this number, paging (via Hydra) should be used. The next step is to see whether this number also holds for the "post dataset" function.
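
To make the paging idea concrete, here is a rough sketch of how a publisher might split a large catalog into pages of at most 1000 datasets, each linked to the next through a Hydra `PartialCollectionView`. Several things are assumptions made for illustration: the `paginate` function, the `?page=N` URL pattern, the example.org IRIs, and the use of `schema:DataCatalog` / `schema:dataset` with the `https://schema.org/` namespace. Whether the Dataset Register crawler follows links in exactly this shape has not been verified here.

```python
from typing import Dict, List

MAX_DATASETS_PER_PAGE = 1000  # the limit proposed above

CONTEXT = {
    "schema": "https://schema.org/",            # assumed vocabulary for the catalog
    "hydra": "http://www.w3.org/ns/hydra/core#",
}


def paginate(
    dataset_iris: List[str],
    catalog_url: str,
    page_size: int = MAX_DATASETS_PER_PAGE,
) -> List[Dict]:
    """Split dataset IRIs into JSON-LD page documents linked via hydra:view."""
    chunks = [
        dataset_iris[start:start + page_size]
        for start in range(0, len(dataset_iris), page_size)
    ] or [[]]  # an empty catalog still yields one (empty) page

    def page_url(number: int) -> str:
        return f"{catalog_url}?page={number}"  # hypothetical URL pattern

    documents = []
    for number, chunk in enumerate(chunks, start=1):
        view = {
            "@id": page_url(number),
            "@type": "hydra:PartialCollectionView",
            "hydra:first": {"@id": page_url(1)},
            "hydra:last": {"@id": page_url(len(chunks))},
        }
        if number > 1:
            view["hydra:previous"] = {"@id": page_url(number - 1)}
        if number < len(chunks):
            view["hydra:next"] = {"@id": page_url(number + 1)}
        documents.append({
            "@context": CONTEXT,
            "@id": catalog_url,
            "@type": "schema:DataCatalog",
            "schema:dataset": [{"@id": iri} for iri in chunk],
            "hydra:view": view,
        })
    return documents
```

With the 7612 datasets of srt, this would produce eight pages: seven with 1000 datasets each and a final one with 612.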