ddeboer opened 2 years ago
I performed a capacity test to see at what file size (or number of triples/datasets) validation fails.
Note 1: all Turtle files are UTF-8 and were generated from the knowledge openarch.nl has of these data catalogs.
Note 2: all files have been checked against the Dataset Register SHACL with both Apache Jena's `shacl validate` and https://github.com/zazuko/rdf-validate-shacl (the validator used by the Dataset Register).
Note 3: this test focused on the validation step only; these datasets have not yet been added to the Dataset Register.
The table below shows the (reproducible) results. A data catalog with 7612 datasets (srt) appears to be too large: the validation API times out (504).
What still needs to be investigated is why gra results in a 406 (no datasets found, log below), and why rar and hga result in a 502 (Bad Gateway).
Log entry for gra's 406:

```
{"level":30,"time":1639749427946,"pid":19,"hostname":"registry-api-6d9f5b9df6-mvbvw","reqId":"req-5","msg":"No dataset found at URL https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/gra.ttl: No dataset found at URL: https://linkedsoftwaredependencies.org/bundles/npm/rdf-dereference/^1.0.0/config/config-default.json#mediatorRdfParseHandle mediated over all rejecting actors:\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream\nUnrecognized media type: application/octet-stream"}
{"level":30,"time":1639749427948,"pid":19,"hostname":"registry-api-6d9f5b9df6-mvbvw","reqId":"req-5","res":{"statusCode":406},"responseTime":266.2652578353882,"msg":"request completed"}
```
What does `VALID` mean? I'm getting a 400 for a random registration URL such as https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/rad.ttl.
One way to limit validation memory usage is setting `maxErrors`; see the sketch below. A downside is that we would then no longer guarantee that all validation errors are returned: a client may fix all reported errors, validate again, and be presented with a new batch of errors.
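For illustration, a minimal sketch of how `maxErrors` could be passed to rdf-validate-shacl, following the library's documented constructor options. The file paths and the limit of 100 are assumptions, not the Register's actual configuration:

```ts
import fs from 'node:fs';
import rdf from 'rdf-ext';
import ParserN3 from '@rdfjs/parser-n3';
import SHACLValidator from 'rdf-validate-shacl';

// Load a Turtle file into an in-memory dataset.
async function loadDataset(filePath: string) {
  const parser = new ParserN3({ factory: rdf });
  return rdf.dataset().import(parser.import(fs.createReadStream(filePath)));
}

// Hypothetical paths: substitute the Register's SHACL shapes and a catalog file.
const shapes = await loadDataset('shacl/register.ttl');
const data = await loadDataset('ANL/rad.ttl');

// maxErrors makes the validator stop after collecting N violations instead of
// walking the entire catalog; this bounds memory at the cost of a partial report.
// By default the validator only stops once all violations have been found.
const validator = new SHACLValidator(shapes, { factory: rdf, maxErrors: 100 });
const report = await validator.validate(data);
console.log(report.conforms, report.results.length);
```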
`VALID` just indicates the result of local validation (`shacl validate` and rdf-validate-shacl).
The following test (from the test column of entry rad) gives me an HTTP 200 (and a lot of JSON-LD):

```sh
curl -i -X PUT 'https://datasetregister.netwerkdigitaalerfgoed.nl/api/datasets/validate' \
  -H 'link: http://www.w3.org/ns/ldp#RDFSource; rel="type",http://www.w3.org/ns/ldp#Resource; rel="type"' \
  -H 'content-type: application/ld+json' \
  --data-binary '{"@id":"https://raw.githubusercontent.com/netwerk-digitaal-erfgoed/dataset-register-entries/main/ANL/rad.ttl"}'
```
From this capacity test I think we can set 1000 as a safe maximum number of datasets in a data catalog. Above that number, paging (via Hydra) should be used; see the Turtle sketch below. The next step is to check whether this number also holds for the "post dataset" function.
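For illustration, a minimal Turtle sketch of what a paged catalog could look like using the Hydra core vocabulary's `hydra:PartialCollectionView`. The URLs and page boundaries are hypothetical:

```turtle
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix hydra: <http://www.w3.org/ns/hydra/core#> .

# Hypothetical paged catalog: each page lists at most 1000 datasets.
<https://example.org/catalog> a dcat:Catalog ;
  hydra:view <https://example.org/catalog?page=1> ;
  dcat:dataset <https://example.org/dataset/1>, <https://example.org/dataset/2> .

<https://example.org/catalog?page=1> a hydra:PartialCollectionView ;
  hydra:first <https://example.org/catalog?page=1> ;
  hydra:next  <https://example.org/catalog?page=2> ;
  hydra:last  <https://example.org/catalog?page=8> .
```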
Validating https://archief.nl/doc/datacatalog consumes more than 2.5 GB of RAM, causing out-of-memory (OOM) errors on production.
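As a stopgap, and assuming the API runs on Node.js, the V8 heap limit could be raised (e.g. `NODE_OPTIONS=--max-old-space-size=4096`), but bounding the validator with `maxErrors` or paging large catalogs looks like the more structural fix.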