netwerk-digitaal-erfgoed / ld-workbench

A CLI tool for transforming large RDF datasets using pure SPARQL.

HTTP status 400 on Iterator with GraphDB #44

Closed. coret closed this issue 5 months ago

coret commented 8 months ago

I get an HTTP status 400 error (Missing parameter: query) when I use a GraphDB endpoint. There is no problem when I use the Jena endpoint https://service.archief.nl/sparql

$ comunica-sparql https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry -l debug -q "SELECT * WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/2.10.62ntfoto> } LIMIT 10" -i sparql
[2024-03-12T17:36:06.661Z]  INFO: Requesting https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry {
  headers: {
    accept: 'application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,application/json;q=0.45,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18',
    'user-agent': 'Comunica/actor-http-node-fetch (Node.js v16.14.2; linux)'
  },
  method: 'GET',
  actor: 'https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-init-sparql/^1.0.0/config/sets/http.json#myHttpFetcher'
}
[2024-03-12T17:36:06.761Z]  INFO: Identified as file source: https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry {
  actor: 'https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-init-sparql/^1.0.0/config/sets/resolve-hypermedia.json#myHypermediaNoneResolver'
}
Could not retrieve https://triplestore.netwerkdigitaalerfgoed.nl/repositories/registry (HTTP status 400):
Missing parameter: query

When I use the tip from @rubensworks in https://github.com/comunica/comunica/issues/962 (prepend sparql@ to the endpoint URL) the query does return results:

comunica-sparql sparql@https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie -l debug -q "SELECT * WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/2.10.62ntfoto> } LIMIT 10"
[2024-03-12T17:37:50.972Z]  INFO: Requesting https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie {
  headers: {
    accept: 'application/sparql-results+json;q=1.0,application/sparql-results+xml;q=0.7',
    'content-length': '178',
    'content-type': 'application/x-www-form-urlencoded',
    'user-agent': 'Comunica/actor-http-node-fetch (Node.js v16.14.2; linux)'
  },
  method: 'POST',
  actor: 'https://linkedsoftwaredependencies.org/bundles/npm/@comunica/actor-init-sparql/^1.0.0/config/sets/http.json#myHttpFetcher'
}
[
{"?this":"https://archief.nl/doc/fotorecord/0605f6c6-ced2-41f9-f1b6-f77a3b075ec3"},
{"?this":"https://archief.nl/doc/fotorecord/0934841b-c2a2-bb3c-cc98-278b9f8975e5"},
{"?this":"https://archief.nl/doc/fotorecord/09c03895-aa07-741b-e0a7-c128452797b2"},
{"?this":"https://archief.nl/doc/fotorecord/0bc23fe5-379b-2c33-d3b5-ba2ab391cd47"},
{"?this":"https://archief.nl/doc/fotorecord/10837478-2235-3862-9e0e-000ab5dc9a58"},
{"?this":"https://archief.nl/doc/fotorecord/1cca76ab-33ca-8b92-462b-f1b1179f92c5"},
{"?this":"https://archief.nl/doc/fotorecord/1e426c5f-4128-49b6-1964-f6685147df2d"},
{"?this":"https://archief.nl/doc/fotorecord/29feaab9-433f-65bf-0ebd-f7bf2dc91b9c"},
{"?this":"https://archief.nl/doc/fotorecord/2a8019ab-b970-8a64-1c64-5f6bcb8a4dbc"},
{"?this":"https://archief.nl/doc/fotorecord/33398b2f-098a-7b42-273d-1a6ad39724ef"}
]
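
The two debug logs differ in how the request is made: the failing one is a plain GET that carries no query, while the working one is a POST with a form-encoded body. As a minimal sketch of that second request shape (the SPARQL Protocol's "query via POST with URL-encoded parameters"), the hypothetical helper `buildSparqlPostRequest` below only builds the request and does not send it. The helper name, the `SparqlRequest` type, and the example query are illustrative assumptions, not taken from Comunica or ld-workbench.

```typescript
// Hypothetical type describing the parts of the request (not a Comunica type).
interface SparqlRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: builds, but does not send, a SPARQL Protocol POST request.
function buildSparqlPostRequest(endpoint: string, query: string): SparqlRequest {
  // The query travels in the form-encoded `query` parameter. This is the
  // parameter the "Missing parameter: query" response says is absent.
  const body = new URLSearchParams({ query }).toString();
  return {
    url: endpoint,
    method: 'POST',
    headers: {
      accept: 'application/sparql-results+json;q=1.0,application/sparql-results+xml;q=0.7',
      'content-type': 'application/x-www-form-urlencoded',
    },
    body,
  };
}

// Example query chosen for illustration only.
const request = buildSparqlPostRequest(
  'https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie',
  'SELECT * WHERE { ?s ?p ?o } LIMIT 10',
);
console.log(request.method, request.body);
```

The returned object mirrors the method and headers visible in the second log, apart from the `content-length` and `user-agent` headers, which an HTTP client normally adds itself.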

However, I cannot specify endpoint: sparql@https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie in my ld-workbench config.yaml:

Error in the iterator of stage `Stage 1`: "sparql@https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie" is not a valid URL

Should the ld-workbench use Comunica in another way?

mightymax commented 5 months ago

If this is indeed a workaround for GraphDB, I think the best solution is to change the check for valid URLs and add a unit test for this case. The solution would then be to strip sparql@ from the endpoint before testing the validity of the URL.
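
A minimal sketch of that idea, assuming the validity check is based on the standard `URL` constructor. The function names `stripSparqlPrefix` and `isValidEndpoint` are hypothetical and are not taken from the ld-workbench code base.

```typescript
// Hypothetical helper: removes a leading "sparql@" source-type hint, if present.
function stripSparqlPrefix(endpoint: string): string {
  return endpoint.replace(/^sparql@/, '');
}

// Hypothetical check: the endpoint is accepted when the part after the
// optional prefix parses as an http(s) URL.
function isValidEndpoint(endpoint: string): boolean {
  try {
    const url = new URL(stripSparqlPrefix(endpoint));
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    // `new URL` throws when the string cannot be parsed as an absolute URL.
    return false;
  }
}

console.log(
  isValidEndpoint('sparql@https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie'),
); // true
console.log(isValidEndpoint('not a url')); // false
```

The two calls at the end correspond to the cases a unit test for this change would likely cover: a prefixed endpoint that should pass and a malformed value that should still be rejected.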

github-actions[bot] commented 5 months ago

:tada: This issue has been resolved in version 1.0.2 :tada:

Your semantic-release bot :package::rocket:

coret commented 5 months ago

I'm using endpoint: sparql@https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie in my config (on the nafoto pipeline) but I immediately get an error:

The Iterator did not run succesfully, it could not get the results from the endpoint
https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie (offset: 0, limit 10): Could not retrieve
https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie (HTTP status 400)

The GraphDB log file shows the following error:

[INFO ] 2024-05-29 20:45:23,181 [repositories/nafotocollectie | c.o.f.s.GraphDBProtocolExceptionResolver] 
Client sent bad request (400)
org.eclipse.rdf4j.http.server.ClientHTTPException: Missing parameter: query

ddeboer commented 5 months ago

@coret Fixed. Please remove the sparql@ prefix from your config, as ld-workbench now handles this automatically.
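
One way such automatic handling could look, as a sketch only and not necessarily what the fix does: turn the configured endpoint into an explicit source object of the form `{ type: 'sparql', value: ... }`, which is the shape Comunica documents for forcing a source type in the `sources` list of the query context. The `SparqlSource` type and the `toSparqlSource` function are hypothetical names.

```typescript
// Hypothetical type mirroring the explicit source-object shape.
interface SparqlSource {
  type: 'sparql';
  value: string;
}

// Hypothetical helper: accepts an endpoint with or without the "sparql@"
// prefix and always returns an explicitly typed SPARQL source.
function toSparqlSource(endpoint: string): SparqlSource {
  return { type: 'sparql', value: endpoint.replace(/^sparql@/, '') };
}

const source = toSparqlSource(
  'https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie',
);
console.log(source.type, source.value);
```

With an explicit type there is nothing left to auto-detect, so the endpoint would not be identified as a file source the way the first debug log shows, and the config can keep a plain URL.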

github-actions[bot] commented 5 months ago

:tada: This issue has been resolved in version 1.4.1 :tada:

Your semantic-release bot :package::rocket: