netwerk-digitaal-erfgoed / ld-workbench

A CLI tool for transforming large RDF datasets using pure SPARQL.

Local SPARQL endpoint (Fuseki / Qlever) issues #98

Open coret opened 1 week ago

coret commented 1 week ago

I have loaded all NA photo collection N-Triples (including the 3 GB test file "7") into my local (production) GraphDB, and LD Workbench works great.

The same iterator/generator doesn't work when I use the endpoint https://service.archief.nl/sparql; it fails with: The Generator did not run successfully, it could not get the results from : Invalid SPARQL endpoint response from https://service.archief.nl/sparql (HTTP status 400).
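
To check whether the 400 comes from the endpoint itself rather than from LD Workbench, the iterator query can be sent to it directly, for example (assuming the endpoint accepts form-encoded POST requests):

$ curl -i https://service.archief.nl/sparql \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode 'query=SELECT * WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/7.000spaondntfoto> }'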

I thought this would also be a good moment to test out Qlever. But the LD Workbench generates a SPARQL query which Qlever can't handle (yet) and I don't see how to change the LD Workbench behaviour.

2024-06-27 12:00:06.842 - ERROR: Invalid SPARQL query: This parser currently doesn't support COUNT(*), please specify an explicit expression for the COUNT
2024-06-27 12:00:06.842 - ERROR: SELECT (COUNT(*) AS ?count) WHERE {
  SELECT ?this WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/2.10.62ntfoto>. }
  LIMIT 10
}
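
The count itself can be written in a form QLever accepts by counting an explicit variable instead of *, for example:

SELECT (COUNT(?this) AS ?count) WHERE {
  SELECT ?this WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/2.10.62ntfoto>. }
  LIMIT 10
}

but since this query is generated rather than written by me, I don't see where I could change it.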

I've also tested Fuseki (v5), but this (the same generator/iterator as in the cases above) ends in an out-of-memory error in LD Workbench. Adding a batchSize doesn't help.

$ ./apache-jena-fuseki-5.0.0/fuseki-server --tdb2 --loc data/NA /na-fotocollectie

 17:59:37 INFO  Server          :: Running in read-only mode for /na-fotocollectie
17:59:37 INFO  Server          :: Apache Jena Fuseki 5.0.0
17:59:37 WARN  ServletContextHandler :: BaseResource file:///home/http/fuseki.coret.org/./apache-jena-fuseki-5.0.0/webapp/ is aliased to file:///home/http/fuseki.coret.org/apache-jena-fuseki-5.0.0/webapp/ in oeje10w.WebAppContext@45394b31{org.apache.jena.fuseki.Servlet,/,b=file:///home/http/fuseki.coret.org/./apache-jena-fuseki-5.0.0/webapp/,a=STOPPED,h=oeje10s.SessionHandler@1ec7d8b3{STOPPED}}. May not be supported in future releases.
17:59:37 WARN  ContextHandler  :: Base Resource should not be an alias
17:59:37 INFO  Config          :: FUSEKI_HOME=/home/http/fuseki.coret.org/./apache-jena-fuseki-5.0.0
17:59:37 INFO  Config          :: FUSEKI_BASE=/home/http/fuseki.coret.org/run
17:59:37 INFO  Config          :: Shiro file: file:///home/http/fuseki.coret.org/run/shiro.ini
17:59:37 INFO  Config          :: Template file: templates/config-tdb2-dir-readonly
17:59:38 INFO  Server          :: Database: TDB2 dataset: location=data/NA
17:59:38 INFO  Server          :: Path = /na-fotocollectie
17:59:38 INFO  Server          ::   Memory: 4,0 GiB
17:59:38 INFO  Server          ::   Java:   17.0.11
17:59:38 INFO  Server          ::   OS:     Linux 6.1.0-13-amd64 amd64
17:59:38 INFO  Server          ::   PID:    2022356
17:59:38 INFO  Server          :: Started 2024/07/01 17:59:38 CEST on port 3030

$ npx @netwerk-digitaal-erfgoed/ld-workbench@latest -p "NA fotocollectie (via SPARQL)" -s 7-000
Welcome to LD Workbench version 2.4.2
▶ Starting pipeline “NA fotocollectie (via SPARQL)”
✔ Validating pipeline
⠼ Loading results from iterator
<--- Last few GCs --->

[2024610:0x62539b0]    62164 ms: Mark-sweep 4049.4 (4139.6) -> 4039.6 (4142.6) MB, 1293.3 / 0.0 ms  (average mu = 0.095, current mu = 0.012) task scavenge might not succeed
[2024610:0x62539b0]    64113 ms: Mark-sweep 4052.4 (4142.6) -> 4042.7 (4145.8) MB, 1933.1 / 0.0 ms  (average mu = 0.045, current mu = 0.008) task scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb09980 node::Abort() [node]
 2: 0xa1c235 node::FatalError(char const*, char const*) [node]
 3: 0xcf784e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xcf7bc7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xeaf465  [node]
 6: 0xebf12d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xf222f4 v8::internal::ScavengeJob::Task::RunInternal() [node]
 8: 0xdb59db non-virtual thunk to v8::internal::CancelableTask::Run() [node]
 9: 0xb77524 node::PerIsolatePlatformData::RunForegroundTask(std::unique_ptr<v8::Task, std::default_delete<v8::Task> >) [node]
10: 0xb79389 node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node]
11: 0x15633c6  [node]
12: 0x1575af4  [node]
13: 0x1563d18 uv_run [node]
14: 0xa43dd5 node::SpinEventLoop(node::Environment*) [node]
15: 0xb4bab6 node::NodeMainInstance::Run(node::EnvSerializeInfo const*) [node]
16: 0xacd3f2 node::Start(int, char**) [node]
17: 0x7fb7fe84624a  [/lib/x86_64-linux-gnu/libc.so.6]
18: 0x7fb7fe846305 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
19: 0xa4076c  [node]
Aborted
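
Raising the Node.js heap limit might work around this, although I'd expect it to only postpone the problem if all iterator results are buffered in memory:

$ NODE_OPTIONS="--max-old-space-size=8192" npx @netwerk-digitaal-erfgoed/ld-workbench@latest -p "NA fotocollectie (via SPARQL)" -s 7-000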

Part of my LD Workbench configuration:

 - name: "7-000"
    iterator:
      query: "SELECT * WHERE { ?this <http://www.openarchives.org/ore/terms/isAggregatedBy> <https://archief.nl/doc/7.000spaondntfoto> }"
      #endpoint: https://www.goudatijdmachine.nl/sparql/repositories/nafotocollectie
      endpoint: https://fuseki.coret.org/na-fotocollectie/
      #endpoint: https://service.archief.nl/sparql
      batchSize: 50
    generator: 
      -  query: file://generator.rq
         batchSize: 50
ddeboer commented 1 week ago

These are several different issues combined.

- query: file://generator.rq

Can you share your generator query? Even better, push your config to the configurations repository and link to the dump file that you’re using.

But the LD Workbench generates a SPARQL query which Qlever can't handle (yet) and I don't see how to change the LD Workbench behaviour.

LD Workbench does not generate this query, so it's probably Comunica. Running QLever seems like way too much work, so I'm not going to reproduce this locally. Compare this to a simple oxigraph start.

coret commented 1 week ago

Can you share your generator query? Even better, push your config to the configurations repository and link to the dump file that you’re using.

See https://www.github.com/netwerk-digitaal-erfgoed/ld-workbench-configuration/tree/main/nafotos-sparql-endpoint for the config and https://nde-europeana.ams3.cdn.digitaloceanspaces.com/7-000spaondntfoto.2.zip for a large part of the NA photo collection (for the 7-000 stage).

Compare this to a simple oxigraph start.

Have not tried oxigraph yet, will do!

coret commented 1 week ago

Have not tried oxigraph yet, will do!

Have not tried your code in PR 99 yet, but I tried to start it and import the 2.8 GB N-Triples file directly:

$ docker run --rm -v ./data:/data -p 7878:7878 oxigraph/oxigraph --location /data serve --bind 0.0.0.0:7878
$ curl -f -X POST http://localhost:7878/store?default -H 'Content-Type:application/n-triples' --data-binary "@7-000spaondntfoto.3.nt"
curl: option --data-binary: out of memory
curl: try 'curl --help' or 'curl --manual' for more information

Hope your src/import.ts won't be bothered by the big file size.

ddeboer commented 1 week ago

Have not tried oxigraph yet, will do!

Have not tried your code in PR 99 yet, but I tried to start it and import the 2.8 GB N-Triples file directly:

$ docker run --rm -v ./data:/data -p 7878:7878 oxigraph/oxigraph --location /data serve --bind 0.0.0.0:7878
$ curl -f -X POST http://localhost:7878/store?default -H 'Content-Type:application/n-triples' --data-binary "@7-000spaondntfoto.3.nt"
curl: option --data-binary: out of memory
curl: try 'curl --help' or 'curl --manual' for more information

Hope your src/import.ts won't be bothered by the big file size.

Use curl -T instead, to stream the file rather than loading it into memory all at once. The new import feature is streaming as well; I'm just not sure yet about the best YAML config conventions for it.
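
Something along these lines should do it (untested, based on your command above):

$ curl -f -X POST -H 'Content-Type: application/n-triples' -T 7-000spaondntfoto.3.nt 'http://localhost:7878/store?default'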

ddeboer commented 6 days ago

Your query returns no results, so I cannot test your pipeline. Please provide a ready-to-go reproducer.