Open infinite-dao opened 1 year ago
So in the end the question remains: What is the recommended way to import huge data into an existing SPARQL endpoint (e.g. with `tdbloader2`)? One critical point, and the crux of the matter:
Just to note: it was not possible to update data on a currently running docker container (as one could perhaps assume from reading README.md). The example I tried was on a test server with this setup:

- `conftest-fuseki-data` (holding the data)
- `conftest-fuseki-app` (running the server; volumes were linked similar to the setup in comment https://github.com/stain/jena-docker/issues/70#issuecomment-1380536645)

```shell
docker exec -it conftest-fuseki-app /bin/bash -c '/jena-fuseki/tdbloader2 \
  --loc /fuseki/databases/CETAF-IDs-graphs-mixed-and-default \
  /import-data/rdf/BGBM/Test_Thread-09_herbarium.bgbm.org_20221110-1227_normalized_with_GRAPH.nq.gz'
# org.apache.jena.dboe.DBOpEnvException: Failed to get a lock: file='/fuseki/databases/CETAF-IDs-graphs-mixed-and-default/tdb.lock': held by process 10
```
So the lock file was in the way, which is alright because, as I understand it from reading, only one process can access the database at a time: either `tdbloader`, `tdbloader2`, or the public SPARQL endpoint (= `fuseki-server`). The command that starts the server is executed from the Dockerfile

→ `/jena-fuseki/fuseki-server`

which comes from the original Fuseki tar.gz file.
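Incidentally, the lock can be inspected by hand: as the error message above suggests, `tdb.lock` holds the PID of the process owning the database. A minimal sketch (the database path is the one from this issue; `check_tdb_lock` is a helper name I made up):

```shell
# Sketch: inspect a TDB lock file before attempting a bulk load.
# tdb.lock stores the PID of the process holding the database
# (cf. "held by process 10" in the error above).
check_tdb_lock() {
  db="$1"
  if [ -f "$db/tdb.lock" ]; then
    pid=$(cat "$db/tdb.lock")
    # kill -0 sends no signal; it only tests whether the process exists
    if kill -0 "$pid" 2>/dev/null; then
      echo "locked by running process $pid"
    else
      echo "stale lock from process $pid"
    fi
  else
    echo "no lock"
  fi
}

# Example, using the database path from this issue:
check_tdb_lock /fuseki/databases/CETAF-IDs-graphs-mixed-and-default
```

If the lock is held by a live process (here: `fuseki-server`), running `tdbloader`/`tdbloader2` against the same location will fail as shown above.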
**What is the recommended way to import huge data into an existing SPARQL endpoint?**
From my understanding of the README with `load.sh` (version Jun 12, 2022), I think this is meant to import data only at creation time, not for updating. Right?

I tried out successfully:

- installing `ruby-full` inside the docker container to use `s-post` (appending data), `s-put` (overwriting everything) and so on: this works on a running Fuseki server, but it is slow; I recommend importing smaller triple files, e.g. 50-200 MB uncompressed, depending on processing power
- the `tdbloader2` wrapper for creating a dataset structure: I used split files of 4 GB uncompressed triple data; it is much faster, but it is complicated to make sure the Fuseki server is decoupled from the data set being worked on

Using `tdbloader2` to update an existing database I have not tested yet, and I do not know how to switch off the Fuseki server while keeping the container running. When I killed the Fuseki server process from within the docker container, the whole container was shut down, so it was not possible to reach `tdbloader2` that way.

After trying out several possibilities, with a public SPARQL endpoint that has to stay intact, the safest recreation of an entire data set could be (in theory):
use `tdbloader2` and create everything anew, but at a different database location (`tdbloader2 --loc '…'`).

**Recreate all backup data**
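The "different database location" idea could look like this in practice: load into a sibling directory while the server still serves the old data, then swap with only a brief downtime. Untested here, and the paths, dataset name `ds`, and the `swap_db` helper are all hypothetical:

```shell
# Sketch: build the new TDB database beside the live one, then swap.
# While tdbloader2 fills the new location, the Fuseki server keeps
# serving the old database untouched.
swap_db() {
  db="$1"   # live database directory
  new="$2"  # freshly loaded database directory
  mv "$db" "${db}-old" && mv "$new" "$db"
}

# 1. load into the new location (server keeps running on the old one):
#      tdbloader2 --loc /fuseki/databases/ds-new split_*.nq.gz
# 2. stop the server briefly, swap, start it again:
#      docker stop fuseki-app
#      swap_db /fuseki/databases/ds /fuseki/databases/ds-new
#      docker start fuseki-app
```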
I had to recreate the entire data set because of a data indexing error (see the lists.apache.org thread "MOVE GRAPH … TO GRAPH … ~ Server Error 500 Iterator: started at 5, now 6 (SPARQL Update)"), and one solution I came up with is to use the `tdbloader2` wrapper for creating the first data structure (in theory it should also work on an existing data structure, but with the Fuseki server switched off, and getting the Fuseki server switched off while keeping the docker container running is exactly the problem).

In my case I had to reimport the entire backup of 8 GB zipped triple data (about 1,300,000,000 triples) to recreate the data structure properly. Importing the whole 8 GB of zipped data in one go failed, but splitting the triple data into smaller pieces succeeded. What I did the wrong way, after backing up the data first, was to delete the public SPARQL endpoint (the intention was to delete the data structure, but it would be better to keep the SPARQL endpoint running on some data until the update or new creation is finished). What I did was:

1. …
2. split the triple data into smaller pieces
3. …
4. run `tdbloader2` on these split files; it took 4 days 7 hours to complete all indexing of 1.3 bn triples

Comments to step 2. (splitting):
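The splitting itself needs only standard tools: N-Quads is line-based, so any split on line boundaries yields valid N-Quads files. A sketch (file names and the tiny chunk size are made up for the demonstration, something like `--lines=2000000` would fit real data, and GNU `split` is assumed for `--additional-suffix`):

```shell
# Sketch: split a gzipped N-Quads file into smaller, valid N-Quads chunks.
# A toy 5-quad file stands in for the real ~8 GB backup.
printf '<urn:s> <urn:p> "%s" <urn:g> .\n' 1 2 3 4 5 | gzip > backup.nq.gz

# Split on line boundaries; each chunk is itself a valid N-Quads file.
zcat backup.nq.gz | split --lines=2 --numeric-suffixes \
  --additional-suffix=.nq - chunk_

ls chunk_*.nq   # chunk_00.nq chunk_01.nq chunk_02.nq
gzip chunk_*.nq # optional: the loaders also read .nq.gz directly
```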
Comments to step 4. (`tdbloader2`):

The setup was something like https://github.com/stain/jena-docker/issues/70#issuecomment-1380536645, i.e.

- a data container (`fuseki-data`)
- an app container (`fuseki-app`) with linked volumes for running the Fuseki server

And the command to run `tdbloader2` was something like:
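The code block appears to have been lost from this copy; presumably it was the same `docker exec` call as at the top of this issue, with the container names from the list above. A sketch with a `DRY_RUN` guard I added so the command is only printed for inspection (set `DRY_RUN=0` to execute it, and stop `fuseki-server` first, otherwise `tdbloader2` fails on `tdb.lock`):

```shell
# Sketch: presumably the same call as shown earlier in this issue.
# DRY_RUN=1 only prints the command instead of running it.
DRY_RUN=${DRY_RUN:-1}

CMD="docker exec -it fuseki-app /bin/bash -c '/jena-fuseki/tdbloader2 \
  --loc /fuseki/databases/CETAF-IDs-graphs-mixed-and-default \
  /import-data/rdf/BGBM/Test_Thread-09_herbarium.bgbm.org_20221110-1227_normalized_with_GRAPH.nq.gz'"

if [ "$DRY_RUN" = 1 ]; then
  echo "$CMD"
else
  eval "$CMD"
fi

# Alternative (untested): sidestep the exec-into-running-container problem
# by loading from a throwaway container that shares the data volumes:
#   docker stop fuseki-app
#   docker run --rm --volumes-from fuseki-data stain/jena-fuseki \
#     /jena-fuseki/tdbloader2 --loc /fuseki/databases/... file.nq.gz
#   docker start fuseki-app
```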