stain / jena-docker

Docker image for Apache Jena riot
Apache License 2.0

Update huge data (in jena-fuseki) ~ the recommended way? #74

Open infinite-dao opened 1 year ago

infinite-dao commented 1 year ago

What is the recommended way to import huge amounts of data into an existing SPARQL endpoint?

From my understanding of the README section on load.sh (version Jun 12, 2022), I think it is only meant to import data when creating a dataset, not for updating an existing one. Right?

I tried out successfully:

Using tdbloader2 to update an existing database is something I have not tested yet, because I do not know how to shut down the Fuseki server while keeping the container running. When I killed the Fuseki server process from inside the Docker container, the whole container was shut down, so tdbloader2 could not be reached that way.
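What might work instead (untested; the container name, image name and paths are the ones used further below and are assumptions in this context) is not to kill the process inside the container, but to stop the whole container and run tdbloader2 as the only process in a short-lived helper container that reuses the same volumes:

# sketch, untested: stop the live container (tdbloader2 needs exclusive access anyway),
# then run tdbloader2 in a throwaway container that reuses the volumes of fuseki-app;
# --entrypoint makes sure fuseki-server is never started, so no tdb.lock is held by it
docker stop fuseki-app
docker run --rm -it \
  --volumes-from fuseki-app \
  --entrypoint /jena-fuseki/tdbloader2 \
  stain/jena-fuseki \
  --loc /fuseki/databases/CETAF-IDs /container-path-to-data-backups/backup_20230123_split-aa.nq.gz
docker start fuseki-app   # bring the endpoint back afterwards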

After trying out several possibilities (with a public SPARQL endpoint that had to stay intact), the safest way to recreate an entire dataset could be, in theory:

Recreate all data from a backup

I had to recreate the entire dataset because of a data indexing error (see the lists.apache.org thread “MOVE GRAPH … TO GRAPH … ~  Server Error 500 Iterator: started at 5, now 6 (SPARQL Update)”). One solution I came up with is to use the tdbloader2 wrapper to create the initial data structure. In theory it should also work on an existing data structure, but only with the Fuseki server switched off, and switching off the Fuseki server while keeping the Docker container running is exactly the problem.

In my case I had to reimport the entire backup of 8 GB of gzipped triple data (about 1,300,000,000 triples) to recreate the data structure properly. Importing the whole 8 GB of zipped data in one go failed, but splitting the triple data into smaller pieces succeeded. What I did the wrong way, after making the backup first, was to delete the public SPARQL endpoint (the intention was to delete the data structure, but it would have been better to keep the SPARQL endpoint running on some data until the update or recreation was finished). What I did was:

  1. back up all data using the Fuseki UI (8 GB gzipped, about 1,300,000,000 triples; see the sketch right after this list for a command-line alternative)
  2. split the backup file (the backup was split into equal sizes of ~4 GB of unzipped data, resulting in 52 unzipped triple data files)
  3. I did this wrong: intentionally delete the SPARQL endpoint together with the old data
  4. let the wrapper tdbloader2 run on these split files; it took 4 days 7 hours to complete all indexing of 1.3 bn triples
  5. create the corresponding dataset name (persistent TDB2 dataset, i.e. actually the database configuration file *.ttl) with the Fuseki UI, which links the data to the SPARQL endpoint (see also the comments on step 5 below)
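
For step 1, a command-line alternative to the Fuseki UI could be the server's administration protocol (a sketch under the assumption that Fuseki is reachable on localhost:3030 and the dataset is named CETAF-IDs):

# sketch (assumption: Fuseki listens on localhost:3030): trigger a server-side
# backup of the dataset via the admin protocol; the gzipped N-Quads file is
# written to the server's backups/ directory
curl -X POST 'http://localhost:3030/$/backup/CETAF-IDs'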

Comments on step 2 (splitting):

BACKUPPATH=/opt/jena-fuseki/import-sandbox/backups

cd "${BACKUPPATH}"

# split the real backup data into unzipped 4 GB partial files
# append the extension ….nq to each part
# and compress the split files again
gunzip < backup_20230123.nq.gz | \
  split --line-bytes=4G \
  --additional-suffix=.nq \
  --filter='echo -n compress $FILE; gzip --verbose > $FILE.gz' \
  - backup_20230123_split-
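
A quick optional sanity check of the split (assuming the file names above): since N-Quads is a line-based format, the total line count over all split files should match the line count of the original backup:

# optional sanity check: both counts should be identical
zcat backup_20230123.nq.gz | wc -l
zcat backup_20230123_split-*.nq.gz | wc -l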

Comments on step 4 (tdbloader2):

The setup was something like the one in https://github.com/stain/jena-docker/issues/70#issuecomment-1380536645, and the command to run tdbloader2 was something like:

DATASET=CETAF-IDs
# in the docker container “fuseki-app” the directory /fuseki/databases/${DATASET} was emptied beforehand

# run the wrapper tdbloader2 on the running docker container fuseki-app
docker exec -it fuseki-app  /bin/bash -c "/jena-fuseki/tdbloader2 --loc /fuseki/databases/${DATASET} \
 /container-path-to-data-backups/{backup_20230123_split-aa.nq.gz,backup_20230123_split-ab.nq.gz\
,backup_20230123_split-ac.nq.gz,backup_20230123_split-ad.nq.gz\
,backup_20230123_split-ae.nq.gz,backup_20230123_split-af.nq.gz\
,backup_20230123_split-ag.nq.gz,backup_20230123_split-ah.nq.gz\
,backup_20230123_split-ai.nq.gz,backup_20230123_split-aj.nq.gz\
,backup_20230123_split-ak.nq.gz,backup_20230123_split-al.nq.gz\
,backup_20230123_split-am.nq.gz,backup_20230123_split-an.nq.gz\
,backup_20230123_split-ao.nq.gz,backup_20230123_split-ap.nq.gz\
,backup_20230123_split-aq.nq.gz,backup_20230123_split-ar.nq.gz\
,backup_20230123_split-as.nq.gz,backup_20230123_split-at.nq.gz\
,backup_20230123_split-au.nq.gz,backup_20230123_split-av.nq.gz\
,backup_20230123_split-aw.nq.gz,backup_20230123_split-ax.nq.gz\
,backup_20230123_split-ay.nq.gz,backup_20230123_split-az.nq.gz\
,backup_20230123_split-ba.nq.gz,backup_20230123_split-bb.nq.gz\
,backup_20230123_split-bc.nq.gz,backup_20230123_split-bd.nq.gz\
,backup_20230123_split-be.nq.gz,backup_20230123_split-bf.nq.gz\
,backup_20230123_split-bg.nq.gz,backup_20230123_split-bh.nq.gz\
,backup_20230123_split-bi.nq.gz,backup_20230123_split-bj.nq.gz\
,backup_20230123_split-bk.nq.gz,backup_20230123_split-bl.nq.gz\
,backup_20230123_split-bm.nq.gz,backup_20230123_split-bn.nq.gz\
,backup_20230123_split-bo.nq.gz,backup_20230123_split-bp.nq.gz\
,backup_20230123_split-bq.nq.gz,backup_20230123_split-br.nq.gz\
,backup_20230123_split-bs.nq.gz,backup_20230123_split-bt.nq.gz\
,backup_20230123_split-bu.nq.gz,backup_20230123_split-bv.nq.gz\
,backup_20230123_split-bw.nq.gz,backup_20230123_split-bx.nq.gz\
,backup_20230123_split-by.nq.gz,backup_20230123_split-bz.nq.gz}" \
 | tee --append /opt/jena-fuseki/import-sandbox/import_backup_${DATASET}_$(date '+%Y%m%d-%H%M')_split.nq.gz.log
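
Comments on step 5 (dataset creation):

I created the dataset through the Fuseki UI. Presumably (an assumption, not tested in this setup) the same configuration file could also be created via Fuseki's administration protocol; host and port are placeholders:

# sketch (assumption): create the persistent TDB2 dataset via the admin
# protocol instead of the Fuseki UI; this writes the configuration *.ttl on the
# server and links the existing database directory to a SPARQL endpoint
curl -X POST 'http://localhost:3030/$/datasets' \
  --data 'dbName=CETAF-IDs&dbType=tdb2'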
infinite-dao commented 1 year ago

Question

So in the end the question remains: what is the recommended way to import huge amounts of data into an existing SPARQL endpoint?

One critical point, and the crux of the matter, is also:

infinite-dao commented 1 year ago

Just to note that it was not possible to update data in a currently running Docker container (as one might perhaps assume from reading README.md). The example I tried out was on a test server with the following setup:

docker exec -it conftest-fuseki-app  /bin/bash -c '/jena-fuseki/tdbloader2 \
  --loc /fuseki/databases/CETAF-IDs-graphs-mixed-and-default  \
  /import-data/rdf/BGBM/Test_Thread-09_herbarium.bgbm.org_20221110-1227_normalized_with_GRAPH.nq.gz'
# org.apache.jena.dboe.DBOpEnvException: Failed to get a lock: file='/fuseki/databases/CETAF-IDs-graphs-mixed-and-default/tdb.lock': held by process 10

So the lock file was in the way, which is all right because, as I have understood it from reading, only one process can access the database at a time: either tdbloader, tdbloader2, or the public SPARQL endpoint (the fuseki-server). The command that starts the server is executed from the Dockerfile (/jena-fuseki/fuseki-server), which comes from the original Fuseki tar.gz file.
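
As far as I understand, the lock holder can also be inspected directly (paths as in the failed command above): tdb.lock simply contains the PID of the process that currently owns the TDB2 database, here the fuseki-server process started as the container's main process:

# tdb.lock holds the PID of the owning process; here it should print 10,
# matching the "held by process 10" in the DBOpEnvException above
docker exec conftest-fuseki-app cat /fuseki/databases/CETAF-IDs-graphs-mixed-and-default/tdb.lock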