stain / jena-docker

Docker image for Apache Jena riot
Apache License 2.0

Update huge data (in jena-fuseki) ~ the recommended way? #74

Open infinite-dao opened 1 year ago

infinite-dao commented 1 year ago

What is the recommended way to import huge amounts of data into an existing SPARQL endpoint?

From my understanding of the README section on load.sh (version Jun 12, 2022), I think it is only meant to import data when creating a dataset, not for updating an existing one. Right?

I tried out successfully:

Using tdbloader2 to update an existing database is something I have not tested yet, because I do not know how to shut down the Fuseki server while keeping the container running. When I killed the Fuseki server process from inside the Docker container, the whole container was shut down, so tdbloader2 could not be reached that way.
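What might work instead (untested; the container name, image name and paths are the ones used further below and are assumptions in this context) is not to kill the process inside the container, but to stop the whole container and run tdbloader2 as the only process in a short-lived helper container that reuses the same volumes:

# sketch, untested: stop the live container (tdbloader2 needs exclusive access anyway),
# then run tdbloader2 in a throwaway container that reuses the volumes of fuseki-app;
# --entrypoint makes sure fuseki-server is never started, so no tdb.lock is held by it
docker stop fuseki-app
docker run --rm -it \
  --volumes-from fuseki-app \
  --entrypoint /jena-fuseki/tdbloader2 \
  stain/jena-fuseki \
  --loc /fuseki/databases/CETAF-IDs /container-path-to-data-backups/backup_20230123_split-aa.nq.gz
docker start fuseki-app   # bring the endpoint back afterwards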

After trying out several possibilities (with a public SPARQL endpoint that had to stay intact), the safest way to recreate an entire dataset could be, in theory:

Recreate all data from a backup

I had to recreate the entire dataset because of a data indexing error (see the lists.apache.org thread “MOVE GRAPH … TO GRAPH … ~  Server Error 500 Iterator: started at 5, now 6 (SPARQL Update)”). One solution I came up with is to use the tdbloader2 wrapper to create the initial data structure. In theory it should also work on an existing data structure, but only with the Fuseki server switched off, and switching off the Fuseki server while keeping the Docker container running is exactly the problem.

In my case I had to reimport the entire backup of 8 GB of gzipped triple data (about 1,300,000,000 triples) to recreate the data structure properly. Importing the whole 8 GB of zipped data in one go failed, but splitting the triple data into smaller pieces succeeded. What I did the wrong way, after making the backup first, was to delete the public SPARQL endpoint (the intention was to delete the data structure, but it would have been better to keep the SPARQL endpoint running on some data until the update or recreation was finished). What I did was:

  1. back up all data using the Fuseki UI (8 GB gzipped, about 1,300,000,000 triples; see the sketch right after this list for a command-line alternative)
  2. split the backup file (the backup was split into equal sizes of ~4 GB of unzipped data, resulting in 52 unzipped triple data files)
  3. I did this wrong: intentionally delete the SPARQL endpoint together with the old data
  4. let the wrapper tdbloader2 run on these split files; it took 4 days 7 hours to complete all indexing of 1.3 bn triples
  5. create the corresponding dataset name (persistent TDB2 dataset, i.e. actually the database configuration file *.ttl) with the Fuseki UI, which links the data to the SPARQL endpoint (see also the comments on step 5 below)
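
For step 1, a command-line alternative to the Fuseki UI could be the server's administration protocol (a sketch under the assumption that Fuseki is reachable on localhost:3030 and the dataset is named CETAF-IDs):

# sketch (assumption: Fuseki listens on localhost:3030): trigger a server-side
# backup of the dataset via the admin protocol; the gzipped N-Quads file is
# written to the server's backups/ directory
curl -X POST 'http://localhost:3030/$/backup/CETAF-IDs'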

Comments on step 2 (splitting):

BACKUPPATH=/opt/jena-fuseki/import-sandbox/backups

cd "${BACKUPPATH}"

# split the real backup data into unzipped 4 GB partial files
# append the extension ….nq to each part
# and compress the split files again
gunzip < backup_20230123.nq.gz | \
  split --line-bytes=4G \
  --additional-suffix=.nq \
  --filter='echo -n compress $FILE; gzip --verbose > $FILE.gz' \
  - backup_20230123_split-
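
A quick optional sanity check of the split (assuming the file names above): since N-Quads is a line-based format, the total line count over all split files should match the line count of the original backup:

# optional sanity check: both counts should be identical
zcat backup_20230123.nq.gz | wc -l
zcat backup_20230123_split-*.nq.gz | wc -l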

Comments on step 4 (tdbloader2):

The setup was something like the one in https://github.com/stain/jena-docker/issues/70#issuecomment-1380536645, and the command to run tdbloader2 was something like:

DATASET=CETAF-IDs
# in the docker container “fuseki-app” the directory /fuseki/databases/${DATASET} was emptied beforehand

# run the wrapper tdbloader2 on the running docker container fuseki-app
docker exec -it fuseki-app  /bin/bash -c "/jena-fuseki/tdbloader2 --loc /fuseki/databases/${DATASET} \
 /container-path-to-data-backups/{backup_20230123_split-aa.nq.gz,backup_20230123_split-ab.nq.gz\
,backup_20230123_split-ac.nq.gz,backup_20230123_split-ad.nq.gz\
,backup_20230123_split-ae.nq.gz,backup_20230123_split-af.nq.gz\
,backup_20230123_split-ag.nq.gz,backup_20230123_split-ah.nq.gz\
,backup_20230123_split-ai.nq.gz,backup_20230123_split-aj.nq.gz\
,backup_20230123_split-ak.nq.gz,backup_20230123_split-al.nq.gz\
,backup_20230123_split-am.nq.gz,backup_20230123_split-an.nq.gz\
,backup_20230123_split-ao.nq.gz,backup_20230123_split-ap.nq.gz\
,backup_20230123_split-aq.nq.gz,backup_20230123_split-ar.nq.gz\
,backup_20230123_split-as.nq.gz,backup_20230123_split-at.nq.gz\
,backup_20230123_split-au.nq.gz,backup_20230123_split-av.nq.gz\
,backup_20230123_split-aw.nq.gz,backup_20230123_split-ax.nq.gz\
,backup_20230123_split-ay.nq.gz,backup_20230123_split-az.nq.gz\
,backup_20230123_split-ba.nq.gz,backup_20230123_split-bb.nq.gz\
,backup_20230123_split-bc.nq.gz,backup_20230123_split-bd.nq.gz\
,backup_20230123_split-be.nq.gz,backup_20230123_split-bf.nq.gz\
,backup_20230123_split-bg.nq.gz,backup_20230123_split-bh.nq.gz\
,backup_20230123_split-bi.nq.gz,backup_20230123_split-bj.nq.gz\
,backup_20230123_split-bk.nq.gz,backup_20230123_split-bl.nq.gz\
,backup_20230123_split-bm.nq.gz,backup_20230123_split-bn.nq.gz\
,backup_20230123_split-bo.nq.gz,backup_20230123_split-bp.nq.gz\
,backup_20230123_split-bq.nq.gz,backup_20230123_split-br.nq.gz\
,backup_20230123_split-bs.nq.gz,backup_20230123_split-bt.nq.gz\
,backup_20230123_split-bu.nq.gz,backup_20230123_split-bv.nq.gz\
,backup_20230123_split-bw.nq.gz,backup_20230123_split-bx.nq.gz\
,backup_20230123_split-by.nq.gz,backup_20230123_split-bz.nq.gz}" \
 | tee --append /opt/jena-fuseki/import-sandbox/import_backup_${DATASET}_$(date '+%Y%m%d-%H%M')_split.nq.gz.log
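
Comments on step 5 (dataset creation):

I created the dataset through the Fuseki UI. Presumably (an assumption, not tested in this setup) the same configuration file could also be created via Fuseki's administration protocol; host and port are placeholders:

# sketch (assumption): create the persistent TDB2 dataset via the admin
# protocol instead of the Fuseki UI; this writes the configuration *.ttl on the
# server and links the existing database directory to a SPARQL endpoint
curl -X POST 'http://localhost:3030/$/datasets' \
  --data 'dbName=CETAF-IDs&dbType=tdb2'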
infinite-dao commented 1 year ago

Question

So in the end the question remains: what is the recommended way to import huge amounts of data into an existing SPARQL endpoint?

One critical point, and the crux of the matter, is also:

infinite-dao commented 1 year ago

Just to note that it was not possible to update data in a currently running Docker container (as one might perhaps assume from reading README.md). The example I tried out was on a test server with the following setup:

docker exec -it conftest-fuseki-app  /bin/bash -c '/jena-fuseki/tdbloader2 \
  --loc /fuseki/databases/CETAF-IDs-graphs-mixed-and-default  \
  /import-data/rdf/BGBM/Test_Thread-09_herbarium.bgbm.org_20221110-1227_normalized_with_GRAPH.nq.gz'
# org.apache.jena.dboe.DBOpEnvException: Failed to get a lock: file='/fuseki/databases/CETAF-IDs-graphs-mixed-and-default/tdb.lock': held by process 10

So the lock file was in the way, which is all right because, as I have understood it from reading, only one process can access the database at a time: either tdbloader, tdbloader2, or the public SPARQL endpoint (the fuseki-server). The command that starts the server is executed from the Dockerfile (/jena-fuseki/fuseki-server), which comes from the original Fuseki tar.gz file.
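
As far as I understand, the lock holder can also be inspected directly (paths as in the failed command above): tdb.lock simply contains the PID of the process that currently owns the TDB2 database, here the fuseki-server process started as the container's main process:

# tdb.lock holds the PID of the owning process; here it should print 10,
# matching the "held by process 10" in the DBOpEnvException above
docker exec conftest-fuseki-app cat /fuseki/databases/CETAF-IDs-graphs-mixed-and-default/tdb.lock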