The Europeana pipeline is a continuation of the LOD aggregator project. The goal is to investigate whether a linked data pipeline can be designed that harvests linked data from different sources, converts the information to the Europeana Data Model (EDM), and makes ingestion into the Europeana harvesting platform possible. This project expands the LOD aggregator and investigates how the pipeline could work in an automated CI/CD process.
The architecture of the pipeline has been designed with modularity in mind, using as much standardized and readily available tooling as possible to reduce the amount of custom-built code. The architecture of the pipeline is shown in figure 1.
For a generic transformation process the following tasks are performed (all of these steps are also numbered in figure 1):
Steps 1-4 and 6 have been written with RATT, a TypeScript library designed for handling large data transformations. Please see the RATT online documentation for details.
In order to run or develop the pipeline and publish datasets to an online data catalog, the pipeline must first be configured. This is done with the following steps:
If you haven't configured your environment for running TypeScript yet, first install Node.js and Yarn. Then install the Node.js dependencies by running the `yarn` command. Finally, run the `yarn build` command to transpile the TypeScript files in the `src/` directory into JavaScript files. This concludes the TypeScript part of the setup.
Besides TypeScript, the pipeline also makes use of a Java JAR. For active development on the Java part of the repository you will need to install Java and Maven. When Java and Maven have been properly set up, you can compile the JAR by moving to the `crawler` directory and executing:

```shell
mvn --quiet -e -f ./pom.xml clean assembly:assembly
```

This builds the executable JAR in the repository.
To develop and run the pipeline locally, the correct environment variables have to be set. Four environment variables are used, of which `LOCAL_QUERY` is optional:

- `SOURCE_DATASET`: The IRI that denotes the source dataset in the NDE Dataset Registry.
- `DESTINATION_DATASET`: The name of the dataset in TriplyDB. The name must only use alphanumeric characters and hyphens (`-`).
- `LOCAL_QUERY`: The location where the transformation can be found. Note that this is only used when the source dataset does not include a transformation query in its metadata.
- `TRIPLYDB_TOKEN`: The default TriplyDB token; it should be aligned with the host TriplyDB instance. The token must have at least read and write access. To create your API token you can follow the guidelines to create and configure an API token.

The `SOURCE_DATASET`, `DESTINATION_DATASET` and `LOCAL_QUERY` variables can also be set in `configuration.tsv`, which contains all NDE Dataset Registry datasets that can be transformed to the Europeana linked data format (see Section 4.1 for expanding the configuration file). Only local development from the command line makes use of the environment variables; the docker containers always use `configuration.tsv`.
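As a quick illustrative check of the `DESTINATION_DATASET` naming rule above, consider this small TypeScript sketch (the helper name `isValidDatasetName` is ours, not part of the pipeline):

```typescript
// Hypothetical helper: a TriplyDB dataset name may only contain
// alphanumeric characters and hyphens, per the rule above.
function isValidDatasetName(name: string): boolean {
  return /^[A-Za-z0-9-]+$/.test(name);
}

console.log(isValidDatasetName("europeana-edm")); // true
console.log(isValidDatasetName("no spaces here")); // false
```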
The pipeline is configured to run both from a docker container and outside of docker from the command line. Running the pipeline from the command line instead of the docker containers greatly improves debuggability and makes it easier to test a single dataset. To run the pipeline from the command line, the environment variables described above need to be set.
The first steps of the pipeline are written in TypeScript but executed as JavaScript. The following command transpiles your TypeScript code into the corresponding JavaScript code:

```shell
yarn build
```

If you do not want to repeatedly run `yarn build`, the following command performs transpilation automatically whenever one or more TypeScript files change:

```shell
yarn dev
```
The pipeline step developed in Java needs to be compiled as well. This can be done by entering the `crawler` directory from the command line and executing:

```shell
mvn --quiet -e -f ./pom.xml clean assembly:assembly
```

This compiles the Java code into an executable JAR.
Set up the generic environment variables with:

```shell
./.envrc
```

This creates a `.envrc-private` file in the root directory. Edit this script to set the variables for your environment. To upload the converted data to a TriplyDB instance, or to convert larger datasets (>20 MB), set a TriplyDB API key through `TRIPLYDB_TOKEN`. The `.envrc-private` script also has placeholders for the NDE Datasetregister variables (URL and query) and the EDM SHACL validation script; these have defaults in the code but can be overridden by setting the environment variables.

Each time you run the pipeline, the `.envrc-private` script must be sourced:

```shell
source ./.envrc-private
```
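For illustration, a `.envrc-private` might end up looking like this (every value below is a placeholder; only the variable names come from this section):

```shell
# Placeholder values only; replace each with your own settings.
export TRIPLYDB_TOKEN="<your-api-token>"
export SOURCE_DATASET="<IRI-of-the-source-dataset>"
export DESTINATION_DATASET="<dataset-name-in-triplydb>"
export LOCAL_QUERY="<path-to-transformation-query>"   # optional
```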
The CI/CD setup is still under development (see below). For single runs the `static/scripts/runall.sh` script can be used; a script with environment variables for each datasource must be provided. See `static/scripts/env-example` for more information.
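Conceptually, a batch run iterates over per-dataset settings and invokes the pipeline once per dataset. A minimal, self-contained sketch of that loop (a dry run that only prints the invocations; the actual `static/scripts/runall.sh` may work differently):

```shell
# Hypothetical batch sketch: write a sample per-dataset table, then
# print the pipeline invocation for each row instead of executing it.
printf 'SOURCE_DATASET\tDESTINATION_DATASET\n' > /tmp/datasets.tsv
printf 'https://example.org/id/dataset/ds1\tds1-edm\n' >> /tmp/datasets.tsv

invocations=$(tail -n +2 /tmp/datasets.tsv |
  while IFS=$'\t' read -r src dest; do
    echo "SOURCE_DATASET=${src} DESTINATION_DATASET=${dest} yarn ratt ./lib/main.js"
  done)
echo "$invocations"
```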
The following command runs the first part of the pipeline:

```shell
yarn ratt ./lib/main.js
```

This command retrieves the dataset metadata for the dataset denoted by the `SOURCE_DATASET` environment variable. It then looks for a viable SPARQL endpoint and constructs the EDM-mapped linked data from that endpoint. Finally, the pipeline validates the linked data and creates the local linked data file, ready to be transformed by the JAR executable.
To run rdf2edm locally we need to move to the correct directory to run the JAR, which is executed by the bash script on the second line:

```shell
cd crawler/target/
./cc-lod-crawler-DockerApplication/rdf2edm-local.sh
```

The bash script runs the JAR that transforms the linked data file into the format Europeana can ingest.
The following command runs the final part of the pipeline: moving back to the main folder and running the after-hook, which uploads the XML assets and the linked data:

```shell
cd ../../
yarn ratt ./lib/after.js
```

This uploads the linked data to the `DESTINATION_DATASET` on the instance where the `TRIPLYDB_TOKEN` was created. The EDM XML files are uploaded to the same `DESTINATION_DATASET` as a zipped asset.
Sometimes you want to run the entire pipeline in a single command. To do so, you can copy-paste the following set of commands:

```shell
yarn ratt ./lib/main.js && \
cd crawler/target/ && \
./cc-lod-crawler-DockerApplication/rdf2edm-local.sh && \
cd ../../ && \
yarn ratt ./lib/after.js
```
Running the pipeline in docker containers reduces the amount of configuration needed and more closely resembles the pipeline as it runs in the GitLab CI/CD. To run the pipeline, first build the docker images:
Building the docker image for the edm-conversion (steps 1 through 4 and 6):

```shell
docker build -f ./config/docker/Dockerfile -t edm-conversie-etl .
```

Building the docker image for the rdf2edm-conversion (step 5):

```shell
docker build -f ./crawler/Dockerfile -t edm-conversie-crawler .
```
Let's run the docker container containing the first four steps. Volumes are shared between the different containers, so make sure that the volume passed with `-v` is the same for all containers.
```shell
docker run --rm \
  -v /scratch/edm-conversie-project-acceptance:/home/triply/data \
  -e TRIPLYDB_TOKEN=${TRIPLYDB_TOKEN} \
  -e LOCAL_QUERY=${LOCAL_QUERY} \
  -e SOURCE_DATASET=${SOURCE_DATASET} \
  -e DESTINATION_DATASET=${DESTINATION_DATASET} \
  -e MODE=acceptance \
  --name edm-conversie-project-acceptance \
  edm-conversie-etl \
  ./config/runEtl.sh main
```
Next, run the rdf2edm container (step 5); note that it mounts the `rdf` subdirectory of the shared volume:

```shell
docker run --rm \
  -v /scratch/edm-conversie-project-acceptance/rdf:/data \
  -e LOCAL_QUERY=${LOCAL_QUERY} \
  -e SOURCE_DATASET=${SOURCE_DATASET} \
  -e DESTINATION_DATASET=${DESTINATION_DATASET} \
  -e MODE=acceptance \
  --name edm-conversie-project-acceptance \
  edm-conversie-crawler \
  ./rdf2edm.sh
```
Finally, run the after-hook container to upload the results:

```shell
docker run --rm \
  -v /scratch/edm-conversie-project-acceptance:/home/triply/data \
  -e TRIPLYDB_TOKEN=${TRIPLYDB_TOKEN} \
  -e LOCAL_QUERY=${LOCAL_QUERY} \
  -e SOURCE_DATASET=${SOURCE_DATASET} \
  -e DESTINATION_DATASET=${DESTINATION_DATASET} \
  -e MODE=acceptance \
  --name edm-conversie-project-acceptance \
  edm-conversie-etl \
  ./config/runEtl.sh after
```
The `gitlab-ci.yml` file contains the necessary instructions for the GitLab CI to run the docker images in the CI/CD pipeline. The file contains the build and execution procedures for both docker images, and for running the images in acceptance or production mode.
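As a rough illustration only, a job in such a file might take the following shape (job names, stages, and scripts here are hypothetical, not the repository's actual configuration):

```yaml
# Hypothetical sketch of a build-and-run job pair in acceptance mode.
build-etl:
  stage: build
  script:
    - docker build -f ./config/docker/Dockerfile -t edm-conversie-etl .

run-etl-acceptance:
  stage: run
  script:
    - docker run --rm -e MODE=acceptance edm-conversie-etl ./config/runEtl.sh main
```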
To accommodate transforming multiple datasets in a single go, the repository also has a `configuration.tsv` file. This tab-delimited file contains the different variables that need to be set per dataset and can easily be expanded. At the moment the TSV has the following headers: `SOURCE_DATASET`, `DESTINATION_DATASET`, `LOCAL_QUERY`, corresponding to the environment variables that would otherwise need to be set. The CI/CD and the local docker images use the configuration file if the header variables have not been set. To run the docker images locally with the `configuration.tsv` file, remove the environment variables (`SOURCE_DATASET`, `DESTINATION_DATASET`, `LOCAL_QUERY`) from the docker commands; the docker images will then automatically use the `configuration.tsv` file.
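For illustration, a `configuration.tsv` with two datasets could look as follows (tab-separated; all values are made up):

```
SOURCE_DATASET	DESTINATION_DATASET	LOCAL_QUERY
https://example.org/id/dataset/museum-a	museum-a-edm	queries/museum-a.rq
https://example.org/id/dataset/museum-b	museum-b-edm
```

Note that `LOCAL_QUERY` may be left empty when the source dataset includes a transformation query in its metadata.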
Every ETL is able to run in at least two modes: acceptance and production. By default, ETLs run in acceptance mode; they must be specifically configured to run in production mode.
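The default can be expressed with a standard shell parameter-expansion fallback, sketched here as an illustration of the convention (not the pipeline's actual code):

```shell
# Fall back to acceptance mode unless MODE was set explicitly,
# e.g. via `-e MODE=production` on `docker run`.
unset MODE
MODE="${MODE:-acceptance}"
echo "$MODE"   # prints: acceptance
```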