Reason (Why?)
Right now the ETL consists of the xml-import and the elasticsearch-upload steps. To have an automated ETL, these steps need to be bundled into one process.
Solution (What?)
First clean up the etl directory tree. It contains many unneeded files left over from the openartbrowser project, which can be removed. Afterwards the etl directory tree can be flattened.
Create an etl-setup.sh script that sets up and installs the required Python environment:
Create a virtual environment: virtualenv venv
Install the Python packages from requirements.txt into the virtual environment
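A minimal sketch of what etl-setup.sh could look like. The issue mentions `virtualenv venv`; this sketch uses Python's built-in venv module as an equivalent that needs no extra install, and the directory name is taken from the issue text:

```shell
#!/usr/bin/env bash
# etl-setup.sh -- sketch of the setup script (details are assumptions)
set -euo pipefail

VENV_DIR="venv"

# Create the virtual environment only if it does not exist yet
if [ ! -d "$VENV_DIR" ]; then
    python3 -m venv "$VENV_DIR"
fi

# Install the pinned python packages into the virtual environment
if [ -f requirements.txt ]; then
    "$VENV_DIR/bin/pip" install -r requirements.txt
else
    echo "warning: requirements.txt not found, skipping package install" >&2
fi

echo "setup complete: activate with 'source $VENV_DIR/bin/activate'"
```

Calling pip through `venv/bin/pip` (rather than activating first) keeps the setup script idempotent and safe to re-run on the server.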
Create an etl wrapper script (e.g. like etl.sh in the openartbrowser project) that takes care of the following steps:
Use the virtual python environment: source venv/bin/activate
Handle paths of input xml-files and output json files
Run the xml-importer with these input xml-files and store the json files
Run the elasticsearch_uploader.py script with the generated json files.
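The wrapper steps above could be sketched roughly as follows. The script name `xml_importer.py`, the command-line flags, and the default data paths are assumptions for illustration, not the project's actual layout; only `elasticsearch_uploader.py` and `source venv/bin/activate` come from the issue text:

```shell
#!/usr/bin/env bash
# etl.sh -- hypothetical wrapper script; file names, flags, and default
# paths below are assumptions about the repository layout.
set -euo pipefail

XML_DIR="${1:-data/xml}"    # input xml files (assumed default path)
JSON_DIR="${2:-data/json}"  # output json files (assumed default path)

# Step 1: use the virtual python environment (guarded so the sketch can
# be dry-run before etl-setup.sh has been executed)
if [ -f venv/bin/activate ]; then
    source venv/bin/activate
fi

# Step 2: handle the input/output paths
mkdir -p "$JSON_DIR"

# Step 3: run the xml-importer on the input files, writing json output
if [ -f xml_importer.py ]; then
    python xml_importer.py --input "$XML_DIR" --output "$JSON_DIR"
fi

# Step 4: upload the generated json files into a new elasticsearch index
if [ -f elasticsearch_uploader.py ]; then
    python elasticsearch_uploader.py --input "$JSON_DIR"
fi

echo "etl run finished: json output in $JSON_DIR"
```

Taking the paths as positional arguments with defaults lets the same script serve both the staging and the production server.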
Relation to other Issues
This issue is part of #3
Acceptance criteria
The etl.sh wrapper script can be executed on the staging (and production) server and creates a new elasticsearch index from the newly parsed xml files.