Automated workflow for harvesting, transforming and indexing metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack.
Workflow diagram source: flowchart.mmd (can be rendered with the Mermaid Live Editor)
Tested with Ubuntu 16.04 LTS and Ubuntu 18.04 LTS.
Install git:
sudo apt install git
Clone this git repository:
git clone https://github.com/subhh/HOS-MetadataTransformations.git
cd HOS-MetadataTransformations
Install openjdk-8-jre-headless, zip, curl, jq, metha 1.29, OpenRefine 3.2 beta, openrefine-client 0.3.4 and Solr 7.3.1:
sudo ./install.sh
Configure Solr schema:
./init-solr-schema.sh
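To double-check the result of init-solr-schema.sh, the Solr Schema API (a standard Solr feature, shown here only as a convenience) can list the fields of the configured core:
curl http://localhost:8983/solr/hos/schema/fields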
Data will be available after the first run at:
Run workflow with data source "uhhediss" and load data into local Solr (-s) and local OpenRefine service (-d):
bin/uhhediss.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
Run workflow with all data sources in parallel and load data into local Solr (-s) and local OpenRefine service (-d):
./run.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
Run workflow with all data sources and load data into two external Solr cores (-s) and an external OpenRefine service (-d):
./run.sh -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80
If your external Solr is secured with username/password (Basic Authentication Plugin), you can provide the credentials by copying cfg/solr/credentials.example to cfg/solr/credentials and filling in username and password:
cp cfg/solr/credentials.example cfg/solr/credentials
nano cfg/solr/credentials
chmod 400 cfg/solr/credentials
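The workflow scripts pick the credentials up from cfg/solr/credentials; the file format is given by credentials.example. Independently of that file, a quick manual test that Solr accepts the credentials can be done with curl and HTTP Basic Authentication (username and password here are placeholders):
curl -u myuser:mypassword "https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS/select?q=*:*&rows=0"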
Example of a daily cronjob at 00:35 that runs the workflow with all data sources, loads data into an external Solr core (-s) and an external OpenRefine service (-d), and deletes files older than 7 days (-x):
command="$(readlink -f run.sh) -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80 -x 7"
job="35 0 * * * $command"
cat <(fgrep -i -v "$command" <(crontab -l)) <(echo "$job") | crontab -
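To check the installed crontab afterwards (the fgrep above removes any previous identical entry before the new one is added):
crontab -l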
Step 1: Harvest your data source (e.g. codename yourdatasource with OAI-PMH endpoint http://ediss.sub.uni-hamburg.de/oai2/oai2.php) and load the data into OpenRefine:
./load-new-data.sh -c yourdatasource -i http://ediss.sub.uni-hamburg.de/oai2/oai2.php
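The harvesting itself is done with metha (installed above). To take a quick manual look at what the endpoint delivers, independent of the workflow and assuming the default oai_dc metadata format, metha can also be run directly:
metha-sync http://ediss.sub.uni-hamburg.de/oai2/oai2.php
metha-cat http://ediss.sub.uni-hamburg.de/oai2/oai2.php > yourdatasource.xml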
Step 2: Explore the data in OpenRefine at http://localhost:3333 (project yourdatasource_new) and create transformations until the data looks fine and suits the Solr schema.
Step 3: Extract the OpenRefine project history in JSON format and save it in a subdirectory of cfg/, e.g. cfg/yourdatasource/transformation.json.
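To test the saved operation history manually against the project from step 2, openrefine-client (installed above) can apply it directly; the project name yourdatasource_new is assumed as in step 2:
openrefine-client -H localhost -P 3333 --apply cfg/yourdatasource/transformation.json yourdatasource_new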
Step 4: Copy an existing bash shell script (e.g. bin/uhhediss.sh) to bin/yourdatasource.sh and edit line 17 (codename of the source, e.g. yourdatasource) and line 18 (URL of the OAI-PMH endpoint, e.g. http://ediss.sub.uni-hamburg.de/oai2/oai2.php). If you load a big dataset, you may need to allocate more memory to OpenRefine (line 19). An illustrative sketch of these lines follows the two commands below.
cp -a bin/uhhediss.sh bin/yourdatasource.sh
gedit bin/yourdatasource.sh
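For orientation only, the edited lines in bin/yourdatasource.sh might look roughly like this (variable names and the memory value are illustrative assumptions, not taken from the actual script; check lines 17-19 yourself):
codename="yourdatasource"                                # line 17: codename of the data source
oai_url="http://ediss.sub.uni-hamburg.de/oai2/oai2.php"  # line 18: OAI-PMH endpoint
ram="2048M"                                              # line 19: memory allocated to OpenRefine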
Step 5: Run the new script and load the data into local Solr (-s) and local OpenRefine service (-d):
bin/yourdatasource.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
Step 6: Check the results in OpenRefine (project yourdatasource_live) and Solr (query: collectionId:yourdatasource).
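A quick command-line check of the indexed records is a standard Solr select query against the core (the rows value is arbitrary):
curl "http://localhost:8983/solr/hos/select?q=collectionId:yourdatasource&rows=5"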