HOS-MetadataTransformations

DEPRECATED - This repository is no longer actively maintained.

Automated workflow for harvesting, transforming and indexing metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack: http://openscience.hamburg.de

License: GNU General Public License v3.0

Use case

  1. Harvest metadata in different standards (Dublin Core, DataCite, ...) from multiple OAI-PMH endpoints
  2. Transform the harvested data with source-specific rules to produce normalized and enriched data
  3. Load the transformed data into a Solr search index (which serves as a backend for a discovery system, e.g. HOS-TYPO3-find); a minimal sketch of these three stages follows below
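
The bin/*.sh scripts wire these stages together. As a rough sketch of the idea only (the OAI-PMH endpoint is the example endpoint used further below; the OpenRefine project name, export format and Solr core URL are placeholders, and the actual scripts use additional options):

# 1. Harvest: mirror all records from an OAI-PMH endpoint with metha
metha-sync http://ediss.sub.uni-hamburg.de/oai2/oai2.php

# 2. Transform: export normalized data from an OpenRefine project
#    ("yourdatasource" is a placeholder project name)
openrefine-client --export --output=yourdatasource.json yourdatasource

# 3. Load: post the transformed records to the Solr core
#    (assumes the export is in Solr's JSON document format)
curl "http://localhost:8983/solr/hos/update?commit=true" -H "Content-Type: application/json" --data-binary @yourdatasource.json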

Data Flow

[Mermaid flowchart of the data flow]

Source: flowchart.mmd (try the Mermaid Live Editor)

Preview

[Preview screenshot]

Features

System requirements

Installation

Tested with Ubuntu 16.04 LTS and Ubuntu 18.04 LTS.

Install git:

sudo apt install git

Clone this git repository:

git clone https://github.com/subhh/HOS-MetadataTransformations.git
cd HOS-MetadataTransformations

Install openjdk-8-jre-headless, zip, curl, jq, metha 1.29, OpenRefine 3.2 beta, openrefine-client 0.3.4 and Solr 7.3.1:

sudo ./install.sh

Configure Solr schema:

./init-solr-schema.sh
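
If you want to inspect the resulting schema, the Solr Schema API can list the configured fields of the local core (assuming the default core name hos used elsewhere in this README):

curl "http://localhost:8983/solr/hos/schema/fields"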

Usage

Data will be available after the first run in the configured Solr core and OpenRefine service (with the local defaults used below: http://localhost:8983/solr/hos and http://localhost:3333).

Run the workflow for the data source "uhhediss" and load the data into the local Solr (-s) and local OpenRefine service (-d):

bin/uhhediss.sh -s http://localhost:8983/solr/hos -d http://localhost:3333
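
To check that documents arrived in the index, a plain Solr query against the local core (core name hos as in the command above) returns the document count:

curl "http://localhost:8983/solr/hos/select?q=*:*&rows=0"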

Run the workflow for all data sources in parallel and load the data into the local Solr (-s) and local OpenRefine service (-d):

./run.sh -s http://localhost:8983/solr/hos -d http://localhost:3333

Run the workflow for all data sources and load the data into two external Solr cores (-s) and an external OpenRefine service (-d):

./run.sh -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -s https://openscience.hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80

Solr authentication

If your external Solr is secured with username/password (Basic Authentication Plugin), you can provide the credentials by copying cfg/solr/credentials.example to cfg/solr/credentials and filling in the username and password.

cp cfg/solr/credentials.example cfg/solr/credentials
nano cfg/solr/credentials
chmod 400 cfg/solr/credentials
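
To verify the credentials, you can query the secured core directly with curl's Basic Authentication option (username and password below are placeholders; the core URL is taken from the examples above):

curl --user myuser:mypassword "https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS/select?q=*:*&rows=0"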

Cronjob

Example of a daily cron job at 00:35 that runs the workflow for all data sources, loads the data into an external Solr core (-s) and an external OpenRefine service (-d), and deletes files older than 7 days (-x). The one-liner below removes any existing crontab entry containing the same command before appending the new job:

command="$(readlink -f run.sh) -s https://hosdev.sub.uni-hamburg.de/solrAdmin/HOS -d http://openrefine.sub.uni-hamburg.de:80 -x 7"
job="35 0 * * * $command"
cat <(fgrep -i -v "$command" <(crontab -l)) <(echo "$job") | crontab -
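
To confirm that the job was installed, list the current crontab:

crontab -l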

Add a data source

To add a new data source, harvest it once with load-new-data.sh, create a processing script for it from the uhhediss template, adjust the script, and run it:

# harvest the new data source (-c: short name of the source, -i: its OAI-PMH endpoint)
./load-new-data.sh -c yourdatasource -i http://ediss.sub.uni-hamburg.de/oai2/oai2.php
# create a processing script for the new data source from the uhhediss template
cp -a bin/uhhediss.sh bin/yourdatasource.sh
# adjust the new script to the new data source
gedit bin/yourdatasource.sh
# run the workflow for the new data source against the local Solr and OpenRefine services
bin/yourdatasource.sh -s http://localhost:8983/solr/hos -d http://localhost:3333