hbz / mabxml-elasticsearch

Raw hbz union catalog data exposed via a web API
http://lobid.org/hbz01

Update hbz01 Elasticsearch version #25

Closed fsteeg closed 8 years ago

fsteeg commented 8 years ago

See https://github.com/hbz/lobid/issues/300

fsteeg commented 8 years ago

Will update Elasticsearch on quaoar1, which is currently only used for the geodata index.

Started creation of a new geodata index on the quaoar cluster (quaoar1 and 2).

(lod@gaia:~/git/geodata-staging on port 7401, used by sol@quaoar1:~/git/lobid-organisations-staging)

dr0i commented 8 years ago

> quaoar cluster (quaoar1 and 2).

You meant quaoar2 and quaoar3 and did the things accordingly :)

fsteeg commented 8 years ago

> You meant quaoar2 and quaoar3 and did the things accordingly :)

Oh right, sorry for the confusion.

fsteeg commented 8 years ago

The new index on the quaoar cluster is done and was used to create the current staging data for lobid-organisations (see http://test.lobid.org/organisations/search). Switched lod@gaia:~/git/geodata to use the new index.

Elasticsearch instance on quaoar1 is now no longer used. Will update Elasticsearch on quaoar1 next.

fsteeg commented 8 years ago

Replaced the old Elasticsearch on quaoar1 with Elasticsearch 2.3.3, installed via sudo dpkg -i elasticsearch-2.3.3.deb (from https://www.elastic.co/downloads/elasticsearch). It is installed in /usr/share/elasticsearch/, with configuration in /etc/elasticsearch/ and logs in /var/log/elasticsearch/. Also installed the Elastic HQ plugin (an alternative to the head plugin, see http://elastichq.org):

http://quaoar1.hbz-nrw.de:9200/ & http://quaoar1.hbz-nrw.de:9200/_plugin/hq/

fsteeg commented 8 years ago

Started indexing with https://github.com/hbz/mabxml-elasticsearch/commit/8723a3fc6a6bb5293f8eae771203a814d04c5ed7 in sol@quaoar1:~/git/mabxml-elasticsearch-staging with:

mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args="/files/open_data/open/DE-605/mabxml/ gz quaoar1 193.30.112.170 hbz01-20160630-1315" > log/processMabxml.sh.20160630-1315.log 2>&1 &
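The index name and the log file in the command above share one timestamp (20160630-1315). A minimal sketch of deriving both from a single date call, so the two cannot drift apart. The helper names index_name and log_file are hypothetical and not part of mabxml-elasticsearch:

```shell
#!/bin/bash
# Hypothetical helpers (not part of the repository): derive the index name
# and the log file name from one shared timestamp.

# Print "<prefix>-<timestamp>", e.g. hbz01-20160630-1315
index_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Print the log path layout used in this thread for a given timestamp
log_file() {
  printf 'log/processMabxml.sh.%s.log\n' "$1"
}

ts=$(date "+%Y%m%d-%H%M")   # same layout as 20160630-1315
echo "index: $(index_name hbz01 "$ts")"
echo "log:   $(log_file "$ts")"
```

The two values could then be passed to the mvn exec:java call in place of the hard-coded names.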

fsteeg commented 8 years ago

New index was created. Created alias hbz01-test for new index.
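The thread does not say how the alias was created (it may have been done through the HQ plugin). One possible way is Elasticsearch's _aliases endpoint; the sketch below only builds the request body, and alias_add_json is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: print the request body for Elasticsearch's _aliases
# endpoint that adds alias $2 to index $1.
alias_add_json() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}\n' "$1" "$2"
}

alias_add_json hbz01-20160630-1315 hbz01-test

# To apply it (host and port as mentioned earlier in this thread):
# curl -XPOST 'http://quaoar1.hbz-nrw.de:9200/_aliases' \
#      -d "$(alias_add_json hbz01-20160630-1315 hbz01-test)"
```

The curl call is left commented out so the sketch can be run without touching the cluster.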

Copied updates since index creation from /files/open_data/open/DE-605/mabxml/ to sol@quaoar1:~/git/mabxml-elasticsearch-staging/updates:

sol@quaoar1:~/git/mabxml-elasticsearch-staging/updates$ ls
DE-605-aleph-update-marcxchange-20160630-20160701.tar.gz
DE-605-aleph-update-marcxchange-20160703-20160704.tar.gz
DE-605-aleph-update-marcxchange-20160702-20160703.tar.gz

Indexed these updates with:

mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args="/home/sol/git/mabxml-elasticsearch-staging/updates gz quaoar1 193.30.112.170 hbz01-test" > log/processMabxml.sh.20160704-0930.log 2>&1 &
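The listing above is not in date order, and the thread does not say whether flow.Transform depends on the order of the files. Assuming updates should be applied oldest first, a minimal sketch: because the file names differ only in their YYYYMMDD date range, a plain lexical sort yields chronological order. sorted_updates is a hypothetical helper:

```shell
#!/bin/bash
# Hypothetical helper: print the given update file names oldest first.
# Works because the names share one prefix and use YYYYMMDD dates.
sorted_updates() {
  printf '%s\n' "$@" | LC_ALL=C sort
}

sorted_updates \
  DE-605-aleph-update-marcxchange-20160630-20160701.tar.gz \
  DE-605-aleph-update-marcxchange-20160703-20160704.tar.gz \
  DE-605-aleph-update-marcxchange-20160702-20160703.tar.gz
```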

fsteeg commented 8 years ago

Set up a separate test transformation for updates at hduser@weywot2 and added a crontab entry:

20 09 * * * ssh hduser@weywot2 "cd ~/git/mabxml-elasticsearch-test/src/main/resources ; bash transform.sh"

Configured in src/main/resources/transform.sh to update the new index using the hbz01-test alias, used by http://test.lobid.org/hbz01:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

# Execute via crontab by hduser@weywot1:
# 20 09 * * * ssh hduser@weywot2 "cd ~/git/mabxml-elasticsearch-test/src/main/resources ; bash transform.sh"

export MAVEN_OPTS="-Dfile.encoding=UTF-8 -Xmx1024M -Xss128M -XX:+CMSClassUnloadingEnabled"

# Determine the latest update file and store it locally:
updates=http://dataproxy.lobid.org/alephxml/export/update/
date=$(date "+%Y%m%d")
updateFile=$(curl $updates | grep 'tar.gz' | cut -d '"' -f2 | grep $date)
cd updates ; wget $updates$updateFile ; cd ../../../..

# Run the transformation with the latest file (and possibly unprocessed previous files):
mvn clean install >> log/processMabxml.sh.$date.log 2>&1
mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args="src/main/resources/updates/ gz quaoar1 193.30.112.170 hbz01-test" >> log/processMabxml.sh.$date$

# Clean up and move updates to the full data directory (skipped if transformation fails, due to -e option):
cd src/main/resources/
# cp updates/* /files/open_data/open/DE-605/mabxml/
rm updates/*
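
One thing to be aware of in the script above: as far as I can tell, when no file for the current date is listed, the final grep in the updateFile pipeline exits non-zero and strict mode aborts the script at that assignment. A minimal sketch of the same lookup that returns an empty result instead, so the caller can decide what to do. find_update_file is a hypothetical helper, and the sample listing markup is an assumption:

```shell
#!/bin/bash
# Hypothetical helper: read an HTML directory listing on stdin and print the
# names of the tar.gz files whose name contains the date given as $1.
# Prints nothing (and still succeeds) when no file matches.
find_update_file() {
  grep -F 'tar.gz' | cut -d '"' -f2 | grep -F -- "$1" || true
}

# Sample listing line; the real markup of the listing is an assumption.
listing='<a href="DE-605-aleph-update-marcxchange-20160703-20160704.tar.gz">update</a>'
file=$(printf '%s\n' "$listing" | find_update_file 20160704)
if [ -z "$file" ]; then
  echo "no update file for this date, nothing to do"
else
  echo "would fetch: $file"
fi
```

In transform.sh the listing would come from curl on the updates URL rather than from a sample string.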

fsteeg commented 8 years ago

Triggered creation of second index (for separate prod and test indexes) for data:

sol@quaoar1:~/git/mabxml-elasticsearch-staging$ ls  /files/open_data/open/DE-605/mabxml/
DE-605-aleph-baseline-marcxchange-2016062414.tar.gz       DE-605-aleph-update-marcxchange-20160629-20160630.tar.gz
DE-605-aleph-update-marcxchange-20160625-20160626.tar.gz  DE-605-aleph-update-marcxchange-20160630-20160701.tar.gz
DE-605-aleph-update-marcxchange-20160626-20160627.tar.gz  DE-605-aleph-update-marcxchange-20160702-20160703.tar.gz
DE-605-aleph-update-marcxchange-20160627-20160628.tar.gz  DE-605-aleph-update-marcxchange-20160703-20160704.tar.gz
DE-605-aleph-update-marcxchange-20160628-20160629.tar.gz  README.htm

With https://github.com/hbz/mabxml-elasticsearch/commit/8723a3fc6a6bb5293f8eae771203a814d04c5ed7 in sol@quaoar1:~/git/mabxml-elasticsearch-staging:

mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args="/files/open_data/open/DE-605/mabxml/ gz quaoar1 193.30.112.170 hbz01-20160704-1030" >> log/processMabxml.sh.20160704-1030.log 2>&1 &

fsteeg commented 8 years ago

Deployed to test system: http://test.lobid.org/hbz01

dr0i commented 8 years ago

Does not work: "Alias [hbz01-test] has more than one indices associated with it"

fsteeg commented 8 years ago

Thanks for testing! I had just added an hbz01-test alias to the new, second index today. My idea was to keep both up to date via the new crontab entry, but obviously that doesn't work. I've set up an hbz01 alias for the new index instead, and manually triggered today's updates for both. I disabled the deletion of the updates in sol@quaoar1:~/git/mabxml-elasticsearch-staging and will index them manually into hbz01 when deploying to production (after functional and code review).
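The error quoted above is what Elasticsearch returns when a single index operation goes through an alias that maps to more than one index. A minimal sketch of a guard that could run before indexing: it reads the index names behind an alias, one per line, and succeeds only if there is exactly one. alias_is_writable is a hypothetical helper, and wiring it to the cat aliases API as in the final comment is an assumption about how it would be used here:

```shell
#!/bin/bash
# Hypothetical guard: read index names (one per line) on stdin and succeed
# only if there is exactly one, i.e. the alias is safe to index into.
alias_is_writable() {
  local count
  count=$(grep -c . || true)   # count non-empty lines
  [ "$count" -eq 1 ]
}

# Example with the two index names from this thread:
if printf '%s\n' hbz01-20160630-1315 hbz01-20160704-1030 | alias_is_writable; then
  echo "alias points to one index, ok to index"
else
  echo "alias points to zero or several indices, not indexing"
fi

# Against a live cluster the names could come from the cat aliases API:
# curl -s 'http://quaoar1.hbz-nrw.de:9200/_cat/aliases/hbz01-test?h=index'
```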

dr0i commented 8 years ago

+1