hbz / mabxml-elasticsearch

Raw hbz union catalog data exposed via a web API
http://lobid.org/hbz01

Production setup with incremental daily updates #20

Closed fsteeg closed 8 years ago

fsteeg commented 8 years ago

Our production setup currently indexes the full data daily. It should instead use a directory that contains only the updates.

fsteeg commented 8 years ago

Set up incremental updates via crontab on hduser@weywot1 by copying the latest update file to a directory and passing that directory to the flow.Transform program.

The latest file is determined by:

updateFile=$(date=$(date "+%Y%m%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date)
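The same filtering can be tried without network access by reading the listing from stdin. This is only a sketch: `latest_update_file` is a hypothetical helper name, and the file names in the sample listing are made up for illustration, not taken from the real export directory.

```shell
#!/usr/bin/env bash
# Sketch: pick the update file for a given day from an HTML directory listing.
# The listing is read from stdin, so the pipeline can be tested offline.
# latest_update_file is a hypothetical name, not part of the repository.
latest_update_file() {
  local day="$1"
  # Keep lines mentioning tar.gz, take the first double-quoted value
  # (the href), then keep only names containing the requested day.
  grep 'tar.gz' | cut -d '"' -f2 | grep -F -- "$day"
}

# Sample listing with invented file names.
sample_listing='<a href="update_20150101.tar.gz">update_20150101.tar.gz</a>
<a href="update_20150102.tar.gz">update_20150102.tar.gz</a>'

printf '%s\n' "$sample_listing" | latest_update_file 20150102
```

Against the real server, the listing would come from `curl -s http://dataproxy.lobid.org/alephxml/export/update/` piped into the function, with `$(date "+%Y%m%d")` as the argument.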

Copying on weywot2:

cd /home/hduser/git/mabxml-elasticsearch/; rm updates/*; cp /files/open_data/open/DE-605/mabxml/$updateFile updates/
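One caveat with the copy step: if no update file was found for the day, `$updateFile` is empty, `rm updates/*` still runs, and the transformation sees an empty directory. A guarded variant could look like the following sketch. `stage_update` and its argument order are assumptions made for illustration, not what the production setup uses.

```shell
#!/usr/bin/env bash
# Sketch: only clear the updates directory once we know there is a
# file to put into it. stage_update is a hypothetical helper.
# Usage: stage_update SRC_DIR FILE_NAME UPDATES_DIR
stage_update() {
  local src_dir="$1" file_name="$2" updates_dir="$3"
  # Refuse to touch the updates directory if there is nothing to copy.
  if [ -z "$file_name" ] || [ ! -f "$src_dir/$file_name" ]; then
    echo "no update file to stage: '$file_name'" >&2
    return 1
  fi
  mkdir -p "$updates_dir"
  rm -f "$updates_dir"/*
  cp "$src_dir/$file_name" "$updates_dir/"
}
```

With the paths from above, this would be called as `stage_update /files/open_data/open/DE-605/mabxml "$updateFile" updates`.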

Run transformation, passing the update directory and other production settings:

mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args="/home/hduser/git/mabxml-elasticsearch/updates/ gz quaoar 193.30.112.171 hbz01-mabxml" >log/processMabxml.sh.$(date "+%Y%m%d").log 2>&1

fsteeg commented 8 years ago

Reopening: the updating logic is currently part of the crontab command. For better reproducibility of our setup, it should be part of the Transform.java application, e.g. via an update parameter.

fsteeg commented 8 years ago

After some tinkering with possible additions to the Transform.java class, I've moved the logic from crontab into a script in the mabxml-elasticsearch repo instead (see https://github.com/hbz/mabxml-elasticsearch/pull/23).

Crontab now only logs into the machine where the transformation runs and calls that script:

20 06 * * * ssh hduser@weywot2 "cd ~/git/mabxml-elasticsearch/src/main/resources ; bash transform.sh"

Since our crontab is not versioned, here is the old entry:

20 06 * * * cd /files/open_data/open/DE-605/mabxml/; updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date); wget http://dataproxy.lobid.org/alephxml/export/update/$updateFile ;cd /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01 ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" >> toBeUpdateFilesXmlClobs_afterBasedump.txt; ssh hduser@weywot2 "export M2_HOME=/usr/share/maven; export MAVEN_OPTS=\"-Dfile.encoding=UTF-8 -Xmx1024M -Xss128M -XX:MaxPermSize=1024M -XX:+CMSClassUnloadingEnabled\"; cd /home/hduser/git/mabxml-elasticsearch/; rm updates/*; cp /files/open_data/open/DE-605/mabxml/$updateFile updates/ ; mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args=\"/home/hduser/git/mabxml-elasticsearch/updates/ gz quaoar 193.30.112.171 hbz01-mabxml\" >log/processMabxml.sh.$(date "+\%Y\%m\%d").log 2>&1"
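Note that the entry writes every `%` as `\%`: cron treats an unescaped `%` in the command field as a newline, so the `date` format strings have to be escaped there, while the same command run directly in a shell uses plain `%`. A small sketch for converting a shell command to its crontab form (`cron_escape` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Sketch: escape percent signs so a shell command can be pasted into
# the command field of a crontab entry. cron_escape is a hypothetical name.
cron_escape() {
  # Replace every % with \% (cron would otherwise read % as a newline).
  sed 's/%/\\%/g'
}

printf '%s\n' 'date "+%Y%m%d"' | cron_escape
```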

Below that was this, commented out:

# complete directory (with full dump), by leave out last parameter for flow.Transform

01 16 * cd /files/open_data/open/DE-605/mabxml/; updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date); wget http://dataproxy.lobid.org/alephxml/export/update/$updateFile ;cd /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01 ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" |tee --append toBeUpdateFilesXmlClobs_afterBasedump.txt >> toBeUpdateFilesXmlClobs_afterBasedump.txt; ssh hduser@weywot2 "export M2_HOME=/usr/share/maven; cd /home/hduser/git/mabxml-elasticsearch/ ; mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" >log/processMabxml.sh.$(date "+\%Y\%m\%d").log 2>&1"

I moved the step that appends the update file name to the text file into the lodmill job, where that file is actually used:

updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date) ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" >> toBeUpdateFilesXmlClobs_afterBasedump.txt

Resulting in this entry for the lodmill update job:

40 06 * * * BRANCH=master; DATE=$(date "+\%Y\%m\%d-\%H\%M\%S"); cd /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01 ; updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date) ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" >> toBeUpdateFilesXmlClobs_afterBasedump.txt ; bash -x startHbz01ToLobidResources.sh $BRANCH $(tail -n1 toBeUpdateFilesXmlClobs_afterBasedump.txt) lobid-resources NOALIAS quaoar2.hbz-nrw.de quaoar exact > log/$DATE-$BRANCH.log.startHbz01ToLobidResources.sh 2>&1
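Because the job appends a line and then reads it back with `tail -n1`, an empty `$updateFile` would append the bare directory path, and a second run on the same day would append a duplicate. A more defensive append could be sketched as below. `append_update_path` is a hypothetical helper and not part of the lodmill scripts.

```shell
#!/usr/bin/env bash
# Sketch: append an update file path to the list only if a file name
# was found and it is not already the last entry.
# append_update_path is a hypothetical helper.
# Usage: append_update_path LIST_FILE BASE_DIR FILE_NAME
append_update_path() {
  local list_file="$1" base_dir="$2" file_name="$3"
  # An empty name means no update file was found for today.
  if [ -z "$file_name" ]; then
    echo "no update file name given, list left unchanged" >&2
    return 1
  fi
  local path="$base_dir/$file_name"
  # Skip the append when the last entry is already this path.
  if [ -f "$list_file" ] && [ "$(tail -n1 "$list_file")" = "$path" ]; then
    return 0
  fi
  echo "$path" >> "$list_file"
}
```

In the entry above, this would stand in for the `echo ... >> toBeUpdateFilesXmlClobs_afterBasedump.txt` step, called as `append_update_path toBeUpdateFilesXmlClobs_afterBasedump.txt /files/open_data/open/DE-605/mabxml "$updateFile"`.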

fsteeg commented 8 years ago

Merged script and updated documentation with full localhost setup instructions in README via https://github.com/hbz/mabxml-elasticsearch/issues/23.