Closed by fsteeg 8 years ago
Set up incremental updates via crontab on hduser@weywot1 by copying the latest update file to a directory and passing that directory to the flow.Transform program.
The latest file is determined by:
updateFile=$(date=$(date "+%Y%m%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date)
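The pipeline above can be tried without network access by wrapping the filter in a function. This is only a sketch: the function name `find_update_file` and the sample listing are made up for illustration, and the listing format (file names inside `href="..."`) is an assumption based on the `cut -d '"' -f2` step.

```shell
#!/usr/bin/env bash
# Sketch: pick the update file for a given day out of an HTML directory listing.
# Assumption: each listing line carries the file name as the first quoted string.

# Reads the listing on stdin and prints the file names that contain the given date.
find_update_file() {
  local day="$1"
  grep 'tar.gz' | cut -d '"' -f2 | grep -- "$day"
}

# Hypothetical sample listing, for illustration only (not real server output).
sample_listing='<a href="update_20150101.tar.gz">update_20150101.tar.gz</a>
<a href="update_20150102.tar.gz">update_20150102.tar.gz</a>'

printf '%s\n' "$sample_listing" | find_update_file 20150102

# Against the real listing this would be used roughly as:
#   curl -s http://dataproxy.lobid.org/alephxml/export/update/ | find_update_file "$(date +%Y%m%d)"
```

If no file exists for the given day, the function prints nothing, so `updateFile` ends up empty and the following steps should check for that.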
Copying on weywot2:
cd /home/hduser/git/mabxml-elasticsearch/; rm updates/*; cp /files/open_data/open/DE-605/mabxml/$updateFile updates/
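A slightly more defensive variant of the copy step, as a sketch only: `stage_update` is a hypothetical helper, not part of the repo. It refuses to clear the updates directory when no update file was found, so an empty `$updateFile` does not leave the directory empty.

```shell
#!/usr/bin/env bash
# Sketch: clear the updates directory and copy in exactly one update file.
# stage_update SOURCE_DIR FILE_NAME UPDATES_DIR  (hypothetical helper)
stage_update() {
  local source_dir="$1" file_name="$2" updates_dir="$3"
  # Stop before deleting anything if there is no update file for today.
  if [ -z "$file_name" ] || [ ! -f "$source_dir/$file_name" ]; then
    echo "no update file to stage" >&2
    return 1
  fi
  mkdir -p "$updates_dir"
  rm -f "$updates_dir"/*
  cp "$source_dir/$file_name" "$updates_dir/"
}

# With the paths from above this would be called roughly as:
#   stage_update /files/open_data/open/DE-605/mabxml "$updateFile" /home/hduser/git/mabxml-elasticsearch/updates
```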
Run transformation, passing the update directory and other production settings:
mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args="/home/hduser/git/mabxml-elasticsearch/updates/ gz quaoar 193.30.112.171 hbz01-mabxml" >log/processMabxml.sh.$(date "+%Y%m%d").log 2>&1
Reopening: the updating logic is currently part of the crontab command. For better reproducibility of our setup, it should be part of the Transform.java application, e.g. via an update parameter.
After some tinkering with possible additions to the Transform.java class, I've moved the logic from crontab into a script in the mabxml-elasticsearch repo instead (see https://github.com/hbz/mabxml-elasticsearch/pull/23).
Crontab now only logs into the machine where the transformation runs and calls that script:
20 06 * * * ssh hduser@weywot2 "cd ~/git/mabxml-elasticsearch/src/main/resources ; bash transform.sh"
Since our crontab is not versioned, here is the old entry:
20 06 * * * cd /files/open_data/open/DE-605/mabxml/; updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date); wget http://dataproxy.lobid.org/alephxml/export/update/$updateFile ;cd /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01 ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" >> toBeUpdateFilesXmlClobs_afterBasedump.txt; ssh hduser@weywot2 "export M2_HOME=/usr/share/maven; export MAVEN_OPTS=\"-Dfile.encoding=UTF-8 -Xmx1024M -Xss128M -XX:MaxPermSize=1024M -XX:+CMSClassUnloadingEnabled\"; cd /home/hduser/git/mabxml-elasticsearch/; rm updates/*; cp /files/open_data/open/DE-605/mabxml/$updateFile updates/ ; mvn clean install ; mvn exec:java -Dexec.mainClass="flow.Transform" -Dexec.args=\"/home/hduser/git/mabxml-elasticsearch/updates/ gz quaoar 193.30.112.171 hbz01-mabxml\" >log/processMabxml.sh.$(date "+\%Y\%m\%d").log 2>&1"
Below that was this, commented out:
# complete directory (with full dump), by leave out last parameter for flow.Transform
Moved appending the update file name to the text file into the lodmill job, where it's used:
updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date) ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" >> toBeUpdateFilesXmlClobs_afterBasedump.txt
Resulting in this entry for the lodmill update job:
40 06 * * * BRANCH=master; DATE=$(date "+\%Y\%m\%d-\%H\%M\%S"); cd /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01 ; updateFile=$(date=$(date "+\%Y\%m\%d"); curl http://dataproxy.lobid.org/alephxml/export/update/ | grep 'tar.gz' | cut -d '"' -f2 | grep $date) ; echo "/files/open_data/open/DE-605/mabxml/$updateFile" >> toBeUpdateFilesXmlClobs_afterBasedump.txt ; bash -x startHbz01ToLobidResources.sh $BRANCH $(tail -n1 toBeUpdateFilesXmlClobs_afterBasedump.txt) lobid-resources NOALIAS quaoar2.hbz-nrw.de quaoar exact > log/$DATE-$BRANCH.log.startHbz01ToLobidResources.sh 2>&1
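One thing to note about the entry above: if no update file exists for the day, `$updateFile` is empty, the bare directory path gets appended to the text file, and `tail -n1` passes that to the start script. A possible guard, as a sketch only (`record_update_file` is a hypothetical helper, not part of lodmill):

```shell
#!/usr/bin/env bash
# Sketch: append an update file path to the list only if a file name was found
# and the path is not already recorded.
# record_update_file LIST_FILE UPDATE_DIR FILE_NAME  (hypothetical helper)
record_update_file() {
  local list_file="$1" update_dir="$2" file_name="$3"
  # An empty file name means no update was published for the day.
  [ -n "$file_name" ] || return 1
  local entry="$update_dir/$file_name"
  # -x matches whole lines, -F treats the entry as a fixed string, -q stays quiet.
  if ! grep -qxF -- "$entry" "$list_file" 2>/dev/null; then
    printf '%s\n' "$entry" >> "$list_file"
  fi
}

# With the names from above this would be called roughly as:
#   record_update_file toBeUpdateFilesXmlClobs_afterBasedump.txt /files/open_data/open/DE-605/mabxml "$updateFile"
```

The non-zero return value for an empty file name lets the cron entry skip the start script with `&&` instead of `;`.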
Merged script and updated documentation with full localhost setup instructions in README via https://github.com/hbz/mabxml-elasticsearch/issues/23.
Our production setup currently indexes the full data daily; it should use a directory with updates only.