Closed fsteeg closed 8 years ago
Original issue seems to be a failing download of the latest baseline dump from persephone in gaia:/opt/hadoop/cron/copyNewestFullDump.sh, which is called from the hduser@weywot1 crontab (can't connect via SSH, maybe a missing key or account on the new persephone system).

Manually downloaded the dump to gaia:/files/open_data/open/DE-605/mabxml with `wget http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz`, manually set the alias, and started a full indexing (as in the hduser@weywot1 crontab).
For a permanent solution, we need to fix the automated download. It might make sense to get the dump over HTTP in general (as I did manually above): the wget above took about 2 minutes for 7.5 GB (roughly 60 MB/s), so speed is no issue. Contacted JP to make sure getting it from http://index.hbz-nrw.de makes sense.
Indexing worked, and JP confirmed we can use http://index.hbz-nrw.de/alephxml/export/.

Next: set up baseline downloads via HTTP in the server setup, starting from the crontab for hduser@weywot1. Adding the script changes below, as the affected script is not under version control.
Replaced the old content of gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

```bash
DIR=/files/open_data/open/DE-605/mabxml
oldFile=$(ls $DIR/DE-605-aleph-base*2*.tar.gz)
oldUpdateFiles=$(ls $DIR/DE-605-aleph-update-marcxchange-*.tar.gz)
ssh admin@persephone 'cd /data/alephxml/export/baseline/ ; a=$(ls -cR | grep tar.gz | head -n 1); a=$(find . -name $a) ; scp $a hduser@gaia:/files/open_data/open/DE-605/mabxml'
#mv $oldFile /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/DE-605-aleph-newestBackupOfMonth-$(date +%m).tar.gz
#for i in $oldUpdateFiles; do rm $i; done
```
With the new content:

```bash
#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl $BASELINE_ROOT/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# URL of the actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
wget --no-verbose $BASELINE_URL
# See also https://github.com/hbz/lobid/issues/304
```
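The date-extraction pipeline can be checked in isolation. A minimal sketch of the `rev | cut -c 2- | rev` trick that strips the trailing slash from a directory-listing entry (the sample value is the example date from above):

```shell
#!/bin/bash
# One href value as extracted by `cut -d '"' -f2` from the Apache-style
# directory listing, e.g. the newest baseline directory "2016042319/".
entry='2016042319/'
# rev reverses the string, cut -c 2- drops the (now leading) slash,
# and the second rev restores the original order.
baseline_date="$(echo "$entry" | rev | cut -c 2- | rev)"
echo "$baseline_date"  # prints: 2016042319
```

Plain bash parameter expansion `${entry%/}` would do the same without spawning extra processes.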
And changed the crontab entry to redirect output to a log file:

```bash
ssh gaia 'cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1 ; [...]'
```
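For context, the crontab entry on weywot1 triggering this could look roughly like the following sketch; the exact schedule fields are an assumption for illustration (the thread below mentions a 5:20 Saturday start, later moved to 6:15), not a copy of the actual crontab:

```
# m  h   dom mon dow  command (illustrative sketch, not the actual entry)
20   5   *   *   6    ssh gaia 'cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1'
```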
Tested the trigger from the crontab for hduser@weywot1, closing.
Reopening: the weekly updates don't pick up the latest baseline. Times at http://index.hbz-nrw.de/alephxml/export/baseline/ look good, and the crontab for hduser@weywot1 is timed at 5:20, so it should see the latest baseline. Manual execution of the script yields the correct baseline. Added debug output of the actual date to the script, see current content below. Keeping the previous baseline index, including all updates, as the productive index.
hduser@gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

```bash
#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'
echo "Copy newest baseline, date: $(date)"
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl $BASELINE_ROOT/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# File name, e.g. DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_FILE="DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
# URL of the actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/$BASELINE_FILE"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
if [ -f $BASELINE_FILE ]; then
  echo "File already exists, exit 1"
  exit 1
fi
wget --no-verbose $BASELINE_URL
# See also https://github.com/hbz/lobid/issues/304
```
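The existence check added above makes the script idempotent: a re-run for a baseline that is already on disk exits instead of downloading again. A small local demonstration of that guard logic (the function and file names are made up for the demo):

```shell
#!/bin/bash
# Mirrors the guard in copyNewestFullDump.sh: refuse to fetch a file twice.
check_baseline() {
  local file="$1"
  if [ -f "$file" ]; then
    echo "File already exists, exit 1"
    return 1
  fi
  echo "Would download $file"
}

cd "$(mktemp -d)"
check_baseline demo-baseline.tar.gz              # first run: proceeds
touch demo-baseline.tar.gz                       # simulate a completed download
check_baseline demo-baseline.tar.gz || echo "guard triggered"  # second run: refuses
```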
Logging output confirms that the timing should be correct: `Sa 14. Mai 05:20:01 CEST 2016` (Sat, 14 May), but the script got DE-605-aleph-baseline-marcxchange-2016050614.tar.gz, even though, according to http://index.hbz-nrw.de/alephxml/export/baseline/2016051314/, the latest dump was written on 13-May-2016 23:04.
Running `curl http://index.hbz-nrw.de/alephxml/export/baseline/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev` (see script above) now gives the correct result, 2016051314. Perhaps the file timestamp is misleading, and we should schedule the cron job for a later time, @dr0i?
Not switching to new index, as it would be missing updates. To save space, we should delete it.
Note: the dumps at http://lobid.org/download/dumps/DE-605/mabxml/ are also messed up. These files are built by the script, but the crucial commands were commented out even in the original file, see https://github.com/hbz/lobid/issues/304#issuecomment-214714629. Commented them back in so that the old files will be moved. (For diffs, I made copies of the files, suffixing a timestamp.)
I also don't understand the cause of the problem yet. Therefore I added a debug parameter to the script to get more information (`bash -x ...`). Also, the NFS server demeter is now unmounted (not sure if this is related).

@fsteeg, please check on Monday if this is working. If necessary, I will analyze further on Tuesday.
Same issue: it took DE-605-aleph-baseline-marcxchange-2016051314.tar.gz, with no additional output in the log.
Still not clear. Did the following though:
We have to wait until next Saturday. Since the resources are updated in the productive, older index from 2016-05-07, this is no problem, as long as there were no mapping changes in the data transformations since then (which would only apply to the updates, not the baseline). Is that the case, @fsteeg? Otherwise I would make sure that all updates are indexed into the newest index from 2016-05-20 and switch to that index.
No, there were no transformation changes that are not productive yet, so +1 for keeping the old index.
For next week's run, maybe we should try a bigger time change, like 9 hours (Saturday afternoon)?
A bigger time change is an option, agreed. But it implies moving the daily updates correspondingly later on Saturdays AND adding an extra crontab entry for the Saturday daily update. First of all I would want to know what's actually going on there, so let's just wait and see what the logs tell us next time before we move the time of getting and feeding the base data.
The cause of the phenomenon was that the file was rsynced to the webserver at 6:01, preserving the timestamp on the file, hence the confusion. Modified the cron job to start at 6:15. That worked well. Closing.
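The confusing listing dates are consistent with how archive-mode copies work: `rsync -a` (like `cp -p`) preserves the source file's modification time, so a file that only arrives on the webserver at 6:01 can still show a much earlier timestamp in the directory listing. A local sketch of the effect using `cp -p` (file names invented; assumes GNU coreutils for `touch -d`):

```shell
#!/bin/bash
cd "$(mktemp -d)"
# A "source" file with an old modification time, as on persephone:
touch -d '2016-05-13 23:04' source.tar.gz
# An archive-style copy preserves that mtime on the "webserver" copy:
cp -p source.tar.gz webserver-copy.tar.gz
# A file actually created now, for comparison:
touch created-now.tar.gz
# The copy sorts as older than the freshly created file, even though
# both just appeared in this directory:
[ webserver-copy.tar.gz -ot created-now.tar.gz ] && echo "copy keeps old mtime"
```

This is why the listing date "13-May-2016 23:04" said nothing about when the file became available for download, and why moving the cron job after the 6:01 rsync fixed the issue.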
Weekly full index creation failed due to changes in server infrastructure we depend on. Affects API 1.x and data 2.0.