hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Fix weekly full index creation #304

Closed fsteeg closed 8 years ago

fsteeg commented 8 years ago

Weekly full index creation failed due to changes in server infrastructure we depend on. Affects API 1.x and data 2.0.

fsteeg commented 8 years ago

Original issue seems to be a failing download of the latest baseline dump from persephone in gaia:/opt/hadoop/cron/copyNewestFullDump.sh, which is called from the hduser@weywot1 crontab (can't connect via SSH, maybe a missing key or account on the new persephone system).

Manually downloaded to gaia:/files/open_data/open/DE-605/mabxml with wget http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz, manually set alias and started full indexing (as in hduser@weywot1 crontab).

For a permanent solution, we need to fix the automated download. It might make sense to get it over HTTP in general (as I did manually above). The wget above took about 2 minutes for 7.5 GB, so no issue there. Contacted JP to make sure getting it from http://index.hbz-nrw.de makes sense.

fsteeg commented 8 years ago

Indexing worked, and JP confirmed that we should use http://index.hbz-nrw.de/alephxml/export/

Next: set up baseline downloads via HTTP in server setup starting from crontab for hduser@weywot1

fsteeg commented 8 years ago

Adding the script changes below, as the affected script is not under version control.

Replaced the old content of gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

DIR=/files/open_data/open/DE-605/mabxml
oldFile=$(ls $DIR/DE-605-aleph-base*2*.tar.gz)
oldUpdateFiles=$(ls  $DIR/DE-605-aleph-update-marcxchange-*.tar.gz)
ssh admin@persephone 'cd /data/alephxml/export/baseline/ ; a=$(ls -cR | grep tar.gz | head -n 1); a=$(find . -name $a) ; scp $a hduser@gaia:/files/open_data/open/DE-605/mabxml'
#mv $oldFile /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/DE-605-aleph-newestBackupOfMonth-$(date +%m).tar.gz
#for i in $oldUpdateFiles; do rm $i; done

With new content:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl "$BASELINE_ROOT/" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# URL of actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
wget --no-verbose "$BASELINE_URL"

# See also https://github.com/hbz/lobid/issues/304
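The date-extraction pipeline in the script can be exercised offline. The sketch below feeds a hypothetical Apache-style directory listing (the entry names mirror the real ones, but the HTML is made up) through the same pipe; note it relies on the newest directory being listed last, which holds for an alphabetically sorted listing of these fixed-width date names:

```shell
# Hypothetical two-entry directory listing, shaped like Apache's autoindex HTML;
# the real listing at $BASELINE_ROOT/ has more entries and extra markup.
LISTING='<a href="2016042319/">2016042319/</a>
<a href="2016051314/">2016051314/</a>'

# Same pipeline as in the script: keep lines containing "20", take the href
# attribute, keep the last (newest) entry, and strip the trailing slash.
BASELINE_DATE="$(echo "$LISTING" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
echo "$BASELINE_DATE"  # prints 2016051314
```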

And changed crontab entry to redirect output to a log file:

ssh gaia 'cron/copyNewestFullDump.sh > cron/copyNewestFullDump.log 2>&1 ; [...]'

Tested trigger from crontab for hduser@weywot1, closing.

fsteeg commented 8 years ago

Reopening: the weekly updates don't pick up the latest baseline. The times at http://index.hbz-nrw.de/alephxml/export/baseline/ look good, and the crontab for hduser@weywot1 is timed at 5:20, so it should see the latest baseline. Manual execution of the script yields the correct baseline. Added debug output of the actual date in the script, see the current content below. Keeping the previous baseline index, including all updates, as the productive index.

hduser@gaia:/opt/hadoop/cron/copyNewestFullDump.sh:

#!/bin/bash
set -euo pipefail # See http://redsymbol.net/articles/unofficial-bash-strict-mode/
IFS=$'\n\t'

echo "Copy newest baseline, date: $(date)"
BASELINE_ROOT="http://index.hbz-nrw.de/alephxml/export/baseline"
# Date of the latest baseline dump, with trailing slash removed, e.g. "2016042319"
BASELINE_DATE="$(curl "$BASELINE_ROOT/" | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev)"
# File name, e.g. DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_FILE="DE-605-aleph-baseline-marcxchange-$BASELINE_DATE.tar.gz"
# URL of actual baseline file, e.g. http://index.hbz-nrw.de/alephxml/export/baseline/2016042319/DE-605-aleph-baseline-marcxchange-2016042319.tar.gz
BASELINE_URL="$BASELINE_ROOT/$BASELINE_DATE/$BASELINE_FILE"
echo "Getting baseline from $BASELINE_URL"
cd /files/open_data/open/DE-605/mabxml
if [ -f "$BASELINE_FILE" ]; then
    echo "File already exists, exit 1"
    exit 1
fi
wget --no-verbose "$BASELINE_URL"

# See also https://github.com/hbz/lobid/issues/304
fsteeg commented 8 years ago

Logging output confirms that the timing should be correct: Sa 14. Mai 05:20:01 CEST 2016, but got DE-605-aleph-baseline-marcxchange-2016050614.tar.gz, even though according to http://index.hbz-nrw.de/alephxml/export/baseline/2016051314/, the latest dump was written on 13-May-2016 23:04.

Running curl http://index.hbz-nrw.de/alephxml/export/baseline/ | grep '20' | cut -d '"' -f2 | tail -n 1 | rev | cut -c 2- | rev (see script above) now gives the correct result, 2016051314. Perhaps the file timestamp is misleading, and we should schedule the cron job for a later time, @dr0i?

Not switching to new index, as it would be missing updates. To save space, we should delete it.
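Given the stale-listing symptom, one possible hardening would be to compare the extracted date against the newest baseline already on disk and fail loudly when the listing has not moved forward. A minimal sketch, where the variable values are stand-ins for what the real script would compute:

```shell
#!/bin/bash
set -euo pipefail

# Stand-in values: in the real script these would come from the curl pipeline
# and from the newest DE-605-aleph-baseline-*.tar.gz file already on disk.
extracted_date="2016051314"
newest_local_date="2016050614"

# Plain string comparison is safe here because the dates are fixed-width YYYYMMDDHH.
if [ "$extracted_date" \> "$newest_local_date" ]; then
    echo "new baseline available: $extracted_date"
else
    echo "listing still shows $extracted_date (local: $newest_local_date), retry later" >&2
    exit 1
fi
```

This turns a silently stale run into a logged failure that cron can report.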

dr0i commented 8 years ago

Note: http://lobid.org/download/dumps/DE-605/mabxml/ is also messed up. These files are built by the script, but the crucial commands were commented out even in the original file, see https://github.com/hbz/lobid/issues/304#issuecomment-214714629. Commented them back in so that the old files will be moved. (For diffs, I made copies of the files, suffixing a timestamp.)

dr0i commented 8 years ago

I also don't understand the cause of the problem, so I added a debug parameter to the script to get more information ("bash -x ..."). Also, the NFS server demeter is now unmounted (not sure if this is related). @fsteeg, please check again on Monday whether this is working. If necessary, I will analyze further on Tuesday.
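For reference, "bash -x" prints every command to stderr with a "+" prefix as it executes, so the next cron log should show exactly which value the pipeline produced. A tiny local illustration (the command line here is a toy, not the real script):

```shell
# Run a toy command line under "bash -x"; the execution trace goes to stderr.
TRACE="$(mktemp)"
bash -x -c 'BASELINE_DATE=2016051314; echo "date: $BASELINE_DATE"' 2>"$TRACE"
cat "$TRACE"  # each executed command appears with a leading "+"
```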

fsteeg commented 8 years ago

Same issue, took DE-605-aleph-baseline-marcxchange-2016051314.tar.gz, no additional output in log.

dr0i commented 8 years ago

Still not clear; did the following though:

We have to wait till next Saturday. Since the resources are updated in the productive older index from 2016-05-07, this is no problem, as long as there were no changes to the mapping in the data transformations since then (which would only apply to the updates, not the base). Is that so, @fsteeg? Otherwise I would make sure that all updates are indexed into the newest index from 2016-05-20 and switch to that index.

fsteeg commented 8 years ago

No, there were no transformation changes that are not productive yet, so +1 for keeping the old index.

For next week's run, maybe we should try a bigger time change, like 9 hours (Saturday afternoon)?

dr0i commented 8 years ago

Agreed, a bigger time change is an option. But it implies moving the daily updates correspondingly later on Saturday AND having an extra crontab entry for Saturday's daily update. First of all, I want to know what's actually going on there, so let's just wait for what the logs tell us next time before we move the time of getting and feeding the base data.

dr0i commented 8 years ago

The cause of the phenomenon was that the file was rsynced to the webserver at 6:01, which preserved the source timestamp on the file, hence the confusion. Modified the cron job to start at 6:15. That worked well. Closing.
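The timestamp-preserving behavior described here can be reproduced with any archive-mode copy; rsync -a (or -t) keeps the source mtime just like cp -p in this sketch (paths and dates are made up):

```shell
# Demonstrate that an archive-mode copy preserves the source's modification
# time, so a file synced at 6:01 can still list a Friday-night timestamp.
cd "$(mktemp -d)"
mkdir src dst
touch -d "2016-05-13 23:04" src/baseline.tar.gz  # pretend this is the dump's real mtime
cp -p src/baseline.tar.gz dst/                   # -p keeps timestamps, like rsync -a would
date -r dst/baseline.tar.gz +%F                  # prints 2016-05-13, not the copy time
```

So a directory listing sorted or filtered by mtime can lag behind the actual sync time, which is exactly why moving the cron start past the 6:01 rsync fixed the pickup.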