hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0
Remove delimiter from subjectLabel #312

Closed acka47 closed 7 years ago

acka47 commented 7 years ago

Reported by @aquast via email:

For the resource http://lobid.org/resource/HT018433961/about, slashes appear to have been used in the "subjectLabel" field to separate the items:

JLD view:

"subjectLabel" : [ "1400-1468; Guttemberg, Giovanni", "1400-1468; Gensfleisch, Johann", "1400-1468; Gutenberg, Giovanni", "1400-1468; Guttenbergius, Joannes", "1400-1468; Gensfleisch ZurLaden, Johannes", "1400-1468; Gensfleisch zur Laden, Johannes", "1400-1468; Gutenberg, Johann", "1400-1468; Gensfleisch zur Lade, Johannes", "1400-1468; Gutenberg, Jean", "1400-1468", "1400-1468; Gutemberg, Joannes", "1400-1468; Gutenberg, Iogann", "1400-1468; Gensfleisch, Johannes", "1400-1468; Gensfleisch von Sorgenloch, Johann", "1400-1468; Guttenberg, Johann", "1400-1468; Gutenberg, Johann G.", "Gutenberg, Johannes", "1400-1468; Gensfleisch zum Gutenberg, Johann" ],

The data producer (LBZ) actually intended the presumably customary separation at the semicolons.

Is there a reason for splitting at all, and for splitting on "/" in particular? Or could you perhaps fix this quickly?

When I asked, Stephani recommended that the data coming from field 710 a should not be split at all. That would be my preference too.

@aquast says we should rather store this as one string or use ";" as the delimiter. See Morph lines 1264-1268.
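
To make the difference between the two delimiters concrete, here is a small shell sketch. The raw value of field 710 a is a hypothetical reconstruction based on the "subjectLabel" output above, not the actual catalog record, and `split_on` is a helper made up for this illustration. The real transformation lives in the Morph file, not in shell.

```shell
#!/usr/bin/env bash
# Hypothetical raw value of field 710 a (assumed for illustration only).
raw='Gutenberg, Johannes / 1400-1468; Gensfleisch, Johann / 1400-1468'

# split_on DELIM STRING: print one item per line, with surrounding
# spaces trimmed. DELIM must be a single character.
split_on() {
  printf '%s\n' "$2" | tr "$1" '\n' | sed -e 's/^ *//' -e 's/ *$//'
}

echo 'Split on "/" (names and dates get torn apart):'
split_on '/' "$raw"

echo 'Split on ";" (each item keeps its name and dates):'
split_on ';' "$raw"
```

Splitting on "/" produces fragments such as "1400-1468; Gensfleisch, Johann", which matches the pattern seen in the reported output. Splitting on ";", or not splitting at all, keeps each name together with its dates.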

fsteeg commented 7 years ago

Build for https://github.com/lobid/lodmill/pull/791 was passing, merged.

Weekly index creation pulls master, so the change should be automatically deployed over the weekend.

fsteeg commented 7 years ago

Index size for lobid-resources is about 61 GB, see: http://gaia.hbz-nrw.de:9200/_plugin/head/

Free space on gaia is 57 GB, so new index over the weekend will be an issue.
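
The arithmetic behind this can be sketched as a tiny shell check. `needs_cleanup` is a hypothetical helper, and the two numbers are simply the figures quoted above. In practice they would have to be read from the index statistics and the file system.

```shell
#!/usr/bin/env bash
# needs_cleanup INDEX_GB FREE_GB: succeed (exit 0) when a second index of
# the same size as the current one would not fit into the free space.
# Both arguments are whole numbers of gigabytes.
needs_cleanup() {
  [ "$1" -ge "$2" ]
}

index_gb=61   # current size of the lobid-resources index (from this comment)
free_gb=57    # free space on gaia (from this comment)

if needs_cleanup "$index_gb" "$free_gb"; then
  echo "Not enough room for a new ${index_gb} GB index (${free_gb} GB free)"
fi
```

The sketch assumes the new index will be roughly as large as the current one, which is why freeing the staging index before the weekend run makes sense.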

Deleting the current staging index, setting alias to production index.

acka47 commented 7 years ago

No changes at http://lobid.org/resource?id=HT018433961&format=full. Did the full re-indexing run as planned during the weekend?

fsteeg commented 7 years ago

No, index creation failed for both API 1.x and 2.0.

The issue seems to be that the actual baseline dump file is missing:

ls -al /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/aliasNewestFulldump.tar.gz
lrwxrwxrwx 1 800 800 87 Jul 24 06:44 /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/aliasNewestFulldump.tar.gz -> /files/open_data/open/DE-605/mabxml/DE-605-aleph-baseline-marcxchange-2016072318.tar.gz

tail /files/open_data/open/DE-605/mabxml/DE-605-aleph-baseline-marcxchange-2016072318.tar.gz
tail: cannot open `/files/open_data/open/DE-605/mabxml/DE-605-aleph-baseline-marcxchange-2016072318.tar.gz' for reading: No such file or directory
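
A check like the following could catch this situation before index creation starts. It is only a sketch: `is_dangling` is a hypothetical helper and not part of the existing scripts. The path is the alias shown above.

```shell
#!/usr/bin/env bash
# is_dangling PATH: succeed (exit 0) when PATH is a symlink whose target
# does not exist. "-e" follows symlinks, so it fails for a broken link.
is_dangling() {
  [ -L "$1" ] && [ ! -e "$1" ]
}

link=/files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/aliasNewestFulldump.tar.gz

if is_dangling "$link"; then
  echo "Dangling symlink: $link -> $(readlink "$link")" >&2
fi
```

Running such a test at the top of the weekly job would turn a silent indexing failure into an explicit error message.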

@dr0i: Could this be related to our new setup for getting the catalog data (see https://github.com/hbz/lobid-resources/issues/91)?

@acka47: The transformation change in https://github.com/lobid/lodmill/commit/e3f4bd5c5f875bc2f226af7c1da2ef7ad166194d was only for API 1.x, we'll need to do the same in https://github.com/hbz/lobid-resources, or are we taking a different approach for API 2.0?

acka47 commented 7 years ago

> The transformation change in https://github.com/lobid/lodmill/commit/e3f4bd5c5f875bc2f226af7c1da2ef7ad166194d was only for API 1.x, we'll need to do the same in https://github.com/hbz/lobid-resources, or are we taking a different approach for API 2.0?

The plan was to get rid of subjectLabel altogether in API 2.0 (see hbz/lobid-resources#8). Thus, we will have to think about where to put the contents of 710. I already addressed the problem in https://github.com/hbz/lobid-resources/issues/8#issuecomment-214744018.

fsteeg commented 7 years ago

Manually triggered new index creation from tar.gz baseline at http://index.hbz-nrw.de/alephxml/export/baseline/2016072214/, based on what happens in the weekly crontab job for hduser@weywot1:

cd /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01

wget http://index.hbz-nrw.de/alephxml/export/baseline/2016072214/DE-605-aleph-baseline-marcxchange-2016072214.tar.gz

DATE=$(date "+%Y%m%d-%H%M")

BRANCH=master

bash -x startHbz01ToLobidResources.sh $BRANCH /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01/DE-605-aleph-baseline-marcxchange-2016072214.tar.gz lobid-resources-$DATE "-staging" quaoar2.hbz-nrw.de quaoar create doc/scripts/hbz01/toBeUpdateFilesXmlClobs_afterBasedump.txt > $DATE-$BRANCH.staging.log.startHbz01ToLobidResources.sh 2>&1 &

Update file used above looks good:

cat ../../../doc/scripts/hbz01/toBeUpdateFilesXmlClobs_afterBasedump.txt
/files/open_data/open/DE-605/mabxml/DE-605-aleph-update-marcxchange-20160723-20160724.tar.gz
/files/open_data/open/DE-605/mabxml/DE-605-aleph-update-marcxchange-20160724-20160725.tar.gz

Indexing into lobid-resources-20160725-1244, see http://quaoar2.hbz-nrw.de:9200/_plugin/head/

fsteeg commented 7 years ago

Deployed to staging, see: http://test.lobid.org/resource?id=HT018433961&format=full

acka47 commented 7 years ago

Looks good. Did you take care of all the updates? If yes: +1

fsteeg commented 7 years ago

Logs for the updates that ran as part of the manual indexing (see https://github.com/hbz/lobid/issues/312#issuecomment-234923577) on weywot1 in /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01/20160725-1244-master.staging.log.startHbz01ToLobidResources.sh (finished at 04:33) and regular automated updates in /home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01/log/20160726-071001-master.staging.log.startHbz01ToLobidResources.sh (starting at 07:10) look good.

fsteeg commented 7 years ago

Deployed to production, closing. See:

http://lobid.org/resource?id=HT018433961&format=full

(Added https://github.com/hbz/lobid-resources/issues/91#issuecomment-235198285 to open issue about the baseline file problem.)

aquast commented 7 years ago

thx, +1