hbz / lobid

Linking Open Bibliographic Data
https://lobid.org/
Eclipse Public License 2.0

Missing HT-Resources #322

Closed jschnasse closed 7 years ago

jschnasse commented 7 years ago

http://lobid.org/resource/HT019239207 http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT019239207
http://lobid.org/resource/HT019239326 http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT019239326
http://lobid.org/resource/HT019239190 http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT019239190
http://lobid.org/resource/HT019241015 http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT019241015

acka47 commented 7 years ago

All are there in API 2.0, see e.g. http://lobid.org/resources/HT019239207

dr0i commented 7 years ago

I don't understand yet what went wrong. I grepped all update clobs (see http://lobid.org/download/dumps/DE-605/mabxml/) but couldn't find the resources there, so they should be in the big baseline dump. The logs of that indexing process show nothing unusual. The first three resources are present in the old index, though, see e.g. http://test.lobid.org/resource/HT019239207/about . I just updated that older index and switched back to it. Now waiting for the weekend's new indexing; maybe the problem resolves itself then. (But I don't think so => further investigation.)

dr0i commented 7 years ago

Checked if the resource can be transformed and indexed - yes, it can! Hm, hm. So - weekend!

acka47 commented 7 years ago

Three of the four resources are now transformed and indexed. One is still missing: http://lobid.org/resource/HT019241015/about

dr0i commented 7 years ago

"All are there in API 2.0"

This is true because that index was an old, unswitched one. In all newer indexes (after 2017-02-12) these documents are missing. Since the indexing test above shows there is no fundamental problem with transforming and indexing the MabXmlClobs, I assume the resources are missing from the base dump. To investigate this, I am waiting for the result of:

tar xfz /files/open_data/open/DE-605/mabxml/DE-605-aleph-baseline-marcxchange-2017022510.tar.gz --ignore-command-error --to-command='grep "HT019239207\|HT019239190\|HT019241015\|HT019239326"'

dr0i commented 7 years ago

The command above turned up nothing. There is a much faster command IF you know the aleph-ID (which I happen to know). Try:

nohup tar -ztvf /files/open_data/open/DE-605/mabxml/DE-605-aleph-baseline-marcxchange-2017030410.tar.gz | grep 000805561 > grep_000805561_baseline.txt &

You can get the aleph-ID by looking at the MabXmlClobs: http://lobid.org/hbz01/HT002189125
You can also look at the other hbz index: https://index.hbz-nrw.de/_es2/hbz/_search?q=HT002189125

Conclusion: The resource is missing from our MabXmlClobs source dump. But it doesn't seem to be missing from the source dump used by the other index. To confirm this I have to grep the other source dump (I don't have access to it yet).
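For anyone who prefers to script the faster lookup, here is a minimal Python sketch of the same idea: list the archive's member names and keep those containing the aleph-ID, without unpacking any content. It assumes, as the tar/grep command above implies, that the aleph-ID appears in the member names; the function name `find_members` is made up for illustration.

```python
import tarfile


def find_members(archive_path, needle):
    """Return the names of archive members whose name contains `needle`.

    The archive is read as a stream ("r|gz"), so only the member headers
    are inspected; member contents are skipped, not extracted.
    """
    with tarfile.open(archive_path, mode="r|gz") as archive:
        return [member.name for member in archive if needle in member.name]
```

Calling `find_members(path_to_baseline_dump, "000805561")` returns an empty list if the record is not in the dump.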

dr0i commented 7 years ago

Increasing the max parameter of the dumping script will resolve the issue. It was set to 21500000, whereas e.g. HT019239207 has a slightly greater number: "<identifier>aleph-publish:021502302</identifier>". Thx @jprante for the hint!
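To make the arithmetic explicit, here is a small Python sketch of the check. It assumes the dumping script exports every record whose aleph number is at most max (whether the bound is inclusive is a guess); the helper names `aleph_number` and `is_exported` are hypothetical and not part of the dumping script.

```python
def aleph_number(identifier):
    """Extract the numeric part of an identifier like 'aleph-publish:021502302'."""
    return int(identifier.rsplit(":", 1)[1])


def is_exported(identifier, max_id):
    """Assumption: records with an aleph number above max_id are skipped."""
    return aleph_number(identifier) <= max_id


old_max = 21500000
record = "aleph-publish:021502302"  # identifier of HT019239207, from the comment above

# 21502302 is greater than 21500000, so the record falls outside the dump.
print(aleph_number(record), is_exported(record, old_max))
```

Any max above 21502302 would bring this particular record back into the dump.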

dr0i commented 7 years ago

Deployed to production. The missing documents are back again. Please have a look @jschnasse.

jschnasse commented 7 years ago

thx!