hbz / mabxml-elasticsearch

Raw hbz union catalog data exposed via a web API
http://lobid.org/hbz01

2200 MAB sources missing #44

Closed acka47 closed 5 years ago

acka47 commented 6 years ago

Compare http://lobid.org/hbz01/HT000052997 with http://193.30.112.134/F/?func=find-c&ccl_term=IDN%3DHT000052997

acka47 commented 6 years ago

It would be interesting to compare the numbers of titles in Aleph to the numbers in lobid.org/hbz01/ vs. lobid.org/resources. @ChristophEwertowski, can you check via the client how many titles are in the Aleph system at a given point in time?

ChristophEwertowski commented 6 years ago

My direct colleagues didn't know how, so I wrote an email to the colleagues they recommended.

ChristophEwertowski commented 6 years ago

Hmm. I only got the link to the wiki page: https://wiki1.hbz-nrw.de/display/VDBE/Statistik+aktueller+Datenbestand. According to this, we have ~1600 fewer resources than exist in Aleph.

dr0i commented 6 years ago

Just a note - this resource was already missing in the old hbz01 of api 1.0 - an index rooted in 2016 (and updated regularly since): curl 'http://quaoar2.hbz-nrw.de:9200/hbz01-mabxml/_search?q=HT000052997' .

dr0i commented 6 years ago

I triggered a full-dump indexing of hbz01-mabxml from 20180112, the date our full dump was collected. The comparison figure is from a slightly different date (14.01.2018 | hbz01 | Titeldaten | 20.250.529). I could add the missing two days as updates so that we have properly comparable numbers. The index is built in parallel to the one used in production and is named hbz01-stage.

dr0i commented 6 years ago

hbz01-stage has 20.248.321 or 20.248.333 resources (depending on the exact time of indexing on 14.1., morning or evening). => ~2.208 resources are missing.
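The gap can be checked with a quick calculation. This is a minimal sketch in Python that uses only the figures quoted in this thread; the variable names are illustrative.

```python
# Title count reported for Aleph (hbz01, Titeldaten) on 14.01.2018.
aleph_titles = 20_250_529

# Counts in hbz01-stage, depending on the time of indexing on 14.1.
stage_counts = [20_248_321, 20_248_333]

# Difference between the Aleph count and each staged index count.
missing = [aleph_titles - count for count in stage_counts]
print(missing)  # [2208, 2196]
```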

dr0i commented 6 years ago

The resource is not even part of the Aleph dump. The following command doesn't return anything:

tar -vtf DE-605-aleph-baseline-marcxchange-2018011923.tar.gz |grep 000046099

dr0i commented 6 years ago

I increased the "max" parameter in aleph-dumper.sh considerably, from 26.500.000 to 32.500.000. Maybe this will bring the missing resources to light. We will know next Monday.

dr0i commented 6 years ago

I don't know whether increasing max caused the following error, which occurred 6 times:

[2018-01-27 16:09:17,173][ERROR][hbz.tools.convert.aleph.AlephPublish2MarcXchangeJSON][pool-3-thread-4] got error, waiting before retry... ORA-01555: Snapshot zu alt: Rollback-Segmentnummer 7 namens "_SYSSMU7_4106315169$" ist zu klein.
java.sql.SQLException: ORA-01555: Snapshot zu alt: Rollback-Segmentnummer 7 namens "_SYSSMU7_4106315169$" ist zu klein.
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) ~[ojdbc7_g.jar:12.1.0.2.0]

I am also not sure whether this somehow reset the SQL query. What I see in the resulting baseline dump:

Resets of the SQL query would explain why the data was downloaded ~3 times. [edit dr0i: confirmed that resources are added to the archive multiple times: tar -vtf DE-605-aleph-newestBackupOfMonth-01.tar.gz |grep 004105264 => 4 hits.]
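The tar/grep check above covers one ID at a time. To list every entry that was written to the archive more than once, something like the following Python sketch could be used. It assumes that each record is stored as its own archive member and that duplicates share the same member name; the function name `duplicated_members` is hypothetical and not part of the existing tooling.

```python
import tarfile
from collections import Counter


def duplicated_members(archive_path):
    """Return {member name: count} for names stored more than once."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Iterating the archive yields one TarInfo per stored entry,
        # so repeated entries with the same name are counted separately.
        counts = Counter(member.name for member in archive)
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicated_members("DE-605-aleph-newestBackupOfMonth-01.tar.gz")` would then return the repeated names together with their hit counts.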

This file was not ETLed lately because of its late creation. Informed Verbundgruppe about the incident.

dr0i commented 6 years ago

BTW, the missing HT000052997 is still missing. So I reset the max to the old value.

dr0i commented 6 years ago

OK, so this resource is also missing in hbz's other ES index. I'll inform them and ask the Verbundgruppe if they have a clue.

acka47 commented 6 years ago

Natascha was just here and asked about the missing resources. As we don't know which ones are missing, I told her that ideally the Verbundgruppe would give us a list of all hbzIDs in hbz01 so that we could create a diff against the hbzIDs we have in lobid. She will see what she can do.
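Once both ID lists exist, the diff itself is a set difference. A minimal sketch in Python follows; the function name `missing_ids` and the sample IDs other than HT000052997 are made up for illustration, and real input would be the full lists from hbz01 and lobid.

```python
def missing_ids(source_ids, indexed_ids):
    """Return the sorted IDs present in the source but absent from the index."""
    return sorted(set(source_ids) - set(indexed_ids))


# Hypothetical sample data standing in for the two full ID lists.
hbz01_ids = ["HT000052997", "HT000000001", "HT000000002"]
lobid_ids = ["HT000000001", "HT000000002"]

print(missing_ids(hbz01_ids, lobid_ids))  # ['HT000052997']
```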

dr0i commented 6 years ago

So the cause is a problem at the very source, in the Oracle DB. This will be fixed once all missing resources have been identified following the procedure @acka47 mentioned. As HT000052997 has already been identified, that resource is now fixed and available in our services, see e.g. http://lobid.org/resources/HT000052997. Thx @hagbeck for the original report.

jprante commented 6 years ago

[2018-01-27 16:09:17,173][ERROR][hbz.tools.convert.aleph.AlephPublish2MarcXchangeJSON][pool-3-thread-4] got error, waiting before retry... ORA-01555: Snapshot zu alt: Rollback-Segmentnummer 7 namens "_SYSSMU7_4106315169$" ist zu klein.
java.sql.SQLException: ORA-01555: Snapshot zu alt: Rollback-Segmentnummer 7 namens "_SYSSMU7_4106315169$" ist zu klein.
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) ~[ojdbc7_g.jar:12.1.0.2.0]

Do not use the old version of Aleph publishing consumer, it has bugs. This bug is fixed in the new version.

dr0i commented 6 years ago

made a list of all 2.5k missing resources at https://gist.github.com/dr0i/cb45731d0f95e7b2f21c1d215f2753e0.

dr0i commented 6 years ago

Before we lose this information - from January's emails: there are some bad characters in the field and subfield definitions in the source of the MAB XML CLOBs.

dr0i commented 5 years ago

@acka47 can you trigger verbundgruppe?

acka47 commented 5 years ago

made a list of all 2.5k missing resources at https://gist.github.com/dr0i/cb45731d0f95e7b2f21c1d215f2753e0.

Some of the entries are duplicates. Can you please update the list with unique IDs?
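Removing the duplicates while keeping the original order of the list is a one-liner in Python. This is only a sketch: the function name `unique_ids` is hypothetical, and the sample IDs other than HT000052997 are made up.

```python
def unique_ids(ids):
    """Drop blank lines and repeated IDs, keeping each ID's first occurrence."""
    cleaned = (line.strip() for line in ids)
    # dict preserves insertion order, so the original ordering survives.
    return list(dict.fromkeys(i for i in cleaned if i))


# Hypothetical sample with one repeated ID and one blank line.
sample = ["HT000052997", "HT000000001", "", "HT000052997\n"]
print(unique_ids(sample))  # ['HT000052997', 'HT000000001']
```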

dr0i commented 5 years ago

done.

acka47 commented 5 years ago

Colleague S.S. wrote on 2019-01-18:

I have now analyzed your list [i.e. the list above, A.P.] from 2018:

Conclusion: 74 records are not in lobid.

Analysis:

The list contains 2524 title IDNs:

  • 84 of these are deleted records = 2440
  • of those, 2366 have no local record = 74 (2229 with only 1 LOW field, 137 with no holdings at all)
  • 74 contain local records and are not in lobid.
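The figures in this analysis are internally consistent, as a quick check shows. The sketch below uses only the numbers from the analysis above; the variable names are illustrative.

```python
total_ids = 2524            # title IDNs in the list
deleted = 84                # deleted records
only_one_low_field = 2229   # records with only 1 LOW field
no_holdings = 137           # records with no holdings at all

# Records without a local record are the sum of the two subgroups.
without_local_record = only_one_low_field + no_holdings

# What remains has local records but is not in lobid.
remaining = total_ids - deleted - without_local_record
print(without_local_record, remaining)  # 2366 74
```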

I am not sure whether titles without a Lokalsatz (local record) should generally be excluded from the export to lobid-resources...

acka47 commented 5 years ago

As we are now dealing with far fewer than 2000 resources, this is not crucial to fix. Closing until the problem pops up again.