hbz / lobid-resources

Transformation, web frontend, and API for the hbz catalog as LOD
http://lobid.org/resources
Eclipse Public License 2.0

Alias switch failed: missing resources #1007

Closed · dr0i closed this issue 4 years ago

dr0i commented 4 years ago

The difference in the number of resources between the productive index and the staging index is 8118. Between these index builds, 3278 resources were deleted. The question is: where are the 4840 missing resources?

(Note that a week earlier something similar occurred: 6739 resources fewer, but only 3143 were deleted.)
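For reference, counting documents in the two indices and in the deletion index makes the gap reproducible. A minimal sketch, assuming Elasticsearch on localhost:9200 and hypothetical index names:

```bash
# Count documents in the productive index, the staging index, and the
# deletion index, then compute the unexplained gap. Host and index names
# are assumptions, not the actual cluster configuration.
PROD=$(curl -s 'http://localhost:9200/resources-prod/_count' | jq .count)
STAGE=$(curl -s 'http://localhost:9200/resources-staging/_count' | jq .count)
DELETED=$(curl -s 'http://localhost:9200/resources-deletions/_count' | jq .count)
echo "difference: $((PROD - STAGE)), deleted: $DELETED"
echo "unexplained: $((PROD - STAGE - DELETED))"
```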

dr0i commented 4 years ago

How many files are in the fulldump archive?

```
$ tar -tf /data/DE-605/mabxml/baseline/DE-605-aleph-baseline-marcxchange-2019071300.tar.gz | wc -l
20970208
```

By funny coincidence this is exactly the number of files in DE-605-aleph-baseline-marcxchange-2019072000.tar.gz. In the index, this fulldump plus the daily updates add up to 20946293, so the index contains 23915 resources fewer than the fulldump archive. This is strange. One probable cause are duplicates, i.e. resources with a unique sys number but the same HT id. It would be nice to determine how many of these there are (see the sketch below).

[Edit: grepped for the biggest aleph id; it was < 23M, so the next proposal naturally has no effect.]

But if the number of records in the fulldumps is the same, why is there still a discrepancy after one week of daily updates, even when computing the difference using the deletion index?
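One way to check the duplicates hypothesis from above is to extract every HT-style id from the baseline archive and count the ids that occur more than once. A sketch, assuming the ids appear verbatim in the records; the grep pattern is an assumption about the data, and cross-references between records would inflate the count:

```bash
# Stream all records from the baseline archive to stdout, pull out anything
# that looks like an HT id, and count ids occurring more than once.
# Note: this matches any HT-style string, so the result is an upper bound.
tar -xzOf /data/DE-605/mabxml/baseline/DE-605-aleph-baseline-marcxchange-2019071300.tar.gz \
  | grep -o 'HT[0-9]\{9\}' \
  | sort | uniq -d | wc -l
```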

As there were two WARNs on weywot4 like this one:

```
[2019-07-21T01:01:27,093][WARN ][o.e.c.a.s.ShardStateAction] [weywot4] [resources-20190721-0100][1] received shard failed for shard id [[resources-20190721-0100][1]], allocation id [waMcL_LkTfqQmMTNcnYZug], primary term [1], message [mark copy as stale]
```

and the es cluster had been up since March, we could:

- [x] restart the elasticsearch cluster (see the sketch below)
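A minimal sketch of such a restart with health checks, assuming Elasticsearch on localhost:9200 and a systemd unit named elasticsearch (both assumptions; on a multi-node cluster this would be done node by node):

```bash
# Check cluster health, restart the node, then wait until the cluster is
# green again before proceeding.
curl -s 'http://localhost:9200/_cluster/health?pretty'
sudo systemctl restart elasticsearch
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty'
```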
dr0i commented 4 years ago

OK, I found something, and the numbers nearly fit:

```
$ grep 'jsonMap without id' 20190714-0100-stage.startHbz02ToLobidResources.sh | wc -l
30695
$ grep 'jsonMap without id' 20190721-0100-stage.startHbz02ToLobidResources.sh | wc -l
33874
```

These documents are not indexed. There are 3179 more of them in the last fulldump. 3179 + 3143 (the deleted ones) = 6322, which is close to the 6739 we are trying to explain. So I am going to look into what these 'jsonMap without id' documents are about.
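A quick way to see what these records look like is to pull a few occurrences with surrounding context from the transformation log (file name taken from above):

```bash
# Show the first occurrences of 'jsonMap without id' with two lines of
# context before and after each hit.
grep -B 2 -A 2 'jsonMap without id' 20190721-0100-stage.startHbz02ToLobidResources.sh | head -n 40
```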

acka47 commented 4 years ago

Regarding the number of MAB XML clobs, our colleague B.S. writes:

There are currently 20,982,606 rows in table hbz50.Z00P (production, 2019-07-23, approx. 10:55).
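For comparison, such a row count can be obtained directly from the Aleph Oracle database, roughly like this (a sketch; user, password, and connect identifier are placeholders):

```bash
# Count the rows of the MAB XML clob table; credentials are placeholders.
echo 'SELECT COUNT(*) FROM hbz50.Z00P;' | sqlplus -s user/password@ALEPH
```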

dr0i commented 4 years ago

DigiBib:

```
$ ..._es2/de-605-digibib-20190706/_search?q=* | jq .hits.total
20974820
```

acka47 commented 4 years ago

Colleagues from DigiBib tracked down the root of the problem:

Between HT020128254 and HT020128255 there is a jump in the SYS numbers from 022936938 to 036319972. The export routine (oldAlephFetchBaseline), however, is configured to run through all SYS numbers up to 24000000. I can change the configuration to raise that limit (e.g. to 80000000), but it has to be checked what effect this would have on the union catalog and on the runtime of the job.
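To illustrate the failure mode: the actual oldAlephFetchBaseline routine is not shown in this thread, but conceptually it behaves like the following sketch, where fetch_record stands in for the real per-record export:

```bash
# Hypothetical sketch of a baseline export with a fixed SYS-number limit.
fetch_record() { echo "would export SYS $1"; }  # placeholder for the real export step

MAX_SYS=24000000  # the old limit; records with higher SYS numbers are skipped
for ((sys = 1; sys <= MAX_SYS; sys++)); do
  fetch_record "$sys"
done
# With the jump to SYS 036319972, everything above 24000000 was silently
# missing from the fulldump.
```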

By now, they have increased the limit to 50000000. We should do the same.

acka47 commented 4 years ago

Yesterday, we also got a report from an API user that resources are missing.

On 30.07.19 13:52, C.K. wrote:

Once again we compared our local SunRise holdings with the lobid data for which we hold stock. I noticed that some new titles and holdings data from 2019-07-11 onwards are already recorded in the union catalog database, but not yet in lobid.

Here are some examples:

HT020128872, created in Aleph on 2019-07-10
HT020130244, created in Aleph on 2019-07-11
HT020129186, created in Aleph on 2019-07-10

None of these three titles is in the lobid data yet.

Strangely, all three titles are available from lobid.org/hbz01, see e.g. http://lobid.org/hbz01/HT020128872. Shouldn't it then also be in the resources index? (It currently is not, see http://lobid.org/resources/HT020128872.)
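The discrepancy between the two endpoints can be checked quickly (URLs taken from above):

```bash
# Compare HTTP status codes of the hbz01 and resources endpoints for the
# three reported titles.
for ht in HT020128872 HT020130244 HT020129186; do
  hbz01=$(curl -s -o /dev/null -w '%{http_code}' "http://lobid.org/hbz01/$ht")
  res=$(curl -s -o /dev/null -w '%{http_code}' "http://lobid.org/resources/$ht")
  echo "$ht hbz01=$hbz01 resources=$res"
done
```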

fsteeg commented 4 years ago

By now, they have increased the limit to 50000000. We should do the same.

I've set the max value to 50000000 in /home/hduser/git/hbz-aleph-dumping/bin/fulldump-hbz-aleph.sh on weywot1. It should be picked up when the new full index is built over the weekend.
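The change presumably amounts to something like the following; the variable name is an assumption, as the script itself is not part of this thread:

```bash
# Hypothetical excerpt from fulldump-hbz-aleph.sh: raise the SYS-number
# limit so the export covers the jump described above.
MAX_SYS=50000000  # previously 24000000
```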

dr0i commented 4 years ago

Yes, there was a surprisingly big jump in the SYS numbers. The 50M limit should be safe. The data is back again.