dr0i closed this issue 4 years ago
How many files are in the fulldump archive?
$ tar -tf /data/DE-605/mabxml/baseline/DE-605-aleph-baseline-marcxchange-2019071300.tar.gz |wc -l
=> 20970208
By a funny coincidence, this is exactly the number of files in DE-605-aleph-baseline-marcxchange-2019072000.tar.gz.
In the index this fulldump plus the daily updates make:
20946293
So the index contains 23915 fewer resources than the fulldump archive. This is strange. One cause is probably duplicates, i.e. resources with a unique sys-number but the same HT-id. It would be nice to determine these.
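A minimal sketch of how such duplicates could be counted, assuming the (sys-number, HT-id) pairs have already been extracted from the dump. The function name and the sample IDs are hypothetical and only illustrate the idea: every HT-id that occurs n times costs n - 1 resources in the index.

```python
from collections import Counter


def count_surplus_records(records):
    """Return how many records are lost when only one record per HT-id is kept.

    `records` is an iterable of (sys_number, ht_id) pairs (assumed input format).
    """
    per_ht_id = Counter(ht_id for _sys_number, ht_id in records)
    return sum(n - 1 for n in per_ht_id.values() if n > 1)


# Hypothetical sample data, not taken from the real dump.
sample = [
    ("000000001", "HT000000001"),
    ("000000002", "HT000000002"),
    ("000000003", "HT000000002"),  # different sys-number, same HT-id
]

print(count_surplus_records(sample))  # 1
```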
"max" defines the internal "aleph-identifier" and must be raised from time to time (fulldump-hbz-aleph.sh)
[edit: grepped the biggest aleph-id, it was < 23M, so the next proposal naturally has no effect]:
However, if both fulldumps contain the same number of files, why is there a discrepancy after one week of daily updates when computing the difference using the deletion index?
As there were two WARNs at weywot4 like this one:
[2019-07-21T01:01:27,093][WARN ][o.e.c.a.s.ShardStateAction] [weywot4] [resources-20190721-0100][1] received shard failed for shard id [[resources-20190721-0100][1]], allocation id [waMcL_LkTfqQmMTNcnYZug], primary term [1], message [mark copy as stale]
and the es-cluster has been up since March, we could:
- [X] restart elasticsearch cluster
OK, I found something, and the numbers nearly fit:
$ grep 'jsonMap without id' 20190714-0100-stage.startHbz02ToLobidResources.sh|wc -l
=> 30695
$ grep 'jsonMap without id' 20190721-0100-stage.startHbz02ToLobidResources.sh|wc -l
=> 33874
These documents are not indexed. There are 3179 more of them in the latest fulldump. 3179 + 3143 (the deleted ones) = 6322, which is close to the 6739 we are trying to explain. So I am going to look into what these 'jsonMap without id' messages are about.
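The arithmetic behind these numbers, written out as a small sketch. All figures come from this thread; the variable names are made up for illustration.

```python
# Counts of 'jsonMap without id' log lines in the two staging runs.
without_id_previous = 30695  # run 20190714-0100
without_id_latest = 33874    # run 20190721-0100

deleted = 3143  # resources removed via the deletion index
gap = 6739      # difference between the indices we are trying to explain

additional_without_id = without_id_latest - without_id_previous
explained = additional_without_id + deleted
still_unexplained = gap - explained

print(additional_without_id, explained, still_unexplained)  # 3179 6322 417
```

So roughly 417 resources would still be unaccounted for after this step.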
Re. number of MAB XML clobs, our colleague B.S. writes:
There are currently 20,982,606 rows in the table hbz50.Z00P (production, 23.07.2019, approx. 10:55).
Dbib:
..._es2/de-605-digibib-20190706/_search?q=* | jq .hits.total
=> 20974820
Colleagues from DigiBib found out the root of the problem:
Between HT020128254 and HT020128255 the SYS numbers jump from 022936938 to 036319972. However, the export routine (oldAlephFetchBaseline) is configured to run through all SYS numbers only up to 24000000. I can change the configuration to raise the limit (e.g. to 80000000), but we need to check what effect this would have on the union catalogue and on the runtime of the job.
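To illustrate why a fixed upper bound silently drops records after such a jump, here is a minimal sketch. It is not the actual export routine: `export_baseline` is a hypothetical stand-in, and treating the limit as inclusive is an assumption. Only the two SYS numbers and the limits are taken from this thread.

```python
def export_baseline(sys_numbers, limit):
    """Keep only the SYS numbers that a scan up to `limit` would reach.

    Hypothetical stand-in for the real export; the limit is assumed inclusive.
    """
    return [n for n in sys_numbers if n <= limit]


# The two SYS numbers on either side of the jump reported above.
sys_numbers = [22936938, 36319972]

exported_old_limit = export_baseline(sys_numbers, 24_000_000)
exported_new_limit = export_baseline(sys_numbers, 50_000_000)

print(len(exported_old_limit), len(exported_new_limit))  # 1 2
```

With the old limit, every record created after the jump is outside the scanned range, which matches the symptom that newly catalogued titles were missing.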
By now, they increased the limit to 50000000. We should also do this.
Yesterday, we also got a report from an API user that resources are missing.
On 30.07.19 13:52, C.K. wrote:
Once again we compared our local SunRise holdings with the lobid data for which we have holdings. I noticed that some new titles and holdings data from 11.07.2019 onwards are already recorded in the union database, but not yet in lobid.
Here are some examples:
HT020128872, Aleph creation date 10.07.2019
HT020130244, Aleph creation date 11.07.2019
HT020129186, Aleph creation date 10.07.2019
None of these three titles is in the lobid data yet.
Strangely, all three titles are available from lobid.org/hbz01, see e.g. http://lobid.org/hbz01/HT020128872. Shouldn't it also be in the index then? (It currently is not, see http://lobid.org/resources/HT020128872.)
By now, they increased the limit to 50000000. We should also do this.
I've set the max value to 50000000 in /home/hduser/git/hbz-aleph-dumping/bin/fulldump-hbz-aleph.sh on weywot1. It should be used when building the new full index over the weekend.
Yes, there was a surprisingly high jump concerning the SYS-number. The 50M should be safe. Data is back again.
The difference in the number of resources between the productive index and staging is 8118. Between these indices, 3278 resources were deleted. The question is: where are the 4840 missing resources?
(Note that a week earlier something similar occurred: 6739 fewer resources, but only 3143 were deleted.)
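For reference, the unexplained remainder for both weeks can be computed like this. The numbers are the ones stated above; the helper name is made up for this sketch.

```python
def missing_after_deletions(index_gap, deleted):
    """Resources missing from the new index that deletions do not account for."""
    return index_gap - deleted


print(missing_after_deletions(8118, 3278))  # 4840
print(missing_after_deletions(6739, 3143))  # 3596
```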