Closed TobiasNx closed 2 months ago
@acka47 says we have to check if DNB offers 10 min updates as RDF
As it turns out, the problem is that DNB doesn't update their RDF data this frequently, only the MARC-XML, so we would have to write some ETL (where is the DNB's morph? Could we reuse it?). The last comment from @TobiasNx wasn't visible yet when I made this claim, so go ahead and check! :+1:
"
Der Abfragezeitraum sollte nicht zu weit reichen, um eine Treffermenge über 100.000 Datensätzen zu vermeiden. Empfehlung bei nicht zeitkritischen Verfahren für Abfragezeitraum/Frequenz: 30 Minuten. Bei kleinen Sets (z. B. Online-Dissertationen) reicht ein einmal tägliches oder einmal wöchentliches Harvesting aus, da dadurch ein Datensatz, der in diesem Zeitraum mehrfach geändert wurde, nur einmal bezogen und die Treffermenge trotzdem nicht zu groß wird.
Wir empfehlen zudem als Wiederaufsetzzeitpunkt ("from") die Zeitangabe im Element "responseDate", z. B.
From: https://www.dnb.de/DE/Professionell/Metadatendienste/Datenbezug/OAI/oai_node.html
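The incremental harvesting scheme the DNB describes (feed the previous response's `responseDate` back in as the next request's `from` parameter) can be sketched roughly like this; the endpoint URL, metadata prefix and set name in the example are assumptions based on DNB's public OAI documentation, not values from this thread:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_list_records_url(base_url, metadata_prefix, set_name, from_ts=None):
    """Build an OAI-PMH ListRecords URL; 'from' resumes where the last harvest ended."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_name}
    if from_ts:
        params["from"] = from_ts
    return base_url + "?" + urlencode(params)

def response_date(oai_xml):
    """Extract the responseDate to use as the next harvest's 'from' value."""
    root = ET.fromstring(oai_xml)
    return root.findtext(OAI_NS + "responseDate")

# Hypothetical example call (check DNB's OAI documentation for the real values):
url = build_list_records_url(
    "https://services.dnb.de/oai/repository", "RDFxml",
    "authorities", "2023-12-01T10:40:00Z")
```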
The document "Der Linked-Data-Service der Deutschen Nationalbibliothek: Auslieferung der Metadaten" also says:
> The RDF data can be obtained via the DNB interfaces OAI, SRU and the Datenshop. The metadata delivered via these channels is up to date.
So we should just try out shorter update intervals, I guess.
Hourly seems to be possible:
If #355 is merged the cron scheduler can be adjusted to get the data e.g. every 10 minutes.
Since fetching data every 10 minutes often results in an empty data set, we disable the emails that warn about empty data sets for now. We may want to discuss this further, e.g. implement a daily report, or inspect the OAI-PMH server's response and act on it (i.e. ignore the case where the server reports `<error code="noRecordsMatch"/>`).
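A minimal sketch of the suggested check (the function name and wiring are hypothetical, not lobid-gnd code): only treat an empty result as a problem when the server did *not* report `noRecordsMatch`:

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def is_no_records_match(oai_xml):
    """True if the OAI-PMH response carries <error code="noRecordsMatch"/>,
    i.e. the empty result is expected and no warning email should be sent."""
    root = ET.fromstring(oai_xml)
    for error in root.iter(OAI_NS + "error"):
        if error.get("code") == "noRecordsMatch":
            return True
    return False
```

With this in place, the warning email could stay enabled and simply be skipped whenever `is_no_records_match(...)` is true.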
Scheduled to get data every 10 minutes.
(Note that every call triggers a build via sbt (which takes ~400% CPU for around 10 seconds), and this happens 3 times: 1. ConvertUpdates, 2. index updates to gnd-test, 3. index updates to gnd production. It would be nice to have a running webhook listener that just starts the process from an already running instance, as in https://github.com/hbz/lobid-resources/issues/1159.)
For the moment we test if the GND-updates.jsonl file is empty, and if so, skip the 2nd and 3rd sbt builds (indexing).
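The check can be as simple as testing the file size before kicking off the two indexing builds. A sketch under assumptions: the file name is from this thread, but the surrounding script and the sbt targets are placeholders, not the actual lobid-gnd build:

```python
import os
import subprocess

UPDATES_FILE = "GND-updates.jsonl"

def has_updates(path=UPDATES_FILE):
    """Only run the indexing builds when the updates file exists and is non-empty."""
    return os.path.exists(path) and os.path.getsize(path) > 0

if has_updates():
    # Placeholder sbt invocations; the real targets live in the lobid-gnd build.
    subprocess.run(["sbt", "indexUpdatesTest"], check=True)
    subprocess.run(["sbt", "indexUpdatesProduction"], check=True)
```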
We checked this and it seems to work. We got 7 new resources in the last 20 minutes! We should blog about it and inform users.
We should also update http://lobid.org/gnd/dataset:
"Datenbasis sind die RDF-Version der GND (täglich aktualisiert)" ("The data basis is the RDF version of the GND, updated daily")
(Hm, I wonder why the data seems to be updated only every hour, not every 10 minutes, even though we try to fetch it every 10 minutes. Maybe the RDF dumps are not provided as often as the PICA data? If that's the case we should lower the fetch frequency, @acka47.)
As we have just discussed in the review, we will schedule hourly updates.
Done scheduling hourly: `Every_hour:40m`. :+1: Note: blog when the new full indexing is done.
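As a plain crontab entry, "every hour at minute 40" would look like this (the script path is a placeholder, not the actual deployment; lobid-gnd uses its own scheduler configuration):

```shell
# Run the OAI-PMH update hourly at minute 40 (path is a placeholder)
40 * * * * /path/to/run-gnd-updates.sh
```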
No one is willing to write the blog post. As this is not mandatory, I am closing this issue.
I think we still should do this.
We could keep it short:
Title: Hourly update intervals for lobid-gnd
For lobid-gnd we fetch the GND as RDF/XML via OAI-PMH from the DNB and transform it to JSON-LD. At SWIB 2023 a colleague from UB Münster suggested shortening the interval for ingesting OAI-PMH updates, which we had been running on a daily basis.
Now we are glad to announce that lobid-gnd provides hourly updates, so you don't have to wait until the next day to get current GND data.
Have fun with it.
I've deployed it, see https://blog.lobid.org/.
The request to use a shorter update interval came from W.G. at UB Münster. Daily updates would not be sufficient for them, and DNB provides updates every 10 minutes.
Perhaps we should shorten our update interval further, even if we don't reach 10 minutes.