hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

Use shorter interval for updates #350

Closed TobiasNx closed 2 months ago

TobiasNx commented 9 months ago

W.G. from UB Münster requested a shorter interval for the updates. Daily updates are not sufficient for them, and DNB provides updates every 10 minutes.

Perhaps we could shorten our update interval, even if we do not reach 10 minutes.

TobiasNx commented 9 months ago

@acka47 says we have to check whether DNB offers 10-minute updates as RDF.

dr0i commented 9 months ago

As it turns out, the problem is that DNB doesn't update their RDF data this frequently, only the MARC-XML. So we would have to write some ETL (where is the DNB's morph? Could we reuse it?). The last comment from @TobiasNx was not shown when I made this claim. So go on and check! :+1:

TobiasNx commented 9 months ago

"

The query period should not extend too far, in order to avoid a result set of more than 100,000 records. Recommendation for query period/frequency in non-time-critical processes: 30 minutes. For small sets (e.g. online dissertations), harvesting once a day or once a week is sufficient, because a record that was changed several times within this period is then fetched only once and the result set still does not become too large. We also recommend using the time given in the "responseDate" element, e.g. 2017-08-30T08:12:54Z, as the resumption point ("from"), because this time best corresponds to the current availability of the data in our repository. In addition, we recommend harvesting with a small temporal overlap ("responseDate" minus one minute = "from"). "

From: https://www.dnb.de/DE/Professionell/Metadatendienste/Datenbezug/OAI/oai_node.html
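The "responseDate" minus one minute recommendation could be sketched like this. It is only an illustration, not lobid-gnd's actual harvesting code: the function name `next_from` and the sample response are hypothetical, and it assumes the standard OAI-PMH 2.0 namespace and UTC timestamps with second granularity, as in the DNB example.

```python
from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace (assumed to be used by the response).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
STAMP = "%Y-%m-%dT%H:%M:%SZ"


def next_from(oai_response_xml: str,
              overlap: timedelta = timedelta(minutes=1)) -> str:
    """Return the 'from' value for the next harvest:
    the response's responseDate minus a small overlap."""
    root = ET.fromstring(oai_response_xml)
    response_date = root.findtext(f"{OAI_NS}responseDate")
    if response_date is None:
        raise ValueError("no responseDate element in OAI-PMH response")
    parsed = datetime.strptime(response_date.strip(), STAMP)
    parsed = parsed.replace(tzinfo=timezone.utc)
    return (parsed - overlap).strftime(STAMP)


# Hypothetical, heavily shortened response for illustration.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2017-08-30T08:12:54Z</responseDate>
</OAI-PMH>"""

print(next_from(sample))  # 2017-08-30T08:11:54Z
```

Storing this value after each run and passing it as `from` in the next request gives the small overlap DNB recommends, at the cost of occasionally fetching a record twice.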

acka47 commented 9 months ago

Also, "Der Linked-Data-Service der Deutschen Nationalbibliothek: Auslieferung der Metadaten" reads:

The RDF data can be obtained via the DNB interfaces OAI, SRU, and the Datenshop. The metadata delivered through these channels reflects the current state.

So we should just try out shorter update intervals, I guess.

TobiasNx commented 8 months ago

Hourly seems to be possible:

https://services.dnb.de/oai/repository?verb=ListIdentifiers&from=2023-10-10T07:08:23Z&until=2023-10-10T08:08:23Z&set=authorities&metadataPrefix=RDFxml (228 records)

https://services.dnb.de/oai/repository?verb=ListIdentifiers&from=2023-10-10T08:08:23Z&until=2023-10-10T09:08:23Z&set=authorities&metadataPrefix=RDFxml (286 records)

https://services.dnb.de/oai/repository?verb=ListIdentifiers&from=2023-10-10T10:08:23Z&until=2023-10-10T11:08:23Z&set=authorities&metadataPrefix=RDFxml (332 records)

https://services.dnb.de/oai/repository?verb=ListIdentifiers&from=2023-10-10T11:08:23Z&until=2023-10-10T12:08:23Z&set=authorities&metadataPrefix=RDFxml (101 records)

https://services.dnb.de/oai/repository?verb=ListIdentifiers&from=2023-10-10T12:08:23Z&until=2023-10-10T13:08:23Z&set=authorities&metadataPrefix=RDFxml (0 records)
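For reference, the hourly windows queried above could be generated like this. It is a small sketch that only builds the request URLs and does not contact the server; the helper names `list_identifiers_url` and `hourly_urls` are made up for illustration, while the endpoint and parameters are the ones from the links above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

BASE_URL = "https://services.dnb.de/oai/repository"
STAMP = "%Y-%m-%dT%H:%M:%SZ"


def list_identifiers_url(start: datetime,
                         window: timedelta = timedelta(hours=1)) -> str:
    """Build a ListIdentifiers URL for the window [start, start + window]."""
    params = {
        "verb": "ListIdentifiers",
        "from": start.strftime(STAMP),
        "until": (start + window).strftime(STAMP),
        "set": "authorities",
        "metadataPrefix": "RDFxml",
    }
    # safe=":" keeps the colons in the timestamps unescaped, as in the links above.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


def hourly_urls(first: datetime, count: int) -> list[str]:
    """Build URLs for `count` consecutive one-hour windows."""
    return [list_identifiers_url(first + i * timedelta(hours=1))
            for i in range(count)]


first_window = datetime(2023, 10, 10, 7, 8, 23, tzinfo=timezone.utc)
for url in hourly_urls(first_window, 2):
    print(url)
```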

dr0i commented 7 months ago

If #355 is merged, the cron scheduler can be adjusted to get the data e.g. every 10 minutes.

dr0i commented 7 months ago

Since fetching data every 10 minutes often results in an empty data set, we have disabled the emails that warn about empty data sets for now. We may want to discuss this further, e.g. implement a daily report, or read the OAI-PMH server's header or response and act on it (i.e. ignore the case where the server reports <error code="noRecordsMatch"/>).
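Acting on the server's answer could look roughly like this. This is a sketch under assumptions, not what lobid-gnd currently does: the function name `is_no_records_match` and the sample responses are hypothetical, and only the `noRecordsMatch` error code and the namespace come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH 2.0 namespace.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def is_no_records_match(oai_response_xml: str) -> bool:
    """True if the server reported that the query simply matched no records,
    i.e. an empty result that does not warrant a warning email."""
    root = ET.fromstring(oai_response_xml)
    return any(err.get("code") == "noRecordsMatch"
               for err in root.iter(f"{OAI_NS}error"))


# Hypothetical, shortened responses for illustration.
empty_window = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2023-10-10T13:08:23Z</responseDate>
  <error code="noRecordsMatch"/>
</OAI-PMH>"""

bad_request = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2023-10-10T13:08:23Z</responseDate>
  <error code="badArgument"/>
</OAI-PMH>"""

print(is_no_records_match(empty_window))  # True
print(is_no_records_match(bad_request))   # False
```

With such a check, a warning would only be sent when the data set is empty for some other reason than `noRecordsMatch`.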

dr0i commented 7 months ago

Scheduled to get data every 10 minutes. Note that every call triggers a build via sbt (which takes 400% CPU for around 10 seconds), and this is even done 3 times: 1. ConvertUpdates, 2. index updates to gnd-test, 3. index updates to gnd production. It would be nice to have a running webhook listener that just starts the process from an already running instance, as in https://github.com/hbz/lobid-resources/issues/1159. For the moment we test whether GND-updates.jsonl is empty and, if so, skip the 2nd and 3rd sbt builds (indexing).
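The emptiness test could be expressed like this. It is a sketch only: the actual check lives in the update script around the sbt calls and may be implemented differently, and the function name `has_updates` as well as the temporary file used for the demo are hypothetical.

```python
from pathlib import Path
import tempfile


def has_updates(path: Path) -> bool:
    """True if the JSON Lines file exists and contains at least one
    non-blank line; otherwise the indexing steps can be skipped."""
    if not path.is_file():
        return False
    with path.open(encoding="utf-8") as lines:
        return any(line.strip() for line in lines)


# Demo with a temporary stand-in for GND-updates.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    updates = Path(tmp) / "GND-updates.jsonl"
    updates.write_text("\n", encoding="utf-8")
    empty_result = has_updates(updates)    # only a blank line: skip indexing
    updates.write_text('{"id": "example"}\n', encoding="utf-8")
    filled_result = has_updates(updates)   # one record: run indexing

print(empty_result, filled_result)  # False True
```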

dr0i commented 7 months ago

We checked this and it seems to work. We got 7 new resources in the last 20 minutes! We should blog and inform users.

dr0i commented 7 months ago

We should also update http://lobid.org/gnd/dataset:

The data source is the RDF version of the GND (updated daily)

(Hm, I wonder why the data seems to be updated only every hour, not every 10 minutes, even though we try to get data every 10 minutes. Maybe the RDF dumps are not provided as often as the PICA data? If that's the case, we should fetch data less frequently, @acka47.)

acka47 commented 7 months ago

As we have just discussed in the review, we will schedule hourly updates.

dr0i commented 7 months ago

Done: updates are now scheduled hourly (Every_hour:40m). :+1: Note: blog when the new full indexing is done.

dr0i commented 2 months ago

Nobody is willing to write the blog post. As this is not mandatory, I am closing this issue here.

TobiasNx commented 2 months ago

I think we still should do this.

We could keep it short:

Title: Hourly update intervals for lobid-gnd

For lobid-gnd we fetch the GND as RDF-XML via OAI-PMH from the DNB and transform it to JSON-LD. At SWIB 2023 a colleague from UB Münster suggested shortening our interval for ingesting OAI-PMH updates, which we had provided on a daily basis.

Now we are glad to announce that lobid-gnd provides hourly updates, so you do not have to wait for the next day to get current GND data.

Have fun with it.

dr0i commented 2 months ago

I've deployed it, see https://blog.lobid.org/.