hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

Inconsistencies / missing data in automatic updates #372

Closed fsteeg closed 4 weeks ago

fsteeg commented 5 months ago

Via email feedback, original message on 12/22/23 12:38 by M.H.

New entry was missing in lobid-gnd:

https://services.dnb.de/oai/repository?verb=GetRecord&metadataPrefix=RDFxml&identifier=oai:dnb.de/authorities/1312101741

Latest update is now on 2023-12-27T11:12:51.000, which is 2023-12-27T10:12:51Z in OAI-PMH, as clarified by DNB via email on 12/22/23, 17:13 by J.R.
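The conversion described above can be sketched as follows. This is only an illustration, not code from lobid-gnd: the function name `local_to_oai_utc` is hypothetical, and it assumes Central European winter time as a fixed UTC+1 offset (summer time would need UTC+2 or a proper time zone database).

```python
from datetime import datetime, timedelta, timezone

# Assumption: Central European Time in winter, modelled as a fixed UTC+1 offset.
CET = timezone(timedelta(hours=1), "CET")


def local_to_oai_utc(local_timestamp: str, tz: timezone = CET) -> str:
    """Convert a naive local timestamp to the UTC form used in OAI-PMH requests."""
    # Attach the local offset, then shift to UTC.
    local = datetime.fromisoformat(local_timestamp).replace(tzinfo=tz)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# The modification time from the record, as shown in the data (local time).
print(local_to_oai_utc("2023-12-27T11:12:51.000"))  # 2023-12-27T10:12:51Z
```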

Fetching updates manually worked, the missing resource is now in lobid-gnd:

https://lobid.org/gnd/1312101741

However, the automatic update for that time span on the server is way too small:

sol@quaoar1:~/git/lobid-gnd$ ls -alh data/backup/GND-updates_2023-12-27T09:40:26Z_2023-12-27T10:40:25Z.*
1.2K Dec 27 10:40 data/backup/GND-updates_2023-12-27T09:40:26Z_2023-12-27T10:40:25Z.jsonl
3.8K Dec 27 10:40 data/backup/GND-updates_2023-12-27T09:40:26Z_2023-12-27T10:40:25Z.rdf

Compared to the manual run for the same time span (sol@quaoar3:~/git/lobid-gnd$ sbt "runMain apps.ConvertUpdates 2023-12-27T09:40:26Z 2023-12-27T10:40:25Z"):

sol@quaoar3:~/git/lobid-gnd$ ls -alh GND-updates.*
143K Jan  8 12:34 GND-updates.jsonl
369K Jan  8 12:34 GND-updates.rdf

Might have been temporary network issues, but at least we need better monitoring.
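One possible shape for such monitoring, as a rough sketch only: flag update files that are much smaller than is typical. The function `flag_small_updates` and the `ratio` threshold are hypothetical and not part of lobid-gnd. A small file can also be legitimate in a quiet hour, so a flag would only be a hint to re-run the update manually and compare.

```python
from statistics import median
from typing import Dict, List


def flag_small_updates(sizes_by_file: Dict[str, int], ratio: float = 0.1) -> List[str]:
    """Return the names of files whose size is below `ratio` times the median size."""
    if not sizes_by_file:
        return []
    threshold = ratio * median(sizes_by_file.values())
    return sorted(name for name, size in sizes_by_file.items() if size < threshold)


# Sizes in bytes, approximated from the listings above (1.2K vs. 143K).
sizes = {"automatic.jsonl": 1229, "manual.jsonl": 146432}
print(flag_small_updates(sizes))  # ['automatic.jsonl']
```

In practice the median would be taken over a longer history of hourly files rather than over two values.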

dr0i commented 5 months ago

Might be related to https://github.com/hbz/lobid-gnd/issues/363.

witzigs commented 5 months ago

Hi, we link to lobid-gnd from a search interface (swisscollections.ch), and I received a question about why the links for two GND records don't return results on lobid-gnd. As far as I know, both were added to the GND on 14.12.2023:

https://services.dnb.de/oai/repository?verb=GetRecord&metadataPrefix=RDFxml&identifier=oai:dnb.de/authorities/1312495189
https://services.dnb.de/oai/repository?verb=GetRecord&metadataPrefix=RDFxml&identifier=oai:dnb.de/authorities/1312496002

Reading this issue, I assume that the records weren't added to lobid-gnd because of these update issues. I'm adding this feedback here in case more examples help you fix the problem.

Best regards, Silvia

fsteeg commented 4 months ago

> I got a question why the links for two GND records don't return results on lobid-gnd

I reindexed those two (underlying issue is still unresolved):

https://lobid.org/gnd/1312495189
https://lobid.org/gnd/1312496002

dr0i commented 4 months ago

Couldn't find the underlying problem. Especially strange is that the automatic updates are sometimes smaller than the later, manually invoked ones. As @fsteeg mentioned, this could be a temporary network issue. There might also be a problem on the provider's side. Because this problem is hard to debug, and also to cope with possible problems on the provider's side, I suggest running a daily update in addition to the hourly updates. This way we should be safer in getting all the data. If agreed, I will configure the additional daily update. Or are there better ideas?

acka47 commented 4 months ago

> Because this problem is hard to debug, and also to cope with possible problems on the provider's side, I suggest running a daily update in addition to the hourly updates. This way we should be safer in getting all the data. If agreed, I will configure the additional daily update.

+1 This sounds like a good approach to me. Hasn't the number of reports risen since we switched to hourly updates in November (#350)? The question is whether hourly updates are a good idea in the first place if people cannot rely on them being carried out reliably.

dr0i commented 4 months ago

#378 works as a safety net.

Why we sometimes (as in "seldom") have trouble getting all the data hourly remains a puzzle. It could be interesting to ask DNB whether they notice issues on their side with the OAI-PMH service and data syncing.

fsteeg commented 4 months ago

(+1 for additional daily updates, I've approved #378)

> Why we sometimes (as in "seldom") have trouble getting all the data hourly remains a puzzle. It could be interesting to ask DNB whether they notice issues on their side with the OAI-PMH service and data syncing.

Could this in some way be related to the fact that the OAI-PMH interface expects UTC times, while the modification times in the data and the server use local time (see mail from J.R. on 2023-12-22)?

dr0i commented 4 months ago

From German Wikipedia (translated):

Adding one hour to UTC gives Central European Time (CET, German: MEZ), which applies for part of the year in Germany, Austria, Switzerland and other Central European countries. For Central European Summer Time (CEST, German: MESZ), which applies in summer, two hours must be added.

So indeed: when we query what we think is the span from one hour ago until now (in CET), we are in fact querying from now until one hour from now (in UTC). I wonder why there was any data at all. Going to fix it.
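A small sketch of the offset problem described above. It is not the actual fix from the repository: both function names are hypothetical, and CET is again assumed to be a fixed UTC+1 offset. The first function builds the hourly window in UTC, the second reproduces the suspected mistake of putting a `Z` on local wall-clock times.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

OAI_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def hourly_window_utc(now: datetime) -> Tuple[str, str]:
    """Return (from, until) for the hour ending at `now`, expressed in UTC."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    until = now.astimezone(timezone.utc)
    start = until - timedelta(hours=1)
    return start.strftime(OAI_FORMAT), until.strftime(OAI_FORMAT)


def hourly_window_mislabelled(now: datetime) -> Tuple[str, str]:
    """Suspected bug: local wall-clock times are formatted as if they were UTC."""
    until = now.replace(tzinfo=None)  # drops the offset without converting
    start = until - timedelta(hours=1)
    return start.strftime(OAI_FORMAT), until.strftime(OAI_FORMAT)


# Assumption: the server clock shows CET (UTC+1) at the time of the hourly run.
cet = timezone(timedelta(hours=1))
now = datetime(2023, 12, 27, 10, 40, 25, tzinfo=cet)

print(hourly_window_utc(now))          # ('2023-12-27T08:40:25Z', '2023-12-27T09:40:25Z')
print(hourly_window_mislabelled(now))  # ('2023-12-27T09:40:25Z', '2023-12-27T10:40:25Z')
```

With the mislabelled window, a run at 10:40 local time asks for 09:40Z to 10:40Z, which in UTC is the hour that is only just beginning. That would explain why the automatic update files contained so little.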

dr0i commented 4 months ago

Should be fixed with #379 "from now on". That is, I assume a complete reindexing is needed to catch up on all the possibly missing data, @fsteeg?

acka47 commented 4 months ago

A new dump should be provided soon:


Source: https://www.dnb.de/DE/Professionell/Metadatendienste/Datenbezug/Gesamtabzuege/gesamtabzuege_node.html#doc58272bodyText2

acka47 commented 4 months ago

We received a mail yesterday about missing records that were created last week. Example: https://lobid.org/gnd/1319507522

Creation date (see MARC) is: 2024-02-15

acka47 commented 4 months ago

> We received a mail yesterday about missing records that were created last week.

The example (https://lobid.org/gnd/1319507522) now works and I sent out a mail response.

acka47 commented 4 months ago

E.V. who sent the mail mentioned in https://github.com/hbz/lobid-gnd/issues/372#issuecomment-1960852456 followed up on it by providing more entries that are still missing. I went through them to see on which day they were created and found entries from the following days:

He closes the email by noting that the list is not exhaustive and that more entries are missing. As the missing updates significantly degrade the service, we should not wait for a new full dump but reindex titles, probably best starting at 2023-11-10, as this is the date on which we rescheduled the updates (see #350).

fsteeg commented 4 months ago

Reindexed updates since 2023-11-10, here are some of the examples from the mail:

https://lobid.org/gnd/1317861825
https://lobid.org/gnd/1317650069
https://lobid.org/gnd/1317239962
https://lobid.org/gnd/1317238400
https://lobid.org/gnd/1317163133
https://lobid.org/gnd/1317151534
https://lobid.org/gnd/1316984184

acka47 commented 4 months ago

+1 I'm fine with closing this issue now, but we should monitor closely whether updates come in reliably.

acka47 commented 4 weeks ago

Closing. Updates have been fine during the last weeks/months.