hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0
25 stars 5 forks source link

Don't write GND-lastSuccessfulUpdate.txt and send alert mail if exceptions were caught #338

Closed fsteeg closed 1 year ago

fsteeg commented 1 year ago

Since we don't abort our transformations on errors any more, but skip them instead, we have to make sure that the GND-lastSuccessfulUpdate.txt is written only if there were no exceptions, in order for the updates to be retried the next day (we noticed this due to a report about missing updates, seems it was a temporary 502 from the OAI-PMH proxy, worked when re-running the updates now). The existing retry-and-mail logic doesn't seem to work, probably due to the exception handling (not thrown, no retry, no mail?).

fsteeg commented 1 year ago

Since we don't abort our transformations on errors any more, but skip them instead [...]

It seems this is not actually related. I was thinking of ed7e70a, but the thing I saw in the log [1] seems unrelated, since it's happening before running the Metafacture workflow, in ConvertUpdates:

java.io.IOException: Server returned HTTP response code: 502 for URL: http://gnd-proxy.lobid.org/oai/repository?verb=ListRecords&resumptionToken=hhGHeNpNjMsJgDAQBStxOxD2JdkkexcLsANFwYAi-AHL14iIcxsYBj4A3hoBK4qmqs95ovbYx2VNexo2qgkqTLg79sZx5jG4GDRDxb_HO4zhC_NeIhQOQugRQ2e5tIZAYlkvk6YfqQ
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1944)
    at sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1939)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1938)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1508)
    at sun.net.www.protocol.http.HttpURLConnection.access$200(HttpURLConnection.java:91)
    at sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1484)
    at sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1482)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.AccessController.doPrivilegedWithCombiner(AccessController.java:782)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1481)
    at ORG.oclc.oai.harvester2.verb.HarvesterVerb.harvest(HarvesterVerb.java:249)
    at ORG.oclc.oai.harvester2.verb.HarvesterVerb.<init>(HarvesterVerb.java:183)
    at ORG.oclc.oai.harvester2.verb.ListRecords.<init>(ListRecords.java:67)
    at ORG.oclc.oai.harvester2.app.RawWrite.run(RawWrite.java:141)
    at apps.ConvertUpdates.process(ConvertUpdates.java:129)

Also, this happened for the 2022-12-15 update, so it's unrelated to the problem that triggered this issue (which happened on 2022-12-19). For this 2022-12-15 problem, there seems to be some issue with the dates used when retrying:

Calling OAI-PMH at http://gnd-proxy.lobid.org/oai/repository from 2022-12-16 until 2022-12-16
Found errors
[error: null]
Error record: <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2022-12-16T00:00:28Z</responseDate><request from="2022-12-16" metadataPrefix="RDFxml" set="authorities" until="2022-12-16" verb="ListRecords">https://services.dnb.de/oai/repository</request><error code="noRecordsMatch"/></OAI-PMH>
Tried 1 times to get the data, but it's empty.
Going to retry in 240 min
Calling OAI-PMH at http://gnd-proxy.lobid.org/oai/repository from 2022-12-15 until 2022-12-15

It seems like after an error when getting 2022-12-16, we're getting 2022-12-15 to recover. Looking into the issue on 2022-12-19, there also seems to be a problem related to the date, though different (calling from 2022-12-20 until 2022-12-19):

Calling OAI-PMH at http://gnd-proxy.lobid.org/oai/repository from 2022-12-20 until 2022-12-19
resumptionToken: 
Found errors
[error: null]
Error record: <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2022-12-21T00:00:56Z</responseDate><request from="2022-12-20" metadataPrefix="RDFxml" set="authorities" until="2022-12-19" verb="ListRecords">https://services.dnb.de/oai/repository</request><error code="noRecordsMatch">The value of argument until '2022-12-19T23:59:59Z' lies before argument from '2022-12-20T00:00:00Z'</error></OAI-PMH>
Calling OAI-PMH at http://gnd-proxy.lobid.org/oai/repository from 2022-12-21 until 2022-12-21

So I think these are the tasks here:

[1] Looking at output of e.g. quaoar1:~/git/lobid-gnd$ tail -n 30000 logs/cron.sh.log | grep -B 50 -A 50 2022-12-19

fsteeg commented 1 year ago

Looks like these were the days with errors during updates since 2022-10-13 (the data state of the latest baseline, see https://data.dnb.de/opendata/, "Stand:"), collected by looking at the output of quaoar1:~/git/lobid-gnd$ tail -n 3000000 logs/cron.sh.log | grep "HTTP response code\|Calling OAI-PMH":

fsteeg commented 1 year ago

So it seems the interval setting for multi-day updates was broken since 6e554aa ("Tweak end date for updates to avoid redundant data", 2022-12-09), resulting in the end date not being incremented. Fixed that in 0d7602f.

It seems the result was that when a single day update failed for some other reason, the following multi-day update failed too, for this reason. After my recent manual updates, the GND-lastSuccessfulUpdate.txt was set to some date in the past (the end date of my last manual update), triggering this problem, resulting in missing updates again. I've deployed 0d7602f to quaoar1 and ran the multi-day update since 2022-12-20 to restore the data and test the fix, looks good:

The reason we did not get an alert for the second issue (the follow-up error, caused by the faulty end date) seems to be that we only retry and alert if we got no data at all, so if e.g. the first day of multi-day updates worked, but the second fails, we get no alert. Opened #339 to improve that.

Assigning @dr0i for review, won't open a PR, changes are only 07d7b3b & 0d7602f.