hbz / lobid-gnd

UI and API to the Integrated Authority File (Gemeinsame Normdatei, GND)
http://lobid.org/gnd
Eclipse Public License 2.0

No updates for three days #317

Closed acka47 closed 1 year ago

acka47 commented 1 year ago

Since 2022-07-20 we have been getting notifications that GND updates have failed. I will have to investigate whether this is a problem on our side or at the data provider (DNB).

@thoffma Do you know anything about problems with the OAI-PMH updates for GND RDF?

acka47 commented 1 year ago

The DNB switched on redirection of HTTP requests to HTTPS on 2022-07-19 (see notification mail), one day prior to the first failed update notification.

We will probably only have to adjust the URLs for the harvester.
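
For illustration, the URL adjustment could look roughly like the following. This is a minimal sketch, assuming the harvester holds its OAI-PMH base URL as a plain string. The class name `HttpsUpgrade` and the method `toHttps` are hypothetical and not taken from the lobid-gnd code, and `example.org` is a placeholder host.

```java
public class HttpsUpgrade {

    /**
     * Returns the given URL with an "http://" prefix replaced by "https://".
     * URLs with any other scheme (including "https://") are returned unchanged.
     * Hypothetical helper, not part of lobid-gnd.
     */
    static String toHttps(String url) {
        String prefix = "http://";
        // Case-insensitive prefix check, since URL schemes are case-insensitive.
        if (url.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return "https://" + url.substring(prefix.length());
        }
        return url;
    }

    public static void main(String[] args) {
        // Placeholder URL, not the real DNB endpoint.
        System.out.println(toHttps("http://example.org/oai?verb=ListRecords"));
    }
}
```

In practice it may be simpler to change the configured URL by hand; a helper like this only matters if URLs come from several places.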

dr0i commented 1 year ago

Did a manual update as described in the README. After successfully harvesting several OAI-PMH batches (following multiple resumptionTokens), it exited with:

[...]
resumptionToken: 
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
    at org.apache.xerces.dom.DeferredDocumentImpl.synchronizeChildren(Unknown Source)
    at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeChildren(Unknown Source)
    at org.apache.xerces.dom.ParentNode.hasChildNodes(Unknown Source)
    at org.apache.xerces.dom.DeepNodeListImpl.nextMatchingElementAfter(Unknown Source)
    at org.apache.xerces.dom.DeepNodeListImpl.item(Unknown Source)
    at org.joox.Impl.find(Impl.java:512)
    at org.joox.Impl.find(Impl.java:477)
    at org.joox.Impl.find(Impl.java:79)
    at apps.ConvertUpdates.process(ConvertUpdates.java:112)
    at apps.ConvertUpdates.getUpdates(ConvertUpdates.java:89)
    at apps.ConvertUpdates.getUpdatesAndConvert(ConvertUpdates.java:69)
[...]
[ERROR] [07/28/2022 16:59:54.784] [sbt-web-scheduler-1] [ActorSystem(sbt-web)] exception on LARS’ timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded

After increasing the Java heap space in several steps I ended up with java -Xmx16G. Nevertheless, the "GC overhead limit exceeded" error persisted. However, the daily update cron job at 1:00 am finished successfully. @acka47 could also confirm that new data is available. So I'm going to close this.

dr0i commented 1 year ago

Reopening for @fsteeg to revisit the OutOfMemoryError and decide whether it is something we need to address or can simply ignore.

fsteeg commented 1 year ago

Seems to still be an issue, see mail by E.V., 2022-08-09.

fsteeg commented 1 year ago

It seems the error resulted in the updates being incomplete. After manually getting them for a single day, the missing resources reported in the email above are now included. Todo:

fsteeg commented 1 year ago

So it seems that for one day, 2022-07-21, the updates contained more data than usual. I've re-fetched and indexed the other days. The OutOfMemoryError happens here:

https://github.com/hbz/lobid-gnd/blob/62384387ab9a49433e691ef4ac5a8f136555b458/app/apps/ConvertUpdates.java#L118

The error comes from the find part, not the write, so we should probably use a non-JOOX / standard Java approach to get the Description XML tag here.

fsteeg commented 1 year ago

I've switched the processing of the OAI-PMH response to the Java streaming API for XML. With this, getting the updates for 2022-07-21 worked. I've reindexed the result so now all updates since 2022-07-19 are indexed and future large updates should be no problem.
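
To illustrate the general idea (not the actual change in the PR), here is a minimal sketch of reading Description elements with the Java streaming API for XML (StAX) instead of building a full DOM. It assumes the records are `rdf:Description` elements carrying an `rdf:about` attribute. The class name `StreamDescriptions`, the method `descriptionIds`, and the sample identifiers are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamDescriptions {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /**
     * Collects the rdf:about value of every rdf:Description element.
     * The input is read event by event, so memory use does not grow
     * with the size of the whole response as it does with a DOM tree.
     */
    static List<String> descriptionIds(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // DTDs are not needed here, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        List<String> ids = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "Description".equals(reader.getLocalName())
                        && RDF_NS.equals(reader.getNamespaceURI())) {
                    // Value is null if the element has no rdf:about attribute.
                    ids.add(reader.getAttributeValue(RDF_NS, "about"));
                }
            }
        } finally {
            reader.close();
        }
        return ids;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up sample input, not real GND data.
        String sample = "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:about=\"urn:example:a\"/>"
                + "<rdf:Description rdf:about=\"urn:example:b\"/>"
                + "</rdf:RDF>";
        System.out.println(descriptionIds(new StringReader(sample)));
    }
}
```

A real converter would write each record out as it is encountered rather than collecting values in a list, so that an unusually large day of updates is never held in memory at once.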

(Oddly, after indexing the 2022-07-21 data, the document count did not change. Perhaps the indexing for that day did work initially but failed later due to changing memory conditions on the server.)

Assigning @dr0i for review here and on the PR (https://github.com/hbz/lobid-gnd/pull/319).

dr0i commented 1 year ago

+1