Closed acka47 closed 1 year ago
The DNB switched on redirection of HTTP requests to HTTPS on 2022-07-19 (see notification mail), one day before the first failed update notification.
We will probably only have to adjust the URLs for the harvester.
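A minimal sketch of what adjusting the URLs could look like, assuming the harvester keeps its endpoint as a plain string. The class name `HarvesterUrls`, the helper `toHttps` and the `example.org` endpoint are hypothetical and not taken from the lobid-gnd code. As background: if the harvester happens to use `java.net.HttpURLConnection` (an assumption, the thread does not say which HTTP client is used), that class does not follow redirects from http to https by default, which would explain why a server-side redirect alone is not enough.

```java
public class HarvesterUrls {

    /**
     * Returns the URL with an "http://" prefix replaced by "https://".
     * URLs with any other scheme (including https) are returned unchanged.
     */
    static String toHttps(String url) {
        String prefix = "http://";
        // Case-insensitive prefix check, so "HTTP://..." is handled too
        if (url.regionMatches(true, 0, prefix, 0, prefix.length())) {
            return "https://" + url.substring(prefix.length());
        }
        return url;
    }

    public static void main(String[] args) {
        // Hypothetical endpoint, for illustration only
        System.out.println(toHttps("http://example.org/oai/repository?verb=ListRecords"));
    }
}
```

In practice, changing the configured base URL once is probably simpler than rewriting it at runtime; the helper only illustrates the intended change.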
Did a manual update as noted in the README. After some successful OAI-PMH harvesting with several resumptionTokens, it exited with:
[...]
resumptionToken:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xerces.dom.DeferredDocumentImpl.getNodeObject(Unknown Source)
at org.apache.xerces.dom.DeferredDocumentImpl.synchronizeChildren(Unknown Source)
at org.apache.xerces.dom.DeferredElementNSImpl.synchronizeChildren(Unknown Source)
at org.apache.xerces.dom.ParentNode.hasChildNodes(Unknown Source)
at org.apache.xerces.dom.DeepNodeListImpl.nextMatchingElementAfter(Unknown Source)
at org.apache.xerces.dom.DeepNodeListImpl.item(Unknown Source)
at org.joox.Impl.find(Impl.java:512)
at org.joox.Impl.find(Impl.java:477)
at org.joox.Impl.find(Impl.java:79)
at apps.ConvertUpdates.process(ConvertUpdates.java:112)
at apps.ConvertUpdates.getUpdates(ConvertUpdates.java:89)
at apps.ConvertUpdates.getUpdatesAndConvert(ConvertUpdates.java:69)
[...]
[ERROR] [07/28/2022 16:59:54.784] [sbt-web-scheduler-1] [ActorSystem(sbt-web)] exception on LARS’ timer thread
java.lang.OutOfMemoryError: GC overhead limit exceeded
After increasing the Java heap space in several steps I ended up with java -Xmx16G. Nevertheless, the GC overhead limit exceeded error still occurred. However, the daily cronjob of updates at 1:00 am ended successfully. Also, @acka47 could confirm that new data is available.
So gonna close this.
Reopening for @fsteeg to revisit the OutOfMemoryError and to decide whether this error is something to think about or can just be ignored.
Seems to still be an issue, see mail by E.V., 2022-08-09.
It seems the error resulted in the updates being incomplete. After manually getting them for a single day, the missing resources reported in the email above are now included. Todo:
So it seems that for one day, 2022-07-21, the updates contained more data than usual. I've re-fetched and indexed the other days. The OutOfMemoryError happens here:
The error comes from the find part, not the write, so we should probably use a non-JOOX / standard Java approach to get the Description XML tag here.
I've switched the processing of the OAI-PMH response to the Java streaming API for XML. With this, getting the updates for 2022-07-21 worked. I've reindexed the result so now all updates since 2022-07-19 are indexed and future large updates should be no problem.
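For illustration, a minimal sketch of the streaming idea using the standard StAX API (`javax.xml.stream`). This is not the code from the PR. The class name `DescriptionStream`, the method `descriptionIds` and the sample XML are made up for this example, and it only collects the `rdf:about` values of `rdf:Description` elements instead of writing the records out. The point is that the response is read event by event, so no DOM tree of the whole response has to be kept in memory.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DescriptionStream {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Streams over the XML and returns the rdf:about value of each rdf:Description that has one. */
    static List<String> descriptionIds(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Harvested XML should not need DTDs or external entities
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        List<String> ids = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // Only start tags named Description in the RDF namespace are of interest
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Description".equals(reader.getLocalName())
                        && RDF_NS.equals(reader.getNamespaceURI())) {
                    String about = reader.getAttributeValue(RDF_NS, "about");
                    if (about != null) { // skip descriptions without rdf:about
                        ids.add(about);
                    }
                }
            }
        } finally {
            reader.close();
        }
        return ids;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up, heavily shortened OAI-PMH response for illustration
        String sample = "<OAI-PMH xmlns=\"http://www.openarchives.org/OAI/2.0/\">"
                + "<ListRecords><record><metadata>"
                + "<rdf:RDF xmlns:rdf=\"" + RDF_NS + "\">"
                + "<rdf:Description rdf:about=\"https://example.org/gnd/1\"/>"
                + "</rdf:RDF></metadata></record></ListRecords></OAI-PMH>";
        System.out.println(descriptionIds(new StringReader(sample)));
    }
}
```

Memory use of such a loop depends on what is collected per element rather than on the size of the whole response, which is why a single unusually large day like 2022-07-21 should no longer matter.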
(Oddly, after indexing the 2022-07-21 data, the document count did not change. Perhaps the indexing for that day did work initially but failed later due to changing memory conditions on the server.)
Assigning @dr0i for review here and on the PR (https://github.com/hbz/lobid-gnd/pull/319).
+1
Since 2022-07-20 we get notifications that GND updates have failed. I will have to investigate whether this is a problem on our side or at the data provider (DNB).
@thoffma Do you know anything about problems with the OAI-PMH updates for GND RDF?