NullPointerExceptions killing ToeThreads

anjackson commented 5 years ago

We're seeing really odd fatal errors, killing off ToeThreads in crawls:

SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #99: https://www.andersonslimited.co.uk/robots.txt [Wed Jul 25 09:08:01 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

Looking at the code, this shouldn't really be possible!

Going up the call tree, it appears the peekItem has become inconsistent with, i.e. reset to null.

Note that NetArchive Suite have also seen this issue and patched it in this way.

Also observing

java.util.ConcurrentModificationException
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1247)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1242)
        at com.esotericsoftware.kryo.serialize.MapSerializer.writeObjectData(MapSerializer.java:90)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
        at com.esotericsoftware.kryo.Kryo.writeObjectData(Kryo.java:453)
        at com.esotericsoftware.kryo.ObjectBuffer.writeObjectData(ObjectBuffer.java:262)
        at org.archive.bdb.KryoBinding.objectToEntry(KryoBinding.java:81)
        at com.sleepycat.collections.DataView.useValue(DataView.java:549)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:817)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:122)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:379)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:690)
        at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
        at org.archive.crawler.frontier.AbstractFrontier.doJournalAdded(AbstractFrontier.java:647)
        at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:410)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:333)
        at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:554)
        at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:378)
        at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:356)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
        at uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageHandler.run(KafkaUrlReceiver.java:468)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)```

So, what seems to be happening, I think, is that occasionally, between this statement and this one, the WorkQueue gets updated by a separate thread in a way that forces it to get written out to disk and then read back in again. As peekItem is transient, flushing it out to the disk and back drops the value and we're left with a null.

anjackson commented 5 years ago

So, re-using the beginDisposition/endDisposition lock used by the ToeThreads should resolve this issue, although locking the whole frontier seems likely to cause contention and slow things down somewhat. However, I can see no way to implement WorkQueue-level locking on the current frontier implementation.

anjackson commented 5 years ago

Note that my multi-threaded CrawlMessageHandler was also interfering with itself, for the same reason, which was the cause of the ConcurrentModificationException errors.

anjackson commented 5 years ago

Re-running the domain crawl now and fortunately the slow-down seems fairly minor. However, #ToeThreads still not stable. After a day of crawling 27 of 1,600 ToeThreads have died.

anjackson commented 5 years ago

Oh dear, still not got it:

SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #82: https://irp-cdn.multiscreensite.com/af541788/dms3rep/multi/mobile/wash-n-fold_icon-113x113.svg [Tue Aug 28 20:37:21 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

anjackson commented 5 years ago

Hmm, there are other points where WorkQueue.makeDirty is called, which might be the cause. For example, when the controller asks for the next() URI any ready queue with no elements in it could be marked with noteExausted() which then could call makeDirty() at the wrong moment. Could track calls to makeDirty(), perhaps?

It does seem that the rate of ToeThread death is somewhat slower than before, but it's hard to remember and Prometheus appears not to have discarded the old from the last run, somehow.

anjackson commented 5 years ago

Looking back at older logs, the thread death rate was much worse before (e.g. around 30 out of 100 ToeThreads suffered a fatal error in the first day of the crawl). So these changes do appear to have helped, but some other part of the code is still interfering.

anjackson commented 5 years ago

Okay, this appears to be resolved by https://github.com/internetarchive/heritrix3/pull/213 so closing here, may take longer to close upstream.

ukwa / ukwa-heritrix

NullPointerExceptions killing ToeThreads #16