So, re-using the `beginDisposition`/`endDisposition` lock used by the ToeThreads should resolve this issue, although locking the whole frontier seems likely to cause contention and slow things down somewhat. However, I can see no way to implement `WorkQueue`-level locking on the current frontier implementation.
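For illustration, here is a minimal sketch of what serialising the disposition window with one shared lock looks like. The class and method shapes are hypothetical (this is not the actual Heritrix frontier API); the point is just that all queue mutation runs under the same lock, which is also why it is coarse enough to cause contention:

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not Heritrix code: one frontier-wide lock guards the
// whole disposition window, analogous to the beginDisposition/endDisposition
// lock the ToeThreads already take.
class LockedFrontier {
    private final ReentrantLock dispositionLock = new ReentrantLock();

    void beginDisposition() { dispositionLock.lock(); }
    void endDisposition()   { dispositionLock.unlock(); }

    // Any code that mutates a WorkQueue runs inside the same lock, so a
    // concurrent writer can no longer force the queue out to disk mid-operation.
    void processFinish(Runnable queueMutation) {
        beginDisposition();
        try {
            queueMutation.run();
        } finally {
            endDisposition();
        }
    }
}
```

The trade-off is exactly the one noted above: every ToeThread's finish path funnels through a single lock, so throughput drops as thread count rises.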
Note that my multi-threaded `CrawlMessageHandler` was also interfering with itself for the same reason, which was the cause of the `ConcurrentModificationException` errors.
Re-running the domain crawl now, and fortunately the slow-down seems fairly minor. However, the ToeThreads are still not stable: after a day of crawling, 27 of 1,600 ToeThreads have died.
Oh dear, still not got it:

```
SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #82: https://irp-cdn.multiscreensite.com/af541788/dms3rep/multi/mobile/wash-n-fold_icon-113x113.svg [Tue Aug 28 20:37:21 GMT 2018]
java.lang.NullPointerException
    at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
    at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
    at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
    at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
```
Hmm, there are other points where `WorkQueue.makeDirty` is called, which might be the cause. For example, when the controller asks for the `next()` URI, any ready queue with no elements in it could be marked with `noteExausted()`, which could then call `makeDirty()` at the wrong moment. Could track calls to `makeDirty()`, perhaps?
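One cheap way to do that tracking would be a small diagnostic helper (hypothetical, not existing Heritrix code) that records a stack trace on every `makeDirty()` call, so that when a `null` `peekItem` later turns up, the log shows which caller dirtied the queue and from which thread:

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical diagnostic sketch: capture a stack trace per makeDirty() call,
// keeping only the most recent entries to bound memory use.
public class DirtyTracker {
    private final ConcurrentLinkedDeque<Throwable> calls = new ConcurrentLinkedDeque<>();

    /** Call this from makeDirty() to capture the caller's stack. */
    public void note() {
        calls.add(new Throwable("makeDirty() called by thread "
                + Thread.currentThread().getName()));
        while (calls.size() > 16) {
            calls.pollFirst(); // drop oldest entries beyond the cap
        }
    }

    /** Number of currently retained call records. */
    public int recorded() {
        return calls.size();
    }

    /** Dump the recorded call sites, e.g. when a null peekItem is detected. */
    public void dump() {
        for (Throwable t : calls) {
            t.printStackTrace();
        }
    }
}
```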
It does seem that the rate of ToeThread death is somewhat slower than before, but it's hard to tell, as Prometheus appears, somehow, not to have discarded the old data from the last run.
Looking back at older logs, the thread death rate was much worse before (e.g. around 30 out of 100 ToeThreads suffered a fatal error in the first day of the crawl). So these changes do appear to have helped, but some other part of the code is still interfering.
Okay, this appears to be resolved by https://github.com/internetarchive/heritrix3/pull/213, so closing here; it may take longer to close upstream.
We're seeing really odd fatal errors, killing off ToeThreads in crawls:
Looking at the code, this shouldn't really be possible!
Going up the call tree, it appears the `peekItem` has become inconsistent, i.e. reset to `null`. Note that NetArchive Suite have also seen this issue and patched it in this way.
Also observing
So, what seems to be happening, I think, is that occasionally, between this statement and this one, the `WorkQueue` gets updated by a separate thread in a way that forces it to get written out to disk and then read back in again. As `peekItem` is `transient`, flushing it out to disk and back drops the value and we're left with a `null`.
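A standalone demonstration of the underlying Java behaviour (toy classes, not Heritrix code): a `transient` field survives in memory but comes back `null` after a serialize/deserialize round trip, because Java serialization skips transient fields entirely:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {
    // Toy stand-in for a work queue: one ordinary field, one transient field.
    static class Queue implements Serializable {
        String name = "q1";
        transient String peekItem = "https://example.org/"; // dropped on write-out

        // Write this object to bytes and read it back, mimicking a queue
        // being flushed to disk and reloaded.
        Queue roundTrip() throws IOException, ClassNotFoundException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(this);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (Queue) ois.readObject();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Queue q = new Queue().roundTrip();
        System.out.println(q.name);     // prints "q1"
        System.out.println(q.peekItem); // prints "null"
    }
}
```

This is exactly the failure shape above: any code path that forces the queue out and back in silently clears `peekItem`, and a later `dequeue()` then trips over the `null`.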