SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #989: dns:007bond.co.uk [Fri Jul 12 20:13:38 GMT 2019]
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at org.archive.crawler.reporting.StatisticsTracker.saveSourceStats(StatisticsTracker.java:767)
at org.archive.crawler.reporting.StatisticsTracker.crawledURISuccessful(StatisticsTracker.java:760)
at org.archive.crawler.reporting.StatisticsTracker.onApplicationEvent(StatisticsTracker.java:986)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:97)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:303)
at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:977)
at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:576)
at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)
Just had a large crawl die horribly because of the exception above, which happened a lot and killed all the ToeThreads. This was because, in this code:
https://github.com/internetarchive/heritrix3/blob/aa705bef2eb8fbbb9aeb56995e713a7b6ba0ed00/engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java#L759-L760
If the A_SOURCE_TAG is set, but the value is null, it tries to use the null later and ConcurrentHashMap throws an NPE on a null key.
Looking at what happened, this appears to arise on DNS records of URLs discovered via the robots.txt, e.g. these events (that appeared to occur out of order, as the hosts are different).
The problem appears to be that you must only copy over the SourceTag if it's not null. Hence, in the RobotsTxtSitemapExtractor, this was wrong:
but this should be fine:
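As a stand-alone illustration of the failure mode (a minimal sketch, not the Heritrix code itself): ConcurrentHashMap rejects null keys, so a tag that is present but null blows up when it is later used as a key, while copying the tag only when it is non-null avoids that. The class and method names here (SourceTagSketch, copySourceTag, saveSourceStats), the "source" key, and the plain Map standing in for the URI's data are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SourceTagSketch {
    // Stand-in for the per-source statistics map; ConcurrentHashMap rejects null keys.
    static final ConcurrentHashMap<String, Long> statsBySource = new ConcurrentHashMap<>();

    // Hypothetical guard: copy the tag only when the parent actually has a non-null one.
    static void copySourceTag(Map<String, Object> parent, Map<String, Object> child) {
        Object tag = parent.get("source");
        if (tag != null) {
            child.put("source", tag);
        }
    }

    // Mirrors the failing pattern: "key is present" is treated as "value is usable".
    static void saveSourceStats(Map<String, Object> data) {
        if (data.containsKey("source")) {
            String tag = (String) data.get("source");
            Long previous = statsBySource.get(tag); // NullPointerException if tag is null
            statsBySource.put(tag, previous == null ? 1L : previous + 1L);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> parent = new HashMap<>();
        parent.put("source", null); // key is set, but the value is null

        // Unconditional copy: the null value travels with the key.
        Map<String, Object> unguarded = new HashMap<>();
        unguarded.put("source", parent.get("source"));
        try {
            saveSourceStats(unguarded);
        } catch (NullPointerException e) {
            System.out.println("unguarded copy: NPE on null key");
        }

        // Guarded copy: the key is never set, so the stats lookup is skipped.
        Map<String, Object> guarded = new HashMap<>();
        copySourceTag(parent, guarded);
        saveSourceStats(guarded);
        System.out.println("guarded copy: no exception");
    }
}
```

An alternative would be to make the consumer check the value rather than the key's presence; guarding at the copy site simply keeps null tags from spreading to child URIs in the first place.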