ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

NPE in StatisticsTracker because of `null` `SourceTag` #49

anjackson closed this issue 4 years ago

anjackson commented 4 years ago

Just had a large crawl die horribly because of:

SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #989: dns:007bond.co.uk [Fri Jul 12 20:13:38 GMT 2019]
java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
        at org.archive.crawler.reporting.StatisticsTracker.saveSourceStats(StatisticsTracker.java:767)
        at org.archive.crawler.reporting.StatisticsTracker.crawledURISuccessful(StatisticsTracker.java:760)
        at org.archive.crawler.reporting.StatisticsTracker.onApplicationEvent(StatisticsTracker.java:986)
        at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:97)
        at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:303)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:977)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:576)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

This happened many times and killed all the ToeThreads. The cause is in this code:

https://github.com/internetarchive/heritrix3/blob/aa705bef2eb8fbbb9aeb56995e713a7b6ba0ed00/engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java#L759-L760

If the A_SOURCE_TAG is set but its value is null, the code later uses that null as a map key, and ConcurrentHashMap throws an NPE on a null key.
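
The null-key behaviour can be reproduced in isolation. This is a minimal sketch, not Heritrix code: the class name `NullKeyDemo`, the helper `lookupThrows` and the key `"seed-a"` are made up for illustration. It relies only on the documented fact that `ConcurrentHashMap` does not permit null keys.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class, not part of Heritrix.
class NullKeyDemo {
    // Returns true if looking up the given key throws a NullPointerException.
    static boolean lookupThrows(String key) {
        ConcurrentHashMap<String, Long> stats = new ConcurrentHashMap<>();
        stats.put("seed-a", 1L);
        try {
            stats.get(key); // ConcurrentHashMap rejects null keys, even on get()
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupThrows("seed-a")); // prints false
        System.out.println(lookupThrows(null));     // prints true
    }
}
```

A plain `HashMap` would return null here instead of throwing, which is why the same pattern can go unnoticed with other map types.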

Looking at what happened, this appears to arise for DNS records of URLs discovered via robots.txt, e.g. these events (which appear to have occurred out of order, as the hosts are different):

2019-07-12T20:13:35.440Z     1         63 dns:007bond.co.uk LP http://007bond.co.uk/sitemap.xml text/dns #989 20190712201335399+38 sha1:ISZ7R2PFKOMBUTNRCOCJPFNKCYBM54HB - - {"warcFilename":"BL-NPLD-20190712194016457-10688-71~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":660940299,"scopeDecision":"ACCEPT by rule #14 PrerequisiteAcceptDecideRule","warcFileRecordLength":243}
2019-07-12T20:13:35.305Z   200        200 http://www.007bond.co.uk/robots.txt P http://www.007bond.co.uk/ text/plain #233 20190712201335014+34 sha1:OOJFFALMLEBE7RK6362FV6YQHHYRKHYJ - ip:77.104.133.250 {"contentSize":619,"warcFilename":"BL-NPLD-20190712195333571-10698-71~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":516684741,"scopeDecision":"ACCEPT by rule #14 PrerequisiteAcceptDecideRule","warcFileRecordLength":2429}

The fix appears to be to copy the SourceTag over only if it is not null. Hence, in the RobotsTxtSitemapExtractor, this was wrong:

            curiClone.setSourceTag(curi.getSourceTag());

but this should be fine:

            if (curi.getSourceTag() != null) {
                curiClone.setSourceTag(curi.getSourceTag());
            }
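
To show why the guard matters, here is a self-contained sketch of the failure and the fix. Everything in it is a stand-in: `Uri`, `SOURCE_TAG_KEY`, `cloneUnguarded`, `cloneGuarded` and `recordStats` are hypothetical names, and the sketch assumes (as described above) that setting a null tag still leaves the key present in the URI's data map, so a "key is present" check passes and the null value reaches the `ConcurrentHashMap`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins; this is not the Heritrix CrawlURI or StatisticsTracker.
class SourceTagGuardDemo {
    static final String SOURCE_TAG_KEY = "source"; // assumed key name

    static class Uri {
        final Map<String, Object> data = new HashMap<>();
        // Stores the tag even when it is null, so the key becomes "set".
        void setSourceTag(String tag) { data.put(SOURCE_TAG_KEY, tag); }
        String getSourceTag() { return (String) data.get(SOURCE_TAG_KEY); }
    }

    // The original pattern: always copies the tag, null or not.
    static Uri cloneUnguarded(Uri parent) {
        Uri clone = new Uri();
        clone.setSourceTag(parent.getSourceTag());
        return clone;
    }

    // The fixed pattern: copies the tag only when there is one.
    static Uri cloneGuarded(Uri parent) {
        Uri clone = new Uri();
        if (parent.getSourceTag() != null) {
            clone.setSourceTag(parent.getSourceTag());
        }
        return clone;
    }

    // Counts URIs per source tag; checks only that the key is present.
    // Returns true if a count was recorded.
    static boolean recordStats(Uri uri, ConcurrentHashMap<String, Long> stats) {
        if (!uri.data.containsKey(SOURCE_TAG_KEY)) {
            return false;
        }
        String tag = uri.getSourceTag();
        Long count = stats.get(tag); // throws NullPointerException if tag is null
        stats.put(tag, count == null ? 1L : count + 1L);
        return true;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Long> stats = new ConcurrentHashMap<>();
        Uri untagged = new Uri();
        System.out.println(recordStats(cloneGuarded(untagged), stats)); // prints false
        try {
            recordStats(cloneUnguarded(untagged), stats);
        } catch (NullPointerException e) {
            System.out.println("NPE from unguarded clone");
        }
    }
}
```

An alternative would be to make the consumer tolerate a null value, but guarding at the point of copying keeps the "key present implies non-null value" invariant that the consumer appears to rely on.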