ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Check for problems with metadata WARC records #51

Open anjackson opened 4 years ago

anjackson commented 4 years ago

As a consequence of #49, the following exception was thrown while writing the WARC metadata record:

SEVERE: org.archive.crawler.framework.ToeThread recoverableProblem Problem java.lang.NullPointerException occurred when trying to process 'http://armadillonv.co.uk/robots.txt' at step ABOUT_TO_BEGIN_PROCESSOR in
 [Sun Jul 14 16:28:20 GMT 2019]
java.lang.NullPointerException
        at org.archive.util.anvl.Value.baseCheck(Value.java:44)
        at org.archive.util.anvl.SubElement.<init>(SubElement.java:34)
        at org.archive.util.anvl.Value.<init>(Value.java:40)
        at org.archive.util.anvl.ANVLRecord.addLabelValue(ANVLRecord.java:86)
        at org.archive.modules.writer.WARCWriterProcessor.writeMetadata(WARCWriterProcessor.java:720)
        at org.archive.modules.writer.WARCWriterProcessor.writeHttpRecords(WARCWriterProcessor.java:448)
        at org.archive.modules.writer.WARCWriterProcessor.write(WARCWriterProcessor.java:238)
        at org.archive.modules.writer.WARCWriterProcessor.innerProcessResult(WARCWriterProcessor.java:194)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

We should check whether this caused any significant problems, especially any that go beyond damaging or losing those specific metadata records.
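As a possible first step, the sketch below lists the URLs whose metadata record write failed, by scanning the log for entries shaped like the one above. It is only a rough illustration: the class name `AffectedUrlScanner` and the method `affectedUrls` are hypothetical, and it assumes every affected entry has the same layout as the excerpt (a `SEVERE:` line naming the URL, followed by a stack trace containing a `WARCWriterProcessor.writeMetadata` frame).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix or ukwa-heritrix.
class AffectedUrlScanner {

    // Matches the header line of a recoverable NullPointerException and captures the URL.
    // Assumption: the wording matches the log excerpt quoted in this issue.
    private static final Pattern PROBLEM = Pattern.compile(
        "Problem java\\.lang\\.NullPointerException occurred when trying to process '([^']*)'");

    // Stack frame that identifies the failure as happening during the metadata record write.
    private static final String METADATA_FRAME = "WARCWriterProcessor.writeMetadata(";

    /** Returns the URLs whose NullPointerException stack trace passes through writeMetadata. */
    static List<String> affectedUrls(List<String> logLines) {
        List<String> urls = new ArrayList<>();
        String pendingUrl = null; // URL from the most recent matching header line, if any
        for (String line : logLines) {
            Matcher m = PROBLEM.matcher(line);
            if (m.find()) {
                // Start of a new NullPointerException entry: remember its URL.
                pendingUrl = m.group(1);
            } else if (line.startsWith("SEVERE:")) {
                // A different log entry has started: forget the previous URL.
                pendingUrl = null;
            } else if (pendingUrl != null && line.contains(METADATA_FRAME)) {
                // The trace confirms the failure was in the metadata write.
                urls.add(pendingUrl);
                pendingUrl = null;
            }
        }
        return urls;
    }
}
```

The resulting list could then be compared against the WARC files to see whether only the metadata records for those URLs are missing, or whether the corresponding response records were affected too.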