ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Odd 304 errors #27

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

We're seeing this in the crawler logs:

fc_heritrix-worker.1.tewka74xob8a@crawler02    | SEVERE: org.archive.crawler.framework.ToeThread recoverableProblem Problem java.lang.NullPointerException occurred when trying to process 'https://www.dailymail.co.uk/reader-comments/p/comment/link/383687131' at step ABOUT_TO_BEGIN_PROCESSOR in
fc_heritrix-worker.1.tewka74xob8a@crawler02    |  [Mon Jan 21 12:26:49 GMT 2019]
fc_heritrix-worker.1.tewka74xob8a@crawler02    | java.lang.NullPointerException
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.recrawl.FetchHistoryProcessor.innerProcess(FetchHistoryProcessor.java:111)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.Processor.process(Processor.java:142)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |

Which is weird, because we've never downloaded these URLs (there's no fetch history), so we shouldn't be making a conditional request and shouldn't see a 304. From the CLI I got a 403 Access Denied for that URL (which works in the browser), so maybe this is a problem with the site's behaviour.

Is it possible that H3 is retrying downloads, and that the HTTP client is persisting some state that gets passed along in subsequent requests, leading to a 304?

AFAIK H3 should not see an HTTP 304, because it doesn't send an If-Modified-Since or If-None-Match header.
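
To make the reasoning above concrete, here is a rough Python sketch of the expectation: validators are only sent when there is a prior fetch, so a 304 with no history is an anomaly to be flagged, not a case where history can be dereferenced. This is not Heritrix code. The function names, the `history` dict and its keys are hypothetical and only illustrate the logic.

```python
from typing import Dict, Optional

# Hypothetical shape of a fetch-history record: {"etag": ..., "last_modified": ...}
History = Optional[Dict[str, str]]


def build_conditional_headers(history: History) -> Dict[str, str]:
    """Return the validators to send; empty when the URL was never fetched."""
    headers: Dict[str, str] = {}
    if history:
        if history.get("etag"):
            headers["If-None-Match"] = history["etag"]
        if history.get("last_modified"):
            headers["If-Modified-Since"] = history["last_modified"]
    return headers


def classify_response(status: int, history: History) -> str:
    """Classify a response without assuming history exists on a 304."""
    if status == 304:
        # A 304 without any prior fetch should not happen for an
        # unconditional request; report it instead of reading history.
        return "unchanged" if history else "unexpected-304"
    return "fetched"


if __name__ == "__main__":
    print(build_conditional_headers(None))
    print(classify_response(304, None))
```

With no history the request carries no validators, so the second call falls into the "unexpected-304" branch, which is the situation the stack trace above seems to hit.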

anjackson commented 5 years ago

Should be resolved via https://github.com/internetarchive/heritrix3/issues/229 and https://github.com/internetarchive/heritrix3/pull/230

anjackson commented 5 years ago

👍