ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

ClamAV processor should timeout #12

Closed anjackson closed 6 years ago

anjackson commented 6 years ago

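We've seen an issue in production where hanging socket connections interfere with crawl operations. In the thread dump below, a ToeThread has been stuck in the viralContent processor for nearly eight days, blocked on a socket read in ClamdScanner: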

[ToeThread #54: http://www.wymondhamandattleboroughmercury.co.uk/news/greening-wymondham-big-litter-pick-2018-1-5469488?action=login
 CrawlURI http://www.wymondhamandattleboroughmercury.co.uk/news/greening-wymondham-big-litter-pick-2018-1-5469488?action=login LLL http://www.wymondhamandattleboroughmercury.co.uk/news/greening-wymondham-big-litter-pick-2018-1-5469488    0 attempts
    in processor: viralContent
    ACTIVE for 7d21h23m42s277ms
    step: ABOUT_TO_BEGIN_PROCESSOR for 7d21h23m41s42ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:171)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    java.net.SocketInputStream.read(SocketInputStream.java:127)
    uk.bl.wap.util.ClamdScanner.getResponse(ClamdScanner.java:136)
    uk.bl.wap.util.ClamdScanner.clamdSession(ClamdScanner.java:105)
    uk.bl.wap.util.ClamdScanner.clamdScan(ClamdScanner.java:51)
    uk.bl.wap.crawler.processor.ViralContentProcessor.innerProcess(ViralContentProcessor.java:88)
    org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
]

We should check that the ViralContentProcessor times out after a reasonable period (a few minutes).

anjackson commented 6 years ago

There is already a clamdTimeout setting to control this, but it defaults to 0 (no timeout). Therefore this is a crawler beans configuration issue.
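
For illustration, here is a minimal Java sketch of why a timeout of 0 can hang a thread indefinitely. It assumes clamdTimeout ends up as a plain socket read timeout (SO_TIMEOUT), which this thread does not confirm. The ReadTimeoutSketch class and readTimesOut method are hypothetical names and are not part of ukwa-heritrix. In java.net.Socket, setSoTimeout(0) means an infinite timeout, which would match the socketRead0 frame in the dump above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only: not the ClamdScanner implementation.
public class ReadTimeoutSketch {

    /**
     * Connects to a local server socket that never sends any data (standing in
     * for an unresponsive clamd), then attempts a read with the given SO_TIMEOUT.
     * Returns true if the read was abandoned with a SocketTimeoutException.
     * Do not pass 0 here: that means "no timeout" and the read would block forever.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            // Bound how long any single read() on this socket may block.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();      // the server never writes, so this blocks
                return false;   // reached only if data or end-of-stream arrived
            } catch (SocketTimeoutException e) {
                return true;    // the read gave up after timeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

With a non-zero timeout the blocked read fails with a SocketTimeoutException, which the caller can handle, rather than holding the thread indefinitely.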

anjackson commented 6 years ago

Running this Groovy script in an H3 console works, but the change still needs deploying to the configuration files.

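// Print the current timeout (0 means no timeout)
rawOut.println(appCtx.getBean("viralContent").getClamdTimeout())
// Set the timeout to 60*1000, i.e. 60 seconds assuming the value is in milliseconds
appCtx.getBean("viralContent").setClamdTimeout(60*1000)
// Print it again to confirm the new value
rawOut.println(appCtx.getBean("viralContent").getClamdTimeout())

anjackson commented 6 years ago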

Okay, config updated.