ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Crawler pauses when processing large numbers of candidate URLs #54


anjackson commented 4 years ago

The whole disposition process is currently synchronised, and locks the frontier while the candidate URLs from a given CrawlURI are processed, i.e. we see lots of toe threads stuck like this:

Java Thread State: WAITING
Blocked/Waiting On: java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@cb14bb8 which is owned by org.archive.crawler.frontier.BdbFrontier@275e60e2.managerThread(99)
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lockInterruptibly(ReentrantReadWriteLock.java:772)
    org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:455)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)

...while there's a toe thread that has been ACTIVE for 32m59s225ms, at step ABOUT_TO_BEGIN_PROCESSOR for 32m51s640ms:

Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    java.util.regex.Pattern$CharProperty.match(Pattern.java:3790)
    java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
    java.util.regex.Pattern$Curly.match(Pattern.java:4248)
    java.util.regex.Pattern$Begin.match(Pattern.java:3539)
    java.util.regex.Matcher.match(Matcher.java:1270)
    java.util.regex.Matcher.matches(Matcher.java:604)
    org.archive.modules.deciderules.MatchesListRegexDecideRule.evaluate(MatchesListRegexDecideRule.java:94)
    org.archive.modules.deciderules.PredicatedDecideRule.innerDecide(PredicatedDecideRule.java:48)
    org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
    org.archive.modules.deciderules.DecideRuleSequence.innerDecide(DecideRuleSequence.java:113)
    org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
    org.archive.crawler.framework.Scoper.isInScope(Scoper.java:107)
    org.archive.crawler.prefetch.CandidateScoper.innerProcessResult(CandidateScoper.java:45)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:176)
    org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
    org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)
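
In other words, the contention reduces to a fair ReentrantReadWriteLock whose write lock is held for the whole candidate-processing loop, so every reader parks until it is released. A minimal standalone sketch of that pattern (illustrative only, not Heritrix code; names and timings are made up):

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative only: a long-held write lock parks every reader on a
    // fair ReentrantReadWriteLock, matching the WAITING state above.
    public class FrontierLockDemo {

        // Fair ordering, like the FairSync in the thread report.
        static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

        public static void main(String[] args) throws InterruptedException {
            Thread manager = new Thread(() -> {
                lock.writeLock().lock();
                try {
                    // Stands in for processing tens of thousands of sitemap
                    // candidates while holding the frontier lock.
                    Thread.sleep(30_000);
                } catch (InterruptedException ignored) {
                } finally {
                    lock.writeLock().unlock();
                }
            }, "managerThread");
            manager.start();

            Thread.sleep(100); // let the manager acquire the write lock first

            Thread toe = new Thread(() -> {
                try {
                    // Stands in for a toe thread asking the frontier for its
                    // next URI: this parks until the write lock is released.
                    lock.readLock().lockInterruptibly();
                    System.out.println("got next URI after the stall");
                    lock.readLock().unlock();
                } catch (InterruptedException ignored) {
                }
            }, "ToeThread #1");
            toe.start();
        }
    }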



Because the current sitemap extraction deliberately avoids capping the number of outlinks (for completeness), very large sitemaps lead to long stalls (e.g. >30 mins) while all the candidates are processed. This is made worse by the fact that we need to refer to OutbackCDX to check whether we need to revisit each URL, which adds a network round trip per candidate. As a rough illustration, at ~40ms per candidate, a sitemap yielding 50,000 URLs would hold up the frontier for over half an hour, consistent with the thread report above.

We could consider capping the number of outlinks taken from each sitemap, but use reservoir sampling so that we get a different random subset on each visit?
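
A minimal sketch of that idea (class and method names are hypothetical, not wired into the actual sitemap extractor): standard reservoir sampling (Algorithm R) keeps at most cap links in a single streaming pass over the extracted outlinks, so memory and candidate-processing time stay bounded regardless of sitemap size:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical sketch: cap sitemap outlinks via reservoir sampling
    // (Algorithm R), yielding a uniform random subset of at most 'cap'
    // links in one pass over the stream.
    public class OutlinkSampler {

        public static <T> List<T> sample(Iterable<T> outlinks, int cap) {
            List<T> reservoir = new ArrayList<>(cap);
            ThreadLocalRandom rnd = ThreadLocalRandom.current();
            int seen = 0;
            for (T link : outlinks) {
                seen++;
                if (reservoir.size() < cap) {
                    reservoir.add(link);       // fill the reservoir first
                } else {
                    // Replace an existing entry with probability cap/seen,
                    // keeping every item equally likely to survive.
                    int j = rnd.nextInt(seen); // uniform in [0, seen)
                    if (j < cap) {
                        reservoir.set(j, link);
                    }
                }
            }
            return reservoir;
        }
    }

Since every link survives with equal probability cap/n, repeated visits to the same sitemap draw different subsets, so over successive crawls the whole sitemap is still covered in expectation while each individual visit stays cheap.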