ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

DC2021 issues #72

Closed. anjackson closed this issue 1 year ago.

anjackson commented 2 years ago

A number of issues with the DC2021 crawl.

Note that the .uk seeds were accidentally marked as full seeds despite already being in .uk scope, and this is likely part of the problem, as the whole system then has to build and manage a massive augmented seed file.

But there also appear to be issues with H3 that we should try to resolve.

There appear to be problems with cookie expiration: https://github.com/internetarchive/heritrix3/issues/427

Then there are problems related to seed management (too many seeds)...

java.lang.RuntimeException: java.io.IOException: Failed to copy full contents from '/shared/surts.txt' to '/jobs/frequent/20210809181017/surts.txt'
        at org.archive.spring.ConfigFile.obtainReader(ConfigFile.java:60)
        at uk.bl.wap.modules.deciderules.WatchedFileSurtPrefixedDecideRule$WatchedSurtFile.loadFile(WatchedFileSurtPrefixedDecideRule.java:65)
        at uk.bl.wap.util.WatchedFileSource.checkForReload(WatchedFileSource.java:96)
        at uk.bl.wap.util.WatchedFileSource$CacheUpdaterThread.run(WatchedFileSource.java:161)
Caused by: java.io.IOException: Failed to copy full contents from '/shared/surts.txt' to '/jobs/frequent/20210809181017/surts.txt'
        at org.apache.commons.io.FileUtils.doCopyFile(FileUtils.java:1157)
        at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:1091)
        at org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:1013)
        at org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:977)
        at org.archive.spring.ConfigPathConfigurer.snapshotToLaunchDir(ConfigPathConfigurer.java:203)
        at org.archive.spring.ConfigFile.obtainReader(ConfigFile.java:53)
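
For context: commons-io's FileUtils.doCopyFile raises exactly this "Failed to copy full contents" IOException when the destination file's length doesn't match the source's after the copy, which points at a truncated write (e.g. a permissions problem, as suspected below) or the source being rewritten mid-copy. A trivial, runnable sketch of the same length comparison, using the paths from the log:

    import java.io.File;

    // Compare source and snapshot lengths the same way commons-io does after a
    // copy; a mismatch is what triggers "Failed to copy full contents".
    public class CopyLengthCheck {
        public static void main(String[] args) {
            File src = new File("/shared/surts.txt");
            File dst = new File("/jobs/frequent/20210809181017/surts.txt");
            System.out.printf("src=%d bytes, dst=%d bytes, match=%b%n",
                    src.length(), dst.length(), src.length() == dst.length());
        }
    }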

Quite a few of these, which appear to be a problem with how ExtractorXML expects things to work. Perhaps there is no content to get? (A guard sketch follows the trace.)

Aug 10, 2021 4:38:11 PM org.archive.util.Recorder getContentReplayPrefixString
SEVERE: unable to get replay prefix string
java.io.IOException: chunked stream ended unexpectedly
        at org.apache.commons.httpclient.ChunkedInputStream.getChunkSizeFromInputStream(ChunkedInputStream.java:252)
        at org.apache.commons.httpclient.ChunkedInputStream.nextChunk(ChunkedInputStream.java:221)
        at org.apache.commons.httpclient.ChunkedInputStream.read(ChunkedInputStream.java:176)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.Reader.read(Reader.java:140)
        at org.archive.util.Recorder.getContentReplayPrefixString(Recorder.java:513)
        at org.archive.util.Recorder.getContentReplayPrefixString(Recorder.java:500)
        at org.archive.modules.extractor.ExtractorXML.shouldExtract(ExtractorXML.java:84)
        at org.archive.modules.extractor.ContentExtractor.shouldProcess(ContentExtractor.java:84)
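
If the "no content" theory is right, a guard upstream of the probe would silence most of these. A minimal sketch, assuming the usual CrawlURI/Recorder accessors seen in the stack trace, and not the actual Heritrix fix:

    import org.archive.modules.CrawlURI;

    // Skip the XML-prolog sniff when nothing was recorded, so empty responses
    // never reach Recorder.getContentReplayPrefixString and its SEVERE logging.
    // A damaged replay stream (e.g. a truncated chunked body) would still log,
    // but the empty-body case goes quiet.
    public final class SafePrefixProbe {
        public static String prefixOrEmpty(CrawlURI curi, int maxChars) {
            if (curi.getContentLength() <= 0) {
                return ""; // no recorded body, nothing to extract
            }
            return curi.getRecorder().getContentReplayPrefixString(maxChars);
        }
    }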

Lots of these, which are harmless and long-standing, but it is irritating that dead domains are not handled more elegantly...

SEVERE: Problem java.lang.IllegalStateException: got suspicious value 0.0.0.0 for www.absolute-lofts.co.uk occurred when trying to process 'http://www.absolute-lofts.co.uk/' at step ABOUT_TO_BEGIN_PROCESSOR in

java.lang.IllegalStateException: got suspicious value 0.0.0.0 for www.absolute-lofts.co.uk
        at org.archive.modules.fetcher.DefaultServerCache.getHostFor(DefaultServerCache.java:112)
        at org.archive.modules.net.ServerCache.getHostFor(ServerCache.java:64)
        at org.archive.crawler.prefetch.QuotaEnforcer.innerProcessResult(QuotaEnforcer.java:392)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:147)

(registered issue about this here)
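
For the record, the 0.0.0.0 may come from resolvers or blocklists that answer dead or filtered domains with the unspecified address, which DefaultServerCache then rejects as "suspicious". A hypothetical pre-check (not part of Heritrix) that would classify such hosts as plain DNS failures instead:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class DeadHostCheck {
        // Returns true only if the host resolves to a fetchable address;
        // 0.0.0.0 (the "any" address) and unresolvable hosts both fail.
        public static boolean resolvesUsably(String host) {
            try {
                return !InetAddress.getByName(host).isAnyLocalAddress();
            } catch (UnknownHostException e) {
                return false; // genuinely dead domain
            }
        }

        public static void main(String[] args) {
            System.out.println(resolvesUsably("www.absolute-lofts.co.uk"));
        }
    }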

And then the big problem: a good chunk of these...

Aug 10, 2021 8:40:54 PM org.archive.crawler.framework.ToeThread seriousError
SEVERE: Serious error occurred trying to process 'CrawlURI https://www.expressfitautocentres.co.uk/robots.txt PR http://www.expressfitautocentres.co.uk/robots.txt in candidates'
[ToeThread #586: https://www.expressfitautocentres.co.uk/robots.txt
 CrawlURI https://www.expressfitautocentres.co.uk/robots.txt PR http://www.expressfitautocentres.co.uk/robots.txt    0 attempts
    in processor: candidates
    ACTIVE for 18s285ms
    step: ABOUT_TO_BEGIN_PROCESSOR for 16s933ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    java.lang.Thread.getStackTrace(Thread.java:1559)
    org.archive.crawler.framework.ToeThread.reportThread(ToeThread.java:488)
    org.archive.crawler.framework.ToeThread.reportTo(ToeThread.java:460)
    org.archive.util.DevUtils.extraInfo(DevUtils.java:61)
    org.archive.crawler.framework.ToeThread.seriousError(ToeThread.java:271)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:176)
]
           timestamp  discovered      queued   downloaded       doc/s(avg)  KB/s(avg)   dl-failures   busy-thread   mem-use-KB  heap-size-KB   congestion   max-depth   avg-depth
2021-08-10T20:40:54Z    56542561    46839448      8405113     8.95(212.77)  174(7111)       1752180            64      9673929      12582912     2,378.74      502173          20

java.lang.OutOfMemoryError: Java heap space

At which point, all bets are off. There's some downstream grumbling about lock timeouts, but you know, after running out of memory everything is wonky.

I think the OOM stems from the seed problem, but we may as well up the heap allocation anyway.
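
As a side note on upping the heap: the stats line above reports heap-size-KB 12582912, i.e. a 12 GB ceiling. A trivial JDK-only check of what ceiling the JVM actually got after changing -Xmx:

    public class HeapCheck {
        public static void main(String[] args) {
            // Runtime.maxMemory() reflects the effective -Xmx ceiling.
            long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            System.out.println("JVM max heap: " + maxMb + " MB");
        }
    }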

anjackson commented 2 years ago

Hmm, the problem with the surts file was likely a file permissions thing.

anjackson commented 2 years ago

Having restarted with a bit more RAM and with the .uk seeds no longer marked as seeds, the crawl seems to be working much better.

After 18 hours, a quick performance analysis.

Most threads seem to be setting up or using HTTP connections, which is good.

About 80 are waiting for a lock related to queue rotation:

[ToeThread #198: 
 -no CrawlURI- 
    WAITING for 1s461ms
    step: ABOUT_TO_GET_URI for 1s461ms
Java Thread State: BLOCKED
Blocked/Waiting On: java.util.concurrent.ConcurrentSkipListMap@a3a4fdc which is owned by ToeThread #196: (273)
    org.archive.crawler.frontier.WorkQueueFrontier.deactivateQueue(WorkQueueFrontier.java:449)
    org.archive.crawler.frontier.WorkQueueFrontier.reenqueueQueue(WorkQueueFrontier.java:835)
    org.archive.crawler.frontier.WorkQueueFrontier.wakeQueues(WorkQueueFrontier.java:890)
    org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(WorkQueueFrontier.java:583)
    org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:457)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)
]

...where this is the lock-holder, which seems busy with cache/BDB eviction...

[ToeThread #196: 
 -no CrawlURI- 
    WAITING for 1s926ms
    step: ABOUT_TO_GET_URI for 1s926ms
Java Thread State: BLOCKED
Blocked/Waiting On: com.sleepycat.je.evictor.Evictor$LRUList@57a2a736 which is owned by ToeThread #489: http://www.theglovebox.co.uk/(566)
    com.sleepycat.je.evictor.Evictor$LRUList.moveBack(Evictor.java:959)
    com.sleepycat.je.evictor.Evictor.moveBack(Evictor.java:1947)
    com.sleepycat.je.tree.IN.updateLRU(IN.java:645)
    com.sleepycat.je.tree.IN.latch(IN.java:545)
    com.sleepycat.je.tree.Tree.latchChild(Tree.java:358)
    com.sleepycat.je.tree.Tree.getNextIN(Tree.java:1030)
    com.sleepycat.je.tree.Tree.getNextBin(Tree.java:874)
    com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:2624)
    com.sleepycat.je.Cursor.positionAllowPhantoms(Cursor.java:3252)
    com.sleepycat.je.Cursor.positionNoDups(Cursor.java:3165)
    com.sleepycat.je.Cursor.position(Cursor.java:3117)
    com.sleepycat.je.Cursor.getInternal(Cursor.java:1312)
    com.sleepycat.je.Cursor.get(Cursor.java:1233)
    com.sleepycat.util.keyrange.RangeCursor.doGetFirst(RangeCursor.java:1108)
    com.sleepycat.util.keyrange.RangeCursor.getFirst(RangeCursor.java:276)
    com.sleepycat.collections.DataCursor.getFirst(DataCursor.java:471)
    com.sleepycat.collections.StoredSortedMap.getFirstOrLastKey(StoredSortedMap.java:237)
    com.sleepycat.collections.StoredSortedMap.firstKey(StoredSortedMap.java:204)
    org.archive.bdb.StoredQueue.peek(StoredQueue.java:131)
    org.archive.bdb.StoredQueue.poll(StoredQueue.java:137)
    org.archive.bdb.StoredQueue.poll(StoredQueue.java:44)
    org.archive.crawler.frontier.WorkQueueFrontier.activateInactiveQueue(WorkQueueFrontier.java:773)
    org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(WorkQueueFrontier.java:597)
    org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:457)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)
]

Oddly, many threads are waiting on the same lock but report it as owned by different threads. Perhaps the lock is being handed from thread to thread very rapidly while the thread stack reports are being collected for printing.

So, the speed of managing the Frontier queues appears to be the bottleneck, with the global lock on queue rotation somewhat amplifying this effect.
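
To make the amplification concrete, here is an illustrative sketch of the pattern the dumps show, and not the real WorkQueueFrontier code: every rotation (deactivate or activate) synchronizes on the same map of inactive queues, so hundreds of toe threads line up single-file, and any slow work done under the lock (like the BDB-backed StoredQueue.poll in ToeThread #196 above) stalls all of them.

    import java.util.ArrayDeque;
    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentSkipListMap;

    class RotationSketch {
        // Stand-in for the frontier's inactive-queues-by-precedence structure.
        private final ConcurrentSkipListMap<Integer, Queue<String>> inactive =
                new ConcurrentSkipListMap<Integer, Queue<String>>();

        void deactivateQueue(String classKey, int precedence) {
            synchronized (inactive) { // the single global lock from the dumps
                Queue<String> q = inactive.get(precedence);
                if (q == null) {
                    q = new ArrayDeque<String>();
                    inactive.put(precedence, q);
                }
                q.add(classKey);
            }
        }

        String activateInactiveQueue() {
            synchronized (inactive) {
                // In Heritrix the per-precedence queue is disk-backed, so this
                // poll can hit BDB and its evictor while the lock is held.
                Map.Entry<Integer, Queue<String>> first = inactive.firstEntry();
                return first == null ? null : first.getValue().poll();
            }
        }
    }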

anjackson commented 2 years ago

After scaling down (600 → 400 ToeThreads) it seems stable. It was a bit weird for a while, as I accidentally made it re-scan the full seed list, but it has settled down again now. Running okay, probably at roughly two-thirds speed!

Of the roughly 200-250 threads in the candidates phase, 100-150 (of 400) are in socket reads associated with OutbackCDX lookups for the candidates chain. The rest are in BDB (~65, showing some lock contention) or awaiting Kafka (~30).

So, making OCDX faster is something to consider! What speed of disk is it on? Notes imply vanilla gp2 (so roughly 6,000 IOPS?). One option is to upgrade this to a volume with provisioned IOPS.

Then again, the machine is heavily loaded, so maybe that's part of the reason OCDX is not able to respond more quickly?
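
One cheap way to separate the two hypotheses (slow disk vs. loaded machine) is to time raw OCDX lookups directly, off the crawler's hot path. A hedged sketch: host, port and collection name are placeholders, and the query form assumed here is OutbackCDX's standard GET /<collection>?url=... lookup.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class OcdxLatencyProbe {
        public static void main(String[] args) throws Exception {
            String target = URLEncoder.encode("http://www.example.co.uk/", "UTF-8");
            // Placeholder host/port/collection; substitute the real deployment's.
            URL url = new URL("http://ocdx-host:8080/fc?url=" + target);
            long t0 = System.nanoTime();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line = in.readLine(); // first CDX line, if indexed
                long ms = (System.nanoTime() - t0) / 1_000_000;
                System.out.printf("%d ms, hit=%b%n", ms, line != null);
            }
        }
    }

If timings stay high even while the crawler is paused, the disk (or OCDX itself) is the limit; if they drop, it is contention from the co-hosted load.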

anjackson commented 1 year ago

The issues were largely resolved at this point. Notes are held elsewhere.