Closed anjackson closed 1 year ago
Hmm, the problem with the SURTs file was likely a file permissions issue.
Having restarted with a bit more RAM, and with the .uk
seeds no longer marked as seeds, the crawl seems to be working much better.
After 18 hours, a quick performance analysis.
Most threads seem to be setting up or using HTTP connections, which is good.
About 80 are waiting for a lock related to queue rotation:
```
[ToeThread #198:
 -no CrawlURI-
 WAITING for 1s461ms
 step: ABOUT_TO_GET_URI for 1s461ms
 Java Thread State: BLOCKED
 Blocked/Waiting On: java.util.concurrent.ConcurrentSkipListMap@a3a4fdc which is owned by ToeThread #196: (273)
 org.archive.crawler.frontier.WorkQueueFrontier.deactivateQueue(WorkQueueFrontier.java:449)
 org.archive.crawler.frontier.WorkQueueFrontier.reenqueueQueue(WorkQueueFrontier.java:835)
 org.archive.crawler.frontier.WorkQueueFrontier.wakeQueues(WorkQueueFrontier.java:890)
 org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(WorkQueueFrontier.java:583)
 org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:457)
 org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)
]
```
...where this is the lock-holder, which seems busy with cache/BDB eviction...
```
[ToeThread #196:
 -no CrawlURI-
 WAITING for 1s926ms
 step: ABOUT_TO_GET_URI for 1s926ms
 Java Thread State: BLOCKED
 Blocked/Waiting On: com.sleepycat.je.evictor.Evictor$LRUList@57a2a736 which is owned by ToeThread #489: http://www.theglovebox.co.uk/(566)
 com.sleepycat.je.evictor.Evictor$LRUList.moveBack(Evictor.java:959)
 com.sleepycat.je.evictor.Evictor.moveBack(Evictor.java:1947)
 com.sleepycat.je.tree.IN.updateLRU(IN.java:645)
 com.sleepycat.je.tree.IN.latch(IN.java:545)
 com.sleepycat.je.tree.Tree.latchChild(Tree.java:358)
 com.sleepycat.je.tree.Tree.getNextIN(Tree.java:1030)
 com.sleepycat.je.tree.Tree.getNextBin(Tree.java:874)
 com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:2624)
 com.sleepycat.je.Cursor.positionAllowPhantoms(Cursor.java:3252)
 com.sleepycat.je.Cursor.positionNoDups(Cursor.java:3165)
 com.sleepycat.je.Cursor.position(Cursor.java:3117)
 com.sleepycat.je.Cursor.getInternal(Cursor.java:1312)
 com.sleepycat.je.Cursor.get(Cursor.java:1233)
 com.sleepycat.util.keyrange.RangeCursor.doGetFirst(RangeCursor.java:1108)
 com.sleepycat.util.keyrange.RangeCursor.getFirst(RangeCursor.java:276)
 com.sleepycat.collections.DataCursor.getFirst(DataCursor.java:471)
 com.sleepycat.collections.StoredSortedMap.getFirstOrLastKey(StoredSortedMap.java:237)
 com.sleepycat.collections.StoredSortedMap.firstKey(StoredSortedMap.java:204)
 org.archive.bdb.StoredQueue.peek(StoredQueue.java:131)
 org.archive.bdb.StoredQueue.poll(StoredQueue.java:137)
 org.archive.bdb.StoredQueue.poll(StoredQueue.java:44)
 org.archive.crawler.frontier.WorkQueueFrontier.activateInactiveQueue(WorkQueueFrontier.java:773)
 org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(WorkQueueFrontier.java:597)
 org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:457)
 org.archive.crawler.framework.ToeThread.run(ToeThread.java:134)
]
```
Oddly, many threads are waiting on the same lock but report it as owned by different threads. Presumably the lock is being handed from thread to thread very rapidly while the thread stack report is being collected for printing.
So the speed of managing the Frontier queues appears to be the bottleneck, with the global lock on queue rotation amplifying the effect.
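The effect of that global rotation lock can be sketched in isolation. This is an illustrative model, not Heritrix's actual frontier code: many "toe threads" all funnel through one `synchronized` block around a shared `ConcurrentSkipListMap` to rotate queues, so adding threads mostly adds waiters rather than throughput.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the contention pattern seen above: every
// rotation (deactivate one queue, activate another) takes the same
// monitor, so N threads serialize behind it.
public class QueueRotationContention {
    private final ConcurrentSkipListMap<Long, String> inactiveQueues =
            new ConcurrentSkipListMap<>();
    private final AtomicLong rotations = new AtomicLong();

    void rotate(long key, String queue) {
        synchronized (inactiveQueues) {        // the single global lock
            inactiveQueues.put(key, queue);    // "deactivate" a queue
            inactiveQueues.pollFirstEntry();   // "activate" the next one
            rotations.incrementAndGet();
        }
    }

    public static void main(String[] args) throws Exception {
        QueueRotationContention f = new QueueRotationContention();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int t = 0; t < 8; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++) f.rotate(id * 1000L + i, "q" + id);
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println("rotations=" + f.rotations.get()); // 8000
    }
}
```

A thread dump of this program while it runs shows the same signature as above: one thread RUNNABLE inside the monitor and the rest BLOCKED on it, with the reported owner changing between samples.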
After scaling down (600 → 400 ToeThreads) it seems stable. It was a bit weird for a while because I accidentally made it re-scan the full seed list, but it has settled down again now. Running okay, at probably roughly two-thirds speed!
Of the roughly 200-250 threads in the candidates
phase, 100-150 of the 400 are in socket reads associated with OutbackCDX lookups for the candidates chain. The rest are in BDB (~65, showing some lock contention) or awaiting Kafka (~30).
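A breakdown like this can also be collected in-process rather than from a full thread dump. A minimal sketch using the standard `Thread.getAllStackTraces()` API (socket reads show up as RUNNABLE; lock contention as BLOCKED or WAITING):

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch: count live threads by state, the same census as above.
public class ThreadCensus {
    static Map<Thread.State, Integer> census() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The calling thread is always RUNNABLE, so the map is never empty.
        System.out.println(census());
    }
}
```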
So, making OutbackCDX faster is something to consider! What speed disk is it on? Notes imply vanilla gp2
(and so ~6000 IOPS?). One option is to upgrade this to a volume type with provisioned IOPS.
Although the machine is heavily loaded, so maybe that's part of the reason OCDX is not able to respond more quickly?
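For reference, gp2 baseline performance scales at 3 IOPS per GiB (minimum 100, capped at 16,000), so ~6000 IOPS would correspond to a ~2000 GiB volume. A quick check of that arithmetic:

```java
// gp2 baseline IOPS rule: 3 IOPS per GiB, min 100, max 16000.
public class Gp2Iops {
    static long baselineIops(long gib) {
        return Math.min(16000, Math.max(100, 3 * gib));
    }

    public static void main(String[] args) {
        System.out.println(baselineIops(2000)); // a 2000 GiB volume
    }
}
```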
The issues were largely resolved at this point. Notes are held elsewhere.
A number of issues with the DC2021 crawl.
Note that the .uk seeds were accidentally marked as full seeds despite already being in .uk scope, and this is likely part of the problem, as the whole system then has to build and manage a massive augmented seed file.
But there also appear to be issues with H3 we should try to resolve.
There appear to be problems with cookie expiration: https://github.com/internetarchive/heritrix3/issues/427
Then there are problems related to seed management (too many seeds)...
Quite a few of these, which appear to be a problem with how ExtractorXML expects things to work; perhaps there is no content to get?
Lots of these, which are harmless and long-standing, but it is irritating that dead domains are not handled more elegantly...
(registered issue about this here)
And then the big problem: a good chunk of these...
At which point, all bets are off. There's some downstream grumbling about lock timeouts, but after an OutOfMemoryError everything is wonky.
I think the OOM stems from the seed problem, but we may as well increase the heap allocation anyway.
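When raising the heap (e.g. via `-Xmx` in whatever launch script sets the Heritrix JVM options; the exact mechanism here is an assumption), it's worth verifying the JVM actually picked the new ceiling up:

```java
// Print the heap ceiling the running JVM was actually given
// (reflects -Xmx, or the JVM's default if none was set).
public class HeapCheck {
    public static void main(String[] args) {
        long maxMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap MiB: " + maxMiB);
    }
}
```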