A new run with more logging, and with crawl stopping disabled, gives some more insight.

The first warning is at 2024-05-10T08:05:46.607Z (this would have stopped the crawler if the image had not been customized to disable stopping the crawl):
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"Disk utilization projected to reach threshold 90% > 90%, stopping","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"kbUsed: 7817108","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"kbTotal: 31457280","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"kbArchiveDirSize: 20337210","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"usedPercentage: 26","details":{}}
{"timestamp":"2024-05-10T08:05:46.608Z","logLevel":"info","context":"general","message":"adjustedUsedPercentage: 25","details":{}}
{"timestamp":"2024-05-10T08:05:46.608Z","logLevel":"info","context":"general","message":"projectedTotal: 28154318","details":{}}
{"timestamp":"2024-05-10T08:05:46.608Z","logLevel":"info","context":"general","message":"projectedUsedPercentage: 90","details":{}}
{"timestamp":"2024-05-10T08:05:46.773Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.bbc.com/persian/iran-61792470"}}
{"timestamp":"2024-05-10T08:05:46.773Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":26873,"total":48242,"pending":1,"failed":5,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-05-10T08:05:46.610Z\",\"extraHops\":0,\"url\":\"https://www.bbc.com/persian/iran-61792470\",\"added\":\"2024-05-08T11:36:18.314Z\",\"depth\":5}"]}}
These values make no sense: kbArchiveDirSize (20337210 kB, ~19.4 GiB) is larger than kbUsed (7817108 kB, ~7.5 GiB), which should be impossible if the archive directory sits on the filesystem being measured.
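That said, the logged numbers do follow one consistent rule: kbUsed + kbArchiveDirSize = 7817108 + 20337210 = 28154318, which is exactly the logged projectedTotal. In other words, the crawler appears to project that finalizing the archive will need roughly another archive-dir worth of space on top of current usage. Below is a minimal sketch of that projection under this assumption; the names and structure are illustrative, not the actual browsertrix-crawler code.

```typescript
// Minimal sketch of the disk-utilization projection, reconstructed from the
// logged values above. Names, structure, and rounding are assumptions; this
// is NOT the actual browsertrix-crawler implementation.

interface DiskStats {
  kbUsed: number;           // used space on the filesystem, as df reports it (kB)
  kbTotal: number;          // total size of the filesystem (kB)
  kbArchiveDirSize: number; // current size of the crawl archive directory (kB)
}

// Projection: assume packaging the archive will temporarily need roughly
// another archive-dir worth of space on top of what is already used.
function projectedUsedPercentage(s: DiskStats): number {
  const projectedTotal = s.kbUsed + s.kbArchiveDirSize;
  return Math.round((projectedTotal / s.kbTotal) * 100);
}

// Values from the 2024-05-10T08:05:46 log entries:
const firstWarning: DiskStats = {
  kbUsed: 7_817_108,            // ~7.5 GiB
  kbTotal: 31_457_280,          // 30 GiB, matching the space given to containers
  kbArchiveDirSize: 20_337_210, // ~19.4 GiB
};

console.log(projectedUsedPercentage(firstWarning)); // 90 -> threshold reached
```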
The final logs are:
{"timestamp":"2024-05-19T16:41:14.458Z","logLevel":"info","context":"general","message":"Disk utilization projected to reach threshold 192% > 90%, stopping","details":{}}
{"timestamp":"2024-05-19T16:41:14.460Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-05-19T16:41:19.987Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":199127,"total":199127,"pending":0,"failed":4647,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-05-19T16:41:19.989Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-05-19T16:41:19.990Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}
The final compressed warc.gz (which should account for the bulk of the disk utilization) is about 47G, i.e. by itself larger than the 30G total that df reports, hence the 192% projection.
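Plugging the final run into the same sketch shows why the projection reads far above 100%: a ~47 GiB archive directory alone is already about 157% of the 30 GiB filesystem, before any other used space is counted (the end-of-crawl kbUsed was not logged, so it is assumed to be 0 here).

```typescript
// Same sketch applied to the final run. Assumed figures: the archive dir is
// approximated by the ~47 GiB warc.gz; kbUsed at the end is unknown, so 0.
const finalRun: DiskStats = {
  kbUsed: 0,
  kbTotal: 31_457_280,          // still the 30 GiB container filesystem
  kbArchiveDirSize: 49_283_072, // ~47 GiB
};

console.log(projectedUsedPercentage(finalRun)); // 157 -- already past 100%
```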
Not sure what is wrong here, but something is ^^
I'm waiting for feedback from the worker owner, who might have more information to share.
Zimit is working correctly; in fact, badger1 had a special FS layout where only 30G was assigned to Docker containers. The 1000G configured in zimfarm.config is an inconsistency. I've opened an issue on Zimfarm (https://github.com/openzim/zimfarm/issues/976) so that we are not misled by such an inconsistency again.
Browsertrix crawler: version 1.0.0-beta.6
This occurred on Zimit 2 but might be unrelated to it, since it could be either a crawler problem or a Docker / Zimfarm issue.
Recipe: https://farm.openzim.org/recipes/bbc.com_persian
Task: https://farm.openzim.org/pipeline/29c24848-9c12-4253-8939-77254b01fdd5
I will investigate a bit before reporting upstream; I first need to confirm that this is not a problem linked to Zimfarm's handling of Docker containers or to our custom image.