openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

Browsertrix Crawler is stopping on disk full while it is not full #290

Closed benoit74 closed 4 months ago

benoit74 commented 6 months ago

Browsertrix crawler: version 1.0.0-beta.6

This occurred on Zimit 2 but might be unrelated to it, since it could be either a crawler problem or a Docker / Zimfarm issue.

Recipe: https://farm.openzim.org/recipes/bbc.com_persian
Task: https://farm.openzim.org/pipeline/29c24848-9c12-4253-8939-77254b01fdd5


{"timestamp":"2024-03-21T11:46:53.570Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.bbc.com/persian/articles/c9rw9yr764eo","workerid":0}}
{"timestamp":"2024-03-21T11:46:53.578Z","logLevel":"info","context":"general","message":"Disk utilization projected to reach threshold 90% > 90%, stopping","details":{}}
{"timestamp":"2024-03-21T11:46:53.578Z","logLevel":"info","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}
{"timestamp":"2024-03-21T11:46:53.578Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-03-21T11:46:53.852Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmp9jqp9697/collections/crawl-20240319073437143/crawls/crawl-20240321114653-a14d9f23d744.yaml","details":{}}
{"timestamp":"2024-03-21T11:46:53.864Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":22833,"total":43845,"pending":0,"failed":4,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-03-21T11:46:53.865Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-03-21T11:46:53.866Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
[zimit::2024-03-21 11:46:53,893] INFO:crawl interupted by a limit

I will investigate a bit before reporting upstream. I first need to confirm that this is not a problem with how the Zimfarm handles Docker containers or with our custom image.

benoit74 commented 4 months ago

A new run with more logs, and with stopping of the crawl disabled, gives some more insight.

The first warning is at 2024-05-10T08:05:46.607Z (this would have stopped the crawler if the image had not been customized to disable stopping the crawl):

{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"Disk utilization projected to reach threshold 90% > 90%, stopping","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"kbUsed: 7817108","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"kbTotal: 31457280","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"kbArchiveDirSize: 20337210","details":{}}
{"timestamp":"2024-05-10T08:05:46.607Z","logLevel":"info","context":"general","message":"usedPercentage: 26","details":{}}
{"timestamp":"2024-05-10T08:05:46.608Z","logLevel":"info","context":"general","message":"adjustedUsedPercentage: 25","details":{}}
{"timestamp":"2024-05-10T08:05:46.608Z","logLevel":"info","context":"general","message":"projectedTotal: 28154318","details":{}}
{"timestamp":"2024-05-10T08:05:46.608Z","logLevel":"info","context":"general","message":"projectedUsedPercentage: 90","details":{}}
{"timestamp":"2024-05-10T08:05:46.773Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.bbc.com/persian/iran-61792470"}}
{"timestamp":"2024-05-10T08:05:46.773Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":26873,"total":48242,"pending":1,"failed":5,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-05-10T08:05:46.610Z\",\"extraHops\":0,\"url\":\"https://www.bbc.com/persian/iran-61792470\",\"added\":\"2024-05-08T11:36:18.314Z\",\"depth\":5}"]}}

These values make no sense:
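The logged numbers can at least be reproduced arithmetically. The sketch below is inferred only from the values in the log above, not from the crawler source: it assumes the projection is `kbUsed + kbArchiveDirSize` divided by `kbTotal`, rounded to the nearest percent. The function names are hypothetical.

```python
import math


def round_half_up(value: float) -> int:
    """Round to the nearest integer, with halves rounding up."""
    return math.floor(value + 0.5)


def projected_usage(kb_used: int, kb_archive_dir_size: int, kb_total: int):
    """Return (projected_total_kb, projected_used_percentage).

    Assumption: the projection adds the archive directory size on top
    of the space already used, then expresses it as a share of the total.
    """
    projected_total = kb_used + kb_archive_dir_size
    percentage = round_half_up(projected_total / kb_total * 100)
    return projected_total, percentage


# Values copied from the 2024-05-10T08:05:46 log lines.
projected_total, projected_pct = projected_usage(7817108, 20337210, 31457280)
print(projected_total)  # 28154318, matching the logged projectedTotal
print(projected_pct)    # 90, matching the logged projectedUsedPercentage
```

Under this assumption, the archive directory size is counted on top of the used space, which is how roughly 25% actual usage becomes a 90% projection. Note also that the reported archive directory size is larger than the reported used space, which suggests the two figures were not measured on the same filesystem.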

The final logs are:

{"timestamp":"2024-05-19T16:41:14.458Z","logLevel":"info","context":"general","message":"Disk utilization projected to reach threshold 192% > 90%, stopping","details":{}}
{"timestamp":"2024-05-19T16:41:14.460Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-05-19T16:41:19.987Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":199127,"total":199127,"pending":0,"failed":4647,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-05-19T16:41:19.989Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-05-19T16:41:19.990Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}
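To follow how the projection evolves over a long run, the disk warnings can be pulled out of the crawler output. This is a minimal sketch that assumes the log format shown above (one JSON object per line, mixed with plain-text zimit lines); the function name and the regular expression are illustrative, not part of Zimit or Browsertrix Crawler.

```python
import json
import re

# Matches the message text seen in the logs above.
DISK_MSG = re.compile(
    r"Disk utilization projected to reach threshold (\d+)% > (\d+)%"
)


def disk_warnings(lines):
    """Yield (timestamp, projected_pct, threshold_pct) for each disk warning."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON lines such as "[zimit::...] INFO:..."
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        match = DISK_MSG.search(entry.get("message", ""))
        if match:
            yield entry["timestamp"], int(match.group(1)), int(match.group(2))


sample = [
    '{"timestamp":"2024-05-19T16:41:14.458Z","logLevel":"info","context":"general","message":"Disk utilization projected to reach threshold 192% > 90%, stopping","details":{}}',
    '{"timestamp":"2024-05-19T16:41:19.989Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}',
    "[zimit::2024-03-21 11:46:53,893] INFO:crawl interupted by a limit",
]
for warning in disk_warnings(sample):
    print(warning)
```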

The final compressed warc.gz (which is supposed to account for the bulk of disk utilization) is about 47G.

Not sure what is wrong here, but something is ^^

I'm waiting for feedback from the worker owner, who might have some more information to share.

benoit74 commented 4 months ago

Zimit is working correctly. In fact, badger1 had a special FS layout where only 30G was assigned to Docker containers, so the fact that 1000G is configured is an inconsistency. I've opened an issue on Zimfarm (https://github.com/openzim/zimfarm/issues/976) so that we are not misled again by an inconsistency in zimfarm.config.
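The 30G figure is consistent with the `kbTotal: 31457280` value logged earlier, if that value is in KiB. The sketch below checks the conversion and shows one hypothetical way to flag such a mismatch between a configured size and what a filesystem actually reports. The helper names and the 5% tolerance are assumptions for illustration and are not part of Zimfarm.

```python
import shutil

KIB_PER_GIB = 1024 * 1024


def kib_to_gib(kib: int) -> float:
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB


def visible_disk_gib(path: str = "/") -> float:
    """Total size, in GiB, of the filesystem holding `path`."""
    return shutil.disk_usage(path).total / 1024**3


def size_mismatch(configured_gib: float, visible_gib: float,
                  tolerance: float = 0.05) -> bool:
    """True if the visible size differs from the configured one by more
    than `tolerance` (as a fraction of the configured size)."""
    return abs(configured_gib - visible_gib) > tolerance * configured_gib


print(kib_to_gib(31457280))                       # 30.0
print(size_mismatch(1000, kib_to_gib(31457280)))  # True
```

Running `visible_disk_gib` on the crawler output directory from inside the container would show the size the crawler actually has to work with, as opposed to the configured value.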