openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0

High crawling duration variability #214

Open benoit74 opened 10 months ago

benoit74 commented 10 months ago

We can observe very high variability in crawling duration on the dp.la_en_all recipe. All tasks use the same image ("ghcr.io/openzim/zimit:1.5.0") and were launched at almost the same time (due to a zimfarm bug).

| Task ID | Worker | Start Time | Duration | Comment | Logs |
|---|---|---|---|---|---|
| 06dd0c9e-d06d-474b-beda-36cc99281cd7 | michaelblob | 27 August 2023 at 23:02 CEST | 1 hour, 30 minutes | | Logs |
| eaff89fa-0891-438c-b5cd-cac32a847845 | michaelblob | 27 August 2023 at 22:02 CEST | 2 hours, 50 minutes | | Logs |
| 653c7019-e5b9-407c-9107-0d3bb641be68 | michaelblob | 27 August 2023 at 21:02 CEST | 4 hours | | Logs |
| 2642a0ad-3a15-4e08-8e43-d9a2d6a43d32 | michaelblob | 28 August 2023 at 02:02 CEST | 2 hours, 50 minutes | | Logs |
| a6d210c8-6d3b-4548-bd48-7954e2747253 | athena18 | 28 August 2023 at 00:00 CEST | 10 hours, 50 minutes | was still processing, cancelled manually | Logs |

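
To put a number on the variability reported in the task list above, here is a minimal Python sketch. The durations are transcribed by hand from that list and converted to minutes. The shortened task IDs and the helper name `spread` are illustrative assumptions, not part of zimit or zimfarm. The cancelled athena18 task is treated only as a lower bound, since it never finished.

```python
from statistics import mean, pstdev

# Durations in minutes, transcribed by hand from the task list above.
# Keys are the first 8 characters of each task ID (shortened for readability).
completed = {
    "06dd0c9e": 90,   # 1 hour, 30 minutes
    "eaff89fa": 170,  # 2 hours, 50 minutes
    "653c7019": 240,  # 4 hours
    "2642a0ad": 170,  # 2 hours, 50 minutes
}

# a6d210c8 (athena18) was cancelled after 10 hours, 50 minutes while still
# processing, so its real duration is unknown: this is only a lower bound.
cancelled_lower_bound = 650


def spread(minutes):
    """Hypothetical helper: return (mean, max/min ratio, coefficient of variation)."""
    values = list(minutes)
    avg = mean(values)
    return avg, max(values) / min(values), pstdev(values) / avg


avg, ratio, cv = spread(completed.values())
lower_bound_ratio = cancelled_lower_bound / min(completed.values())

print(f"completed tasks: mean {avg:.1f} min, max/min {ratio:.2f}x, CV {cv:.2f}")
print(f"counting the cancelled task: max/min is at least {lower_bound_ratio:.2f}x")
```

The four completed tasks all ran on the same worker (michaelblob), so the spread among them cannot be attributed to worker differences alone.
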
rgaudin commented 10 months ago

Ah, I saw the logs appear as I was downloading them! 😉 But there is no log for cancelled tasks.

benoit74 commented 10 months ago

Are those logs purged from S3 after a while?

rgaudin commented 10 months ago

Background information: athena18 is a large but dated worker on a somewhat slow, high-latency network that is subject to variations in its environment (it's on a university network).

rgaudin commented 10 months ago

> Are those logs purged from S3 after a while?

2 months

https://github.com/kiwix/k8s/blob/main/zimfarm/api/api-configs.cm.yaml#L22

benoit74 commented 1 week ago

I don't know (yet?) what we can do about this issue.