Closed eu9ene closed 3 months ago
@bhearsum I'm not sure what the reason for the restart is. Can it be 8 preemptions in a row?
These are CLAIM_EXPIRED, which is more suggestive of OOM. Nonetheless, I looked at each run for preemptions.
These were preempted: 1 (at 2024-07-20T23:00:40.214910907Z), 6 (at 2024-07-22T18:28:01.554965507Z), 8 (at 2024-07-22T22:55:18.320283407Z)
These were not: 0, 2, 3, 4, 5, 7, 9, 10, 11
That snippet from run 8 is interesting - it seems to show the task completing more than 2 hours prior to the preemption.
I found this in our papertrail logs, which seems to be the cause of the internal error:
Jul 22 21:36:32Z [translations-1-b-linux-large-gcp-300gb-eqmp3ottractz8kwy1l5aq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-eqmp3ottractz8kwy1l5aq/events?focus=1752157786650869770&selected=1752157786650869770) [docker-worker](https://my.papertrailapp.com/groups/1141234/events?q=program%3Adocker-worker&focus=1752157786650869770&selected=1752157786650869770) 2024/07/22 21:36:32 {"type":"error reclaiming task","source":"top","provisionerId":"translations-1","workerId":"7915041610498507774","workerGroup":"us-central1-f","workerType":"b-linux-large-gcp-300gb","workerNodeType":"projects/887720501152/machineTypes/n2-highmem-32","error":"Could not reclaim task. Error: Timeout of 30000ms exceeded\n at Request._timeoutError (/home/ubuntu/docker-worker/node_modules/superagent/src/request-base.js:722:15)\n at Timeout.<anonymous> (/home/ubuntu/docker-worker/node_modules/superagent/src/request-base.js:738:12)\n at listOnTimeout (internal/timers.js:554:17)\n at processTimers (internal/timers.js:497:7)","primaryTaskId":"Jjou5vQvTHigEP6NtrtnkQ","primaryRunId":8,"taskId":"Jjou5vQvTHigEP6NtrtnkQ","runId":8,"takenUntil":"2024-07-22T21:43:27.043Z"}
I see the same in run 11. I'm looking into this / asking about it further.
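To relate the failed reclaim to the CLAIM_EXPIRED resolution, here is a rough sketch that reads the `takenUntil` field from the run 8 log line and compares it with the time the error was logged. The JSON below is an abbreviated copy of that payload (the stack trace is shortened), the helper name is made up for illustration, and it assumes the worker's `2024/07/22 21:36:32` timestamp is UTC, as the Papertrail prefix suggests.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the JSON payload from the run 8 log line above;
# the long "error" stack trace is shortened to its first line.
payload = json.loads(
    '{"type":"error reclaiming task",'
    '"error":"Could not reclaim task. Error: Timeout of 30000ms exceeded",'
    '"taskId":"Jjou5vQvTHigEP6NtrtnkQ","runId":8,'
    '"takenUntil":"2024-07-22T21:43:27.043Z"}'
)


def seconds_until_claim_expiry(logged_at: str, taken_until: str) -> float:
    """Seconds between the log timestamp (assumed UTC) and takenUntil."""
    logged = datetime.strptime(logged_at, "%Y/%m/%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    expiry = datetime.strptime(taken_until, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    return (expiry - logged).total_seconds()


remaining = seconds_until_claim_expiry("2024/07/22 21:36:32", payload["takenUntil"])
print(f"run {payload['runId']}: reclaim failed {remaining:.0f}s before takenUntil")
```

By this estimate the reclaim error was logged roughly seven minutes before `takenUntil`. If no later reclaim succeeded, the claim would lapse at that point, which would be consistent with the CLAIM_EXPIRED resolution.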
Found some more info; it looks like dockerd is hanging, and then getting killed:
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717563084805&selected=1752300717563084805) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717563084805&selected=1752300717563084805) [ 5198.767442] INFO: task dockerd:5227 blocked for more than 724 seconds.
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717563084806&selected=1752300717563084806) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717563084806&selected=1752300717563084806) [ 5198.774126] Tainted: G OE 5.4.0-1106-gcp #115~18.04.1-Ubuntu
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717563084807&selected=1752300717563084807) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717563084807&selected=1752300717563084807) [ 5198.781652] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833544&selected=1752300717600833544) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833544&selected=1752300717600833544) [ 5198.789626] dockerd D 0 5227 1 0x00004000
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833545&selected=1752300717600833545) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833545&selected=1752300717600833545) [ 5198.789630] Call Trace:
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833546&selected=1752300717600833546) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833546&selected=1752300717600833546) [ 5198.789638] __schedule+0x293/0x740
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833547&selected=1752300717600833547) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833547&selected=1752300717600833547) [ 5198.789642] schedule+0x33/0xa0
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833549&selected=1752300717600833549) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833549&selected=1752300717600833549) [ 5198.789645] wb_wait_for_completion+0x56/0x90
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027840&selected=1752300717605027840) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027840&selected=1752300717605027840) [ 5198.789648] ? __wake_up_pollfree+0x40/0x40
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027841&selected=1752300717605027841) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027841&selected=1752300717605027841) [ 5198.789649] __writeback_inodes_sb_nr+0x9e/0xc0
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027842&selected=1752300717605027842) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027842&selected=1752300717605027842) [ 5198.789650] writeback_inodes_sb+0x27/0x30
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027843&selected=1752300717605027843) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027843&selected=1752300717605027843) [ 5198.789651] __sync_filesystem+0x51/0x60
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027844&selected=1752300717605027844) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027844&selected=1752300717605027844) [ 5198.789652] sync_filesystem+0x28/0x40
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027845&selected=1752300717605027845) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027845&selected=1752300717605027845) [ 5198.789658] ovl_sync_fs+0x3f/0x60 [overlay]
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027846&selected=1752300717605027846) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027846&selected=1752300717605027846) [ 5198.789659] __sync_filesystem+0x33/0x60
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027847&selected=1752300717605027847) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027847&selected=1752300717605027847) [ 5198.789660] sync_filesystem+0x39/0x40
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027848&selected=1752300717605027848) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027848&selected=1752300717605027848) [ 5198.789662] generic_shutdown_super+0x27/0x120
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027849&selected=1752300717605027849) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027849&selected=1752300717605027849) [ 5198.789662] kill_anon_super+0x12/0x30
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027850&selected=1752300717605027850) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027850&selected=1752300717605027850) [ 5198.789663] deactivate_locked_super+0x48/0x80
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027851&selected=1752300717605027851) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027851&selected=1752300717605027851) [ 5198.789664] deactivate_super+0x40/0x60
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027852&selected=1752300717605027852) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027852&selected=1752300717605027852) [ 5198.789666] cleanup_mnt+0xbd/0x150
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027853&selected=1752300717605027853) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027853&selected=1752300717605027853) [ 5198.789667] __cleanup_mnt+0x12/0x20
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027854&selected=1752300717605027854) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027854&selected=1752300717605027854) [ 5198.789669] task_work_run+0x9d/0xc0
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027855&selected=1752300717605027855) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027855&selected=1752300717605027855) [ 5198.789671] exit_to_usermode_loop+0x109/0x130
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027856&selected=1752300717605027856) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027856&selected=1752300717605027856) [ 5198.789672] do_syscall_64+0x170/0x190
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027857&selected=1752300717605027857) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027857&selected=1752300717605027857) [ 5198.789674] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027858&selected=1752300717605027858) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027858&selected=1752300717605027858) [ 5198.789675] RIP: 0033:0x5645d2d4132e
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027859&selected=1752300717605027859) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027859&selected=1752300717605027859) [ 5198.789680] Code: Bad RIP value.
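To check other workers for the same symptom, a small filter over exported kernel log lines may help. This is only a sketch: the regex is written against the "blocked for more than N seconds" message shown above, the function name is hypothetical, and the sample lines are trimmed copies of the log with the Papertrail prefix removed.

```python
import re

# Matches the kernel hung-task warning as it appears in the log above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (command, pid, blocked_seconds) for each hung-task warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append((match["comm"], int(match["pid"]), int(match["secs"])))
    return hits


# Trimmed sample lines taken from the kernel log above.
sample = [
    "[ 5198.767442] INFO: task dockerd:5227 blocked for more than 724 seconds.",
    "[ 5198.789630] Call Trace:",
    "[ 5198.789645] wb_wait_for_completion+0x56/0x90",
]
print(find_hung_tasks(sample))
```

In the trace above, dockerd is blocked in `wb_wait_for_completion` under `ovl_sync_fs`, that is, waiting for writeback while an overlay mount is being torn down, so filtering for `dockerd` hits should surface the same pattern on other runs.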
https://github.com/moby/moby/issues/44552 is a similar hang. The response there suggests the system could be thrashing (e.g. due to low resources) or it could be a kernel bug. I don't know if the latter would be possible to deal with other than by switching to generic-worker (which is not ready to go yet...).
@eu9ene - Is it possible that we're OOMing, or perhaps running out of disk space, or something like that?
It is possible that we run out of disk space if Taskcluster moves data around after the task has finished. Based on the logs, the script exits successfully: echo '###### Done: Merging datasets' is the last line of it.
I believe it does; can you try starting this again on b-linux-large-gcp-1tb-32-256 to test this theory?
It's completed now. The machine is too big for this task though.
Both b-linux-large-gcp-300gb and b-linux-large-gcp-d2g-1tb are actually n2-custom-32-262144. The only difference is the disk size. It sounds like we should probably create some lower cpu/ram workers for things like this?
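For reference when sizing a smaller worker, the machine type name can be decoded mechanically. The sketch below assumes the usual GCE custom machine type naming of `<family>-custom-<vCPUs>-<memory in MB>`; the function name is hypothetical and variants with an extended-memory suffix are not handled.

```python
import re

# Assumed naming scheme: <family>-custom-<vCPUs>-<memory in MB>.
CUSTOM_TYPE = re.compile(
    r"^(?P<family>[a-z0-9]+)-custom-(?P<vcpus>\d+)-(?P<mem_mb>\d+)$"
)


def parse_custom_machine_type(name):
    """Return (family, vCPUs, memory in GB) for a custom machine type name."""
    match = CUSTOM_TYPE.match(name)
    if match is None:
        raise ValueError(f"not a recognised custom machine type: {name}")
    return match["family"], int(match["vcpus"]), int(match["mem_mb"]) / 1024


print(parse_custom_machine_type("n2-custom-32-262144"))
```

Read this way, n2-custom-32-262144 is 32 vCPUs with 256 GB of RAM, which lines up with the `32-256` suffix in b-linux-large-gcp-1tb-32-256.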
Probably yes, but it requires #414. I'll switch this task to the 1TB version for now.
https://firefox-ci-tc.services.mozilla.com/tasks/Jjou5vQvTHigEP6NtrtnkQ/runs/8
Based on the logs the task has finished correctly but then something happened.