mozilla / firefox-translations-training

Training pipelines for Firefox Translations neural machine translation models
https://mozilla.github.io/firefox-translations-training/
Mozilla Public License 2.0

merge-translated-el-en keeps restarting #760

Closed eu9ene closed 1 month ago

eu9ene commented 1 month ago

https://firefox-ci-tc.services.mozilla.com/tasks/Jjou5vQvTHigEP6NtrtnkQ/runs/8

Based on the logs the task has finished correctly but then something happened.

[task 2024-07-22T19:46:31.563Z] + '[' 334655495 '!=' 334655495 ']'
[task 2024-07-22T19:46:31.563Z] + rm -rf /builds/worker/artifacts/tmp
[task 2024-07-22T19:46:51.702Z] + echo '###### Done: Merging datasets'
[task 2024-07-22T19:46:51.702Z] ###### Done: Merging datasets
[fetches 2024-07-22T19:46:51.702Z] removing /builds/worker/fetches
[fetches 2024-07-22T19:47:06.590Z] finished
[taskcluster 2024-07-22 19:49:52.200Z] === Task Finished ===
[taskcluster:error] Task has been aborted prematurely. Reason: internal-error
[taskcluster 2024-07-22 20:09:35.185Z] Successful task run with exit code: 0 completed in 5275.153 seconds

eu9ene commented 1 month ago

@bhearsum I'm not sure what the reason for the restart is. Can it be 8 preemptions in a row?

bhearsum commented 1 month ago

These are CLAIM_EXPIRED, which is more suggestive of OOM. Nonetheless, I looked at each run for preemptions.

These were preempted: 1 (at 2024-07-20T23:00:40.214910907Z), 6 (at 2024-07-22T18:28:01.554965507Z), 8 (at 2024-07-22T22:55:18.320283407Z).

These were not: 0, 2, 3, 4, 5, 7, 9, 10, 11.

That snippet from run 8 is interesting - it seems to show the task completing more than 2 hours prior to the preemption.
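As a rough check of that gap, the two timestamps can be subtracted directly. This is only a sketch: it assumes the "Successful task run" line at 20:09:35.185Z marks completion (the "Task Finished" line is about 20 minutes earlier, which would only widen the gap), and `parse_utc` is a hypothetical helper written here because the preemption timestamp carries nanoseconds.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a UTC timestamp ending in 'Z', truncating the fraction to microseconds."""
    stamp = stamp.rstrip("Z").replace(" ", "T")
    head, _, frac = stamp.partition(".")
    frac = (frac + "000000")[:6]  # pad or truncate to exactly 6 digits
    parsed = datetime.strptime(f"{head}.{frac}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


# Both values are copied from the logs quoted in this thread (run 8).
finished = parse_utc("2024-07-22 20:09:35.185Z")
preempted = parse_utc("2024-07-22T22:55:18.320283407Z")

gap = preempted - finished
print(f"gap between completion and preemption: {gap}")
print("more than 2 hours:", gap > timedelta(hours=2))
```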

I found this in our papertrail logs, which seems to be the cause of the internal error:

Jul 22 21:36:32Z [translations-1-b-linux-large-gcp-300gb-eqmp3ottractz8kwy1l5aq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-eqmp3ottractz8kwy1l5aq/events?focus=1752157786650869770&selected=1752157786650869770) [docker-worker](https://my.papertrailapp.com/groups/1141234/events?q=program%3Adocker-worker&focus=1752157786650869770&selected=1752157786650869770) 2024/07/22 21:36:32 {"type":"error reclaiming task","source":"top","provisionerId":"translations-1","workerId":"7915041610498507774","workerGroup":"us-central1-f","workerType":"b-linux-large-gcp-300gb","workerNodeType":"projects/887720501152/machineTypes/n2-highmem-32","error":"Could not reclaim task. Error: Timeout of 30000ms exceeded\n    at Request._timeoutError (/home/ubuntu/docker-worker/node_modules/superagent/src/request-base.js:722:15)\n    at Timeout.<anonymous> (/home/ubuntu/docker-worker/node_modules/superagent/src/request-base.js:738:12)\n    at listOnTimeout (internal/timers.js:554:17)\n    at processTimers (internal/timers.js:497:7)","primaryTaskId":"Jjou5vQvTHigEP6NtrtnkQ","primaryRunId":8,"taskId":"Jjou5vQvTHigEP6NtrtnkQ","runId":8,"takenUntil":"2024-07-22T21:43:27.043Z"}

I see the same in run 11. I'm looking into this / asking about it further.
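For anyone repeating this search across runs, a minimal sketch of pulling the structured fields out of such a line follows. It assumes the JSON object is always the last thing on the line and that the prefix contains no `{`. The sample is an abridged copy of the line above (links and stack trace trimmed), and `extract_payload` and `is_reclaim_failure` are hypothetical helper names, not part of docker-worker.

```python
import json

# Abridged copy of the docker-worker line quoted above: the Papertrail links
# and the stack trace inside "error" are trimmed for readability.
SAMPLE_LINE = (
    'Jul 22 21:36:32Z docker-worker 2024/07/22 21:36:32 '
    '{"type":"error reclaiming task","workerType":"b-linux-large-gcp-300gb",'
    '"error":"Could not reclaim task. Error: Timeout of 30000ms exceeded",'
    '"taskId":"Jjou5vQvTHigEP6NtrtnkQ","runId":8,'
    '"takenUntil":"2024-07-22T21:43:27.043Z"}'
)


def extract_payload(line: str) -> dict:
    """Return the JSON object that trails the syslog-style prefix."""
    start = line.index("{")  # assumes the prefix itself contains no '{'
    return json.loads(line[start:])


def is_reclaim_failure(payload: dict) -> bool:
    """True when the payload is the reclaim error seen in this issue."""
    return payload.get("type") == "error reclaiming task"


payload = extract_payload(SAMPLE_LINE)
if is_reclaim_failure(payload):
    print(payload["taskId"], payload["runId"], payload["takenUntil"])
```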

bhearsum commented 1 month ago

Found some more info; it looks like dockerd is hanging, and then getting killed:

Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717563084805&selected=1752300717563084805) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717563084805&selected=1752300717563084805) [ 5198.767442] INFO: task dockerd:5227 blocked for more than 724 seconds.
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717563084806&selected=1752300717563084806) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717563084806&selected=1752300717563084806) [ 5198.774126]       Tainted: G           OE     5.4.0-1106-gcp #115~18.04.1-Ubuntu
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717563084807&selected=1752300717563084807) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717563084807&selected=1752300717563084807) [ 5198.781652] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833544&selected=1752300717600833544) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833544&selected=1752300717600833544) [ 5198.789626] dockerd         D    0  5227      1 0x00004000
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833545&selected=1752300717600833545) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833545&selected=1752300717600833545) [ 5198.789630] Call Trace:
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833546&selected=1752300717600833546) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833546&selected=1752300717600833546) [ 5198.789638]  __schedule+0x293/0x740
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833547&selected=1752300717600833547) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833547&selected=1752300717600833547) [ 5198.789642]  schedule+0x33/0xa0
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717600833549&selected=1752300717600833549) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717600833549&selected=1752300717600833549) [ 5198.789645]  wb_wait_for_completion+0x56/0x90
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027840&selected=1752300717605027840) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027840&selected=1752300717605027840) [ 5198.789648]  ? __wake_up_pollfree+0x40/0x40
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027841&selected=1752300717605027841) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027841&selected=1752300717605027841) [ 5198.789649]  __writeback_inodes_sb_nr+0x9e/0xc0
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027842&selected=1752300717605027842) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027842&selected=1752300717605027842) [ 5198.789650]  writeback_inodes_sb+0x27/0x30
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027843&selected=1752300717605027843) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027843&selected=1752300717605027843) [ 5198.789651]  __sync_filesystem+0x51/0x60
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027844&selected=1752300717605027844) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027844&selected=1752300717605027844) [ 5198.789652]  sync_filesystem+0x28/0x40
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027845&selected=1752300717605027845) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027845&selected=1752300717605027845) [ 5198.789658]  ovl_sync_fs+0x3f/0x60 [overlay]
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027846&selected=1752300717605027846) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027846&selected=1752300717605027846) [ 5198.789659]  __sync_filesystem+0x33/0x60
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027847&selected=1752300717605027847) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027847&selected=1752300717605027847) [ 5198.789660]  sync_filesystem+0x39/0x40
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027848&selected=1752300717605027848) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027848&selected=1752300717605027848) [ 5198.789662]  generic_shutdown_super+0x27/0x120
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027849&selected=1752300717605027849) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027849&selected=1752300717605027849) [ 5198.789662]  kill_anon_super+0x12/0x30
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027850&selected=1752300717605027850) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027850&selected=1752300717605027850) [ 5198.789663]  deactivate_locked_super+0x48/0x80
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027851&selected=1752300717605027851) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027851&selected=1752300717605027851) [ 5198.789664]  deactivate_super+0x40/0x60
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027852&selected=1752300717605027852) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027852&selected=1752300717605027852) [ 5198.789666]  cleanup_mnt+0xbd/0x150
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027853&selected=1752300717605027853) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027853&selected=1752300717605027853) [ 5198.789667]  __cleanup_mnt+0x12/0x20
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027854&selected=1752300717605027854) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027854&selected=1752300717605027854) [ 5198.789669]  task_work_run+0x9d/0xc0
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027855&selected=1752300717605027855) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027855&selected=1752300717605027855) [ 5198.789671]  exit_to_usermode_loop+0x109/0x130
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027856&selected=1752300717605027856) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027856&selected=1752300717605027856) [ 5198.789672]  do_syscall_64+0x170/0x190
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027857&selected=1752300717605027857) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027857&selected=1752300717605027857) [ 5198.789674]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027858&selected=1752300717605027858) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027858&selected=1752300717605027858) [ 5198.789675] RIP: 0033:0x5645d2d4132e
Jul 23 07:04:29Z [translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq](https://my.papertrailapp.com/systems/translations-1-b-linux-large-gcp-300gb-oe3wny4yqlsj35ezt6slrq/events?focus=1752300717605027859&selected=1752300717605027859) [kernel](https://my.papertrailapp.com/groups/1141234/events?q=program%3Akernel&focus=1752300717605027859&selected=1752300717605027859) [ 5198.789680] Code: Bad RIP value.
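To spot the same hang on other workers, the hung-task header line in the trace above can be matched with a regular expression. This is a sketch only: the pattern is modelled on the single message quoted here, the sample lines have the Papertrail prefix trimmed, and `find_hung_tasks` is a hypothetical helper name.

```python
import re

# Modelled on the kernel line quoted above:
#   INFO: task dockerd:5227 blocked for more than 724 seconds.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)

# Kernel lines from the trace above, with the Papertrail prefix trimmed.
SAMPLE_LINES = [
    "[ 5198.767442] INFO: task dockerd:5227 blocked for more than 724 seconds.",
    "[ 5198.789626] dockerd         D    0  5227      1 0x00004000",
    "[ 5198.789645]  wb_wait_for_completion+0x56/0x90",
]


def find_hung_tasks(lines):
    """Return (process name, pid, blocked seconds) for each hung-task report."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match["comm"], int(match["pid"]), int(match["seconds"]))
            )
    return hits


for comm, pid, seconds in find_hung_tasks(SAMPLE_LINES):
    print(f"{comm} (pid {pid}) blocked for more than {seconds}s")
```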

https://github.com/moby/moby/issues/44552 is a similar hang. The response there suggests the system could be thrashing (e.g. due to low resources) or it could be a kernel bug. I don't know if the latter would be possible to deal with other than by switching to generic-worker (which is not ready to go yet...).

@eu9ene - Is it possible that we're OOMing, or perhaps running out of disk space, or something like that?

eu9ene commented 1 month ago

It is possible that we run out of disk space if Taskcluster moves data around after the task has finished. Based on the logs, the script itself exits successfully: echo '###### Done: Merging datasets' is its last line.
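One way to test the disk-space theory from inside the task would be to compare the size of the artifacts directory with the free space left on its filesystem just before the script exits. The sketch below assumes any post-task copy of the artifacts would land on the same filesystem, which is not confirmed anywhere in this thread. The path comes from the log above, and `dir_size_bytes` and `headroom_after_copy` are hypothetical helper names.

```python
import os
import shutil


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def headroom_after_copy(artifacts_dir: str) -> int:
    """Free bytes that would remain if the directory were copied once more.

    Assumption: the copy is written to the same filesystem as artifacts_dir.
    A negative result means a second copy would not fit.
    """
    free = shutil.disk_usage(artifacts_dir).free
    return free - dir_size_bytes(artifacts_dir)


if __name__ == "__main__":
    # Path taken from the task log above; adjust for other tasks.
    artifacts = "/builds/worker/artifacts"
    if os.path.isdir(artifacts):
        print("headroom (bytes):", headroom_after_copy(artifacts))
```

A negative or near-zero value would support the theory, while a large positive one would point back at memory pressure or the kernel hang.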

bhearsum commented 1 month ago

I believe it does; can you try starting this again on b-linux-large-gcp-1tb-32-256 to test this theory?

eu9ene commented 1 month ago

running https://firefox-ci-tc.services.mozilla.com/tasks/QktAK7xERJWGuYKdqRJwZg/runs/0/logs/live/public/logs/live.log

eu9ene commented 1 month ago

It's completed now. The machine is too big for this task though.

bhearsum commented 1 month ago

> It's completed now. The machine is too big for this task though.

Both b-linux-large-gcp-300gb and b-linux-large-gcp-d2g-1tb are actually n2-custom-32-262144. The only difference is the disk size. It sounds like we should probably create some lower cpu/ram workers for things like this?

eu9ene commented 1 month ago

Probably yes, but it requires #414. I'll switch this task to the 1TB version for now.