kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

OCI Runtime error for init-pytorch on AKS #275

Closed wangdian closed 4 years ago

wangdian commented 4 years ago

Hi team, when we use pytorch-operator to submit a training job, it sometimes fails on the init container init-pytorch. The memory limit seems to be too low: in pytorch-operator the limit for this init container is 10Mi, and I wonder if that is the cause. https://github.com/kubeflow/pytorch-operator/blob/b7fef224fef1ef0117f6e74961b557270fcf4b04/pkg/common/config/config.go#L19

The pod events look like this:

Events:
Type  Reason  Age  From  Message
Warning  FailedScheduling  7m10s  volcano  2/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 2 minAvailable.
Warning  FailedScheduling  6m3s  volcano  2/2 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 Pipelined, 2 minAvailable.
Normal  Scheduled  5m29s  volcano  Successfully assigned default/job-cmk8s-pytorch-1590738244-a8b20d9e-worker-0 to aks-agentpool-17822794-vmss000001
Normal  Pulling  5m27s  kubelet, aks-agentpool-17822794-vmss000001  Pulling image "cmaksacr.azurecr.io/cmk8s/mlc/azureml-setup:latest"
Normal  Pulled  5m27s  kubelet, aks-agentpool-17822794-vmss000001  Successfully pulled image "cmaksacr.azurecr.io/cmk8s/mlc/azureml-setup:latest"
Normal  Created  5m26s  kubelet, aks-agentpool-17822794-vmss000001  Created container azureml-setup
Normal  Started  5m26s  kubelet, aks-agentpool-17822794-vmss000001  Started container azureml-setup
Warning  Failed  4m28s (x4 over 5m23s)  kubelet, aks-agentpool-17822794-vmss000001  Error: failed to start container "init-pytorch": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\"failed to write 10485760 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/kubepods/besteffort/pode6e42a23-9d58-4ad7-9a89-9e589b8c7823/init-pytorch/memory.limit_in_bytes: device or resource busy\\"\"": unknown
Normal  Pulled  3m37s (x5 over 5m24s)  kubelet, aks-agentpool-17822794-vmss000001  Container image "alpine:3.10" already present on machine
Normal  Created  3m37s (x5 over 5m23s)  kubelet, aks-agentpool-17822794-vmss000001  Created container init-pytorch
Warning  BackOff  17s (x22 over 4m55s)  kubelet, aks-agentpool-17822794-vmss000001  Back-off restarting failed container
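
As a quick sanity check, the value 10485760 that the runtime failed to write to memory.limit_in_bytes appears to be exactly the 10Mi init container limit expressed in bytes. The sketch below only confirms that arithmetic. The helper name `parse_binary_quantity` is hypothetical, and it handles only integer values with Ki/Mi/Gi suffixes rather than the full Kubernetes quantity syntax.

```python
import re

# Binary (power-of-two) suffixes used in Kubernetes resource quantities.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_binary_quantity(quantity: str) -> int:
    """Convert a quantity such as "10Mi" to a byte count.

    Hypothetical helper for illustration: it accepts only integer values
    with a Ki/Mi/Gi suffix, not the full Kubernetes quantity syntax.
    """
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    value, suffix = match.groups()
    return int(value) * _BINARY_SUFFIXES[suffix]


# Value the runtime tried to write to memory.limit_in_bytes in the event above.
limit_from_error = 10485760

# Compare the configured limit (10Mi) with the value in the error message.
print(parse_binary_quantity("10Mi") == limit_from_error)
```

This shows only that the number in the error comes from the 10Mi limit. It does not by itself establish whether the limit is the root cause of the "device or resource busy" failure.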

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.58


johnugeorge commented 4 years ago

Can you try again with the latest build?

Higher resource requests were added in https://github.com/kubeflow/pytorch-operator/pull/276/files

wangdian commented 4 years ago

The new resource limit works. Thanks! Sorry for the late response.