Closed wangdian closed 4 years ago
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
kind/bug | 0.58 |
Can you try again with the latest build?
Higher resource requests were added in https://github.com/kubeflow/pytorch-operator/pull/276/files
The new resource limit works. Thanks! Sorry for the late response.
Hi team: When we use pytorch-operator to submit a training job, it sometimes fails on the init container init-pytorch. The memory limit seems to be too low. I found that pytorch-operator sets the limit for this init container to 10Mi, and I wonder if that is the cause. https://github.com/kubeflow/pytorch-operator/blob/b7fef224fef1ef0117f6e74961b557270fcf4b04/pkg/common/config/config.go#L19
The pod events look like the following:
Events:
Type Reason Age From Message
Warning FailedScheduling 7m10s volcano 2/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 2 minAvailable.
Warning FailedScheduling 6m3s volcano 2/2 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 Pipelined, 2 minAvailable.
Normal Scheduled 5m29s volcano Successfully assigned default/job-cmk8s-pytorch-1590738244-a8b20d9e-worker-0 to aks-agentpool-17822794-vmss000001
Normal Pulling 5m27s kubelet, aks-agentpool-17822794-vmss000001 Pulling image "cmaksacr.azurecr.io/cmk8s/mlc/azureml-setup:latest"
Normal Pulled 5m27s kubelet, aks-agentpool-17822794-vmss000001 Successfully pulled image "cmaksacr.azurecr.io/cmk8s/mlc/azureml-setup:latest"
Normal Created 5m26s kubelet, aks-agentpool-17822794-vmss000001 Created container azureml-setup
Normal Started 5m26s kubelet, aks-agentpool-17822794-vmss000001 Started container azureml-setup
Warning Failed 4m28s (x4 over 5m23s) kubelet, aks-agentpool-17822794-vmss000001 Error: failed to start container "init-pytorch": Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:390: setting cgroup config for procHooks process caused \\"failed to write 10485760 to memory.limit_in_bytes: write /sys/fs/cgroup/memory/kubepods/besteffort/pode6e42a23-9d58-4ad7-9a89-9e589b8c7823/init-pytorch/memory.limit_in_bytes: device or resource busy\\"\"": unknown
Normal Pulled 3m37s (x5 over 5m24s) kubelet, aks-agentpool-17822794-vmss000001 Container image "alpine:3.10" already present on machine
Normal Created 3m37s (x5 over 5m23s) kubelet, aks-agentpool-17822794-vmss000001 Created container init-pytorch
Warning BackOff 17s (x22 over 4m55s) kubelet, aks-agentpool-17822794-vmss000001 Back-off restarting failed container
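As a quick sanity check, the value 10485760 that the runtime fails to write to memory.limit_in_bytes is exactly the 10Mi limit mentioned above, expressed in bytes. The sketch below shows the arithmetic. The helper `quantity_to_bytes` is a hypothetical name used only for illustration, it handles only binary suffixes and plain integers, and it is not code from pytorch-operator or Kubernetes.

```python
# Binary-suffix multipliers used by Kubernetes-style memory quantities.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '10Mi' to bytes.

    Hypothetical helper: supports only binary suffixes and plain integers.
    """
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


limit_bytes = quantity_to_bytes("10Mi")
print(limit_bytes)  # 10485760, the value in the cgroup error message
```

This only confirms that the failing write corresponds to the init container's 10Mi limit. It does not by itself explain why the cgroup write returns "device or resource busy".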