What happened?
I am currently running hyperparameter tuning, which creates 60 × 4 (240) pods within my Kubeflow pipeline. During execution, the DAG driver pod fails to complete its initialization, which prevents the pipeline from continuing.
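As a sanity check, the actual pod count for the run can be confirmed directly (a sketch; it assumes all of the run's pods land in the user namespace and share the pipeline-run prefix visible in the status output below):

# Count the pods this pipeline run created in the user namespace
$ kubectl get pods -n kubeflow-user-example-com --no-headers | grep -c auto-digits-pipeline-half-complex2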
What did you expect to happen?
I expected the DAG driver pod to initialize successfully so that the pipeline could continue and run all trial pods.
Pod Status: I observed the status of one of the DAG driver pods:
kubeflow-user-example-com auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 0/2 Init:StartError 0 84m
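Init:StartError means one of the pod's init containers failed to start. Describing the pod shows which init container it was and the reason (pod name and namespace taken from the status line above):

# Full event trail and per-container state for the stuck driver pod
$ kubectl describe pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 \
    -n kubeflow-user-example-com

# Just the init containers and their current state
$ kubectl get pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 \
    -n kubeflow-user-example-com \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'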
Logs Check: When I tried to fetch the logs for the pod, I received the following message:
$ kubectl logs auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com
Error from server (BadRequest): container "main" in pod "auto-digits-pipeline-half-complex2-xdlfv-system-dag-driver-3285612689" is waiting to start: PodInitializing
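Because "main" never starts, logs have to be pulled from the failing init container instead of the default container. In Argo-managed pods the init container is conventionally named "init" (an assumption here; kubectl describe on the pod shows the actual name):

# Pull logs from the init container rather than "main";
# "init" is the usual Argo Workflows name - verify via kubectl describe
$ kubectl logs auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 \
    -n kubeflow-user-example-com -c init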
The PodInitializing message suggested that the DAG pod might be stuck in initialization because of the large number of pods executing concurrently.
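If concurrency really is the trigger, Argo Workflows can cap the number of in-flight pods per workflow via its spec.parallelism field. The sketch below assumes the Workflow object shares the pipeline-run prefix of the driver pod; verify the name first:

# List the run's workflow objects, then cap concurrently running pods
$ kubectl get workflows -n kubeflow-user-example-com
$ kubectl patch workflow auto-digits-pipeline-half-complex2-tvdqj \
    -n kubeflow-user-example-com --type merge -p '{"spec":{"parallelism":10}}'

When authoring the pipeline, the KFP v2 SDK exposes the same knob as the parallelism argument of dsl.ParallelFor.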
Configuration Investigation: I attempted to locate the ConfigMap associated with the DAG to extend the initialization time limit, as I suspected that the pod timeout might be too short. I used the following command:
kubectl get cm -n kubeflow
However, I could not find a ConfigMap containing relevant parameters to control the DAG pod startup timeout.
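For reference, the Argo-side knobs for Kubeflow Pipelines normally live in the workflow controller's ConfigMap rather than in a DAG-specific one. The name below is the default in a stock install and is an assumption; the grep confirms what actually exists:

# Narrow the ConfigMap search to the Argo workflow controller
$ kubectl get cm -n kubeflow | grep -i workflow

# Dump the controller config ("workflow-controller-configmap" is the
# default name in a standard KFP install - verify with the grep above)
$ kubectl get cm workflow-controller-configmap -n kubeflow -o yaml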
Cluster Events: Upon further investigation by listing the cluster events:
kubectl get events
I found the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: fork/exec /kind/bin/mount-product-files.sh: argument list too long: unknown
The error suggests that the initialization failure is caused by the number or combined size of the arguments passed to /kind/bin/mount-product-files.sh exceeding the kernel's limit, which makes container creation fail. (The /kind/bin path suggests the hook is installed by kind, i.e. the cluster nodes are running as Kubernetes-in-Docker containers.)
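"argument list too long" is E2BIG from execve(2), and the limit it enforces (ARG_MAX) covers the argument list and the environment combined, so a very large injected environment can trip it even when the hook itself takes few arguments. Two quick checks (a sketch; run getconf on the affected node, and substitute any running pod for the placeholder):

# Kernel limit on argv + environment for execve, in bytes (run on the node)
$ getconf ARG_MAX

# Rough size of the environment a pod in this namespace receives;
# Kubernetes injects several variables for every Service in the namespace
$ kubectl exec <any-running-pod> -n kubeflow-user-example-com -- sh -c 'env | wc -c'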
Question
How can I modify the corresponding parameters to avoid this "argument list too long" issue during the container initialization phase? Specifically, I would appreciate guidance on:
Identifying the appropriate ConfigMap or configuration where I can modify the initialization settings for the DAG pods.
Mitigating the "argument list too long" issue, possibly by optimizing or limiting the number of mounted files or arguments (one candidate lever is sketched after this list).
Any insights or suggestions on how to address this issue would be greatly appreciated.
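One lever worth naming explicitly, offered as an assumption rather than a confirmed fix: every Service in a namespace adds a set of environment variables to every pod scheduled there, so a run that creates many per-trial Services inflates the environment passed through fork/exec and can push it past ARG_MAX. The pod-spec field enableServiceLinks: false (available since Kubernetes v1.13) switches that injection off. A quick way to gauge exposure:

# Each Service adds env vars to every pod in the namespace unless the
# pod sets enableServiceLinks: false in its spec
$ kubectl get svc -n kubeflow-user-example-com --no-headers | wc -l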
Environment
Kubernetes version: v1.31.0
$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.0
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍