kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[bug] When running pipeline code, the DAG pod always stays in status Init:StartError #11422

Open Epochex opened 1 week ago

Epochex commented 1 week ago

What happened?

I am currently running hyperparameter tuning, which creates 60 × 4 = 240 pods within my Kubeflow pipeline. During execution, the DAG driver pod is unable to complete its initialization, which prevents the pipeline from continuing.

What did you expect to happen?

Pod Status: I observed the status of one of the DAG driver pods:

kubeflow-user-example-com   auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148   0/2   Init:StartError   0   84m

Logs Check: When I tried to fetch the logs for the pod, I received the following message:

kubectl logs auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 -n kubeflow-user-example-com
Error from server (BadRequest): container "main" in pod "auto-digits-pipeline-half-complex2-xdlfv-system-dag-driver-3285612689" is waiting to start: PodInitializing
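Since the pod status is Init:StartError, the failure is in an init container rather than in "main", which is why the logs request above only reports PodInitializing. One way to reach the actual error is to inspect the pod and query the init container directly; the container name here follows the usual Argo Workflows layout (an init container named "init") and may differ on a given cluster:

```shell
# Show per-container state, including the init container's exit reason.
kubectl describe pod auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 \
  -n kubeflow-user-example-com

# Fetch logs from the init container instead of "main"
# (confirm the name against the describe output above).
kubectl logs auto-digits-pipeline-half-complex2-tvdqj-system-dag-driver-1079452148 \
  -n kubeflow-user-example-com -c init
```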

This suggested that the DAG pod might not be initializing because of the large number of pods that need to run concurrently.

Configuration Investigation: I attempted to locate the ConfigMap associated with the DAG driver so that I could extend the initialization time limit, since I suspected the pod startup timeout might be too short. I used the following command:

kubectl get cm -n kubeflow

However, I could not find a ConfigMap containing parameters that control the DAG pod startup timeout.

Cluster Events: Upon further investigation, I listed the cluster events:

kubectl get events

and found the following error:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: fork/exec /kind/bin/mount-product-files.sh: argument list too long: unknown

The error suggests that container creation fails because the combined size of the arguments and environment passed to /kind/bin/mount-product-files.sh exceeds the kernel's per-exec limit, so the runtime hook cannot even be started. The /kind/bin path also indicates that this hook is injected by the kind (Kubernetes-in-Docker) node image rather than by Kubeflow itself.
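For reference, the limit behind "argument list too long" can be probed from userspace. This is a minimal stdlib sketch, not Kubeflow-specific: `execve()` bounds the combined size of argv and the environment by ARG_MAX, and exceeding it fails with E2BIG, which is exactly the message the hook reports:

```python
import errno
import os
import subprocess

# execve() bounds the combined size of argv[] and envp[] by ARG_MAX;
# exceeding it fails with E2BIG ("argument list too long").
arg_max = os.sysconf("SC_ARG_MAX")
print(f"ARG_MAX on this machine: {arg_max} bytes")

# Deliberately exceed the limit with one oversized argument.
try:
    subprocess.run(["true", "x" * (arg_max + 1)])
    raised = None
except OSError as exc:
    raised = exc.errno

print("exec errno:", raised, "->", os.strerror(raised) if raised else "ok")
```

Because argv and the environment count against the same limit, a container environment bloated by many injected variables (for example, Kubernetes service environment variables in a namespace with hundreds of pods) can trigger this even when the hook's explicit arguments are small.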

Question

How can I modify the corresponding parameters to avoid this "argument list too long" issue during the container initialization phase? Specifically, I would appreciate guidance on:

Identifying the appropriate ConfigMap or configuration where I can modify the initialization settings for the DAG pods.

Mitigating the "argument list too long" issue, possibly by optimizing or limiting the number of mounted files or arguments.

Any insights or suggestions on how to address this issue would be greatly appreciated.

Environment

Kubernetes version: v1.31.0

$ kubectl version
Client Version: v1.31.2
Kustomize Version: v5.4.2
Server Version: v1.31.0

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

Training Operator Python SDK version:

$ pip show kubeflow-training

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍 reactions.

andreyvelich commented 20 hours ago

Hi @Epochex, I think this issue is related to Kubeflow Pipelines, not the Kubeflow Training Operator. /transfer pipelines