kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.57k stars, 682 forks

Back-off pulling image "alpine:3.10" #2204

Open lizu18xz opened 1 month ago

lizu18xz commented 1 month ago

What happened?

What command can be used to change the image name in initContainers?

What did you expect to happen?

To be able to modify the image name in initContainers.

Environment

Kubernetes version:

$ kubectl version

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:v1-855e096

Training Operator Python SDK version:

$ pip show kubeflow-training

wucb commented 4 weeks ago

Modify the training-operator config: set --pytorch-init-container-image=xxx in the command.

tenzen-y commented 3 weeks ago

Modify the training-operator config: set --pytorch-init-container-image=xxx in the command.

Yes, you can pass an arbitrary initContainer image via the operator command: https://github.com/kubeflow/training-operator/blob/ff19a10842deba1fdb553a8b89e332731bf4f3ad/cmd/training-operator.v1/main.go#L98-L99

/remove-kind bug
/kind question

lizu18xz commented 3 weeks ago

Where is the training-operator config? I found deployment.yaml, but I don't know how to set pytorch-init-container-image.

lizu18xz commented 3 weeks ago

How do I use this command? I only found deployment.yaml. I couldn't find a training-operator config when running kubectl get cm -A. Can you explain it in detail? Thank you.

Syulin7 commented 3 weeks ago

@lizu18xz

kubectl edit deploy training-operator -n kubeflow

    spec:
      containers:
      - command:
        - /manager
        - --pytorch-init-container-image=your-image
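
For context, here is a rough sketch of where that flag sits in the full Deployment (the deployment.yaml mentioned above). It is an excerpt, not a complete manifest to apply as-is. The container name and the image reference `your-registry.example.com/alpine:3.10` are assumptions for illustration; only the added `--pytorch-init-container-image` entry is the actual change.

```yaml
# Excerpt of the training-operator Deployment; only the last command entry is new.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator        # name used in the kubectl edit command above
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
        - name: training-operator                        # assumed container name
          image: kubeflow/training-operator:v1-855e096   # version reported in this issue
          command:
            - /manager
            # Flag defined in cmd/training-operator.v1/main.go.
            # Hypothetical image reference: use one your cluster can actually pull.
            - --pytorch-init-container-image=your-registry.example.com/alpine:3.10
```

After saving the edit, `kubectl rollout status deploy/training-operator -n kubeflow` can be used to wait for the operator to restart. Pods that already exist keep their old init container image, so the change most likely only applies to jobs created afterwards.
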

andreyvelich commented 3 weeks ago

/remove-label lifecycle/needs-triage