kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Training Operator - panic: runtime error: index out of range #1842

Open srinandan opened 1 year ago

srinandan commented 1 year ago

WHAT DID YOU DO:

Deployed Kubeflow 1.7.0 to a 1.25.8-gke.1000 GKE cluster. The training-operator image installed is: kubeflow/training-operator:v1-5a5f92d

EXPECTED:

I started a run for a pipeline (kfp version 1.8) and expected the training job to start.

ACTUAL:

The training-operator pod crashes and goes into CrashLoopBackOff.

Logs from the container:

1.6881420113012216e+09  INFO    Starting workers    {"controller": "paddlejob-controller", "worker count": 1}
E0630 16:20:11.301812       1 runtime.go:79] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)
goroutine 4639 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x19fbd00?, 0xc000b82180})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x4172eb?})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/runtime/runtime.go:49 +0x75
panic({0x19fbd00, 0xc000b82180})
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.hasDefaultPort(...)
    /workspace/pkg/apis/kubeflow.org/v1/defaulting_utils.go:21
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.setPytorchDefaultPort(...)
    /workspace/pkg/apis/kubeflow.org/v1/pytorch_defaults.go:31
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.SetDefaults_PyTorchJob(0xc00011a178)
    /workspace/pkg/apis/kubeflow.org/v1/pytorch_defaults.go:80 +0x5e5
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.SetObjectDefaults_PyTorchJob(...)
    /workspace/pkg/apis/kubeflow.org/v1/zz_generated.defaults.go:79
github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1.RegisterDefaults.func7({0x1ac1260?, 0xc00011a178?})
    /workspace/pkg/apis/kubeflow.org/v1/zz_generated.defaults.go:36 +0x32
k8s.io/apimachinery/pkg/runtime.(*Scheme).Default(0xc000535c70?, {0x1d8fb80?, 0xc00011a178})
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/runtime/scheme.go:347 +0xa4
github.com/kubeflow/training-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).onOwnerCreateFunc.func1({{0x1dbb288?, 0xc00011a178?}})
    /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:567 +0x6a
sigs.k8s.io/controller-runtime/pkg/predicate.Funcs.Create(...)
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/predicate/predicate.go:72
sigs.k8s.io/controller-runtime/pkg/source/internal.EventHandler.OnAdd({{0x1da59a8, 0x2aef128}, {0x1daead8, 0xc000316dc0}, {0xc0006bd950, 0x1, 0x1}}, {0x1ac1260?, 0xc00011a178})
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/source/internal/eventsource.go:57 +0x297
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/shared_informer.go:818 +0x134
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x10002af0680?)
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/wait/wait.go:157 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00027f738?, {0x1d8be20, 0xc000870000}, 0x1, 0xc000716360)
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/wait/wait.go:158 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000c31927?, 0x3b9aca00, 0x0, 0x71?, 0xc00027f7b0?)
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/wait/wait.go:135 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/wait/wait.go:92
k8s.io/client-go/tools/cache.(*processorListener).run(0xc0002fe680?)
    /go/pkg/mod/k8s.io/client-go@v0.25.3/tools/cache/shared_informer.go:812 +0x6b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/wait/wait.go:75 +0x5a
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start
    /go/pkg/mod/k8s.io/apimachinery@v0.25.3/pkg/util/wait/wait.go:73 +0x85
panic: runtime error: index out of range [0] with length 0 [recovered]
    panic: runtime error: index out of range [0] with length 0

srinandan commented 1 year ago

The problem is in the PyTorch launcher. The training operator appears to panic when the master spec is empty ({}). My PyTorchJob did not define a master spec.
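
For anyone picking this up, here is a minimal sketch of the failure mode. The trace ends in hasDefaultPort with "index out of range [0] with length 0", which is consistent with reading the first container of a replica spec that has no containers. The types and function names below (ReplicaSpec, unsafeHasDefaultPort, safeHasDefaultPort, panics) are hypothetical stand-ins written for illustration, not the operator's actual code, and the port name is only an example value.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the operator's API types.
type ContainerPort struct {
	Name string
	Port int32
}

type Container struct {
	Name  string
	Ports []ContainerPort
}

type PodSpec struct {
	Containers []Container
}

// ReplicaSpec models a replica spec; an empty spec ({}) has zero containers.
type ReplicaSpec struct {
	Template PodSpec
}

// unsafeHasDefaultPort indexes Containers[0] unconditionally, so it panics
// when the spec has no containers.
func unsafeHasDefaultPort(spec *ReplicaSpec, portName string) bool {
	for _, p := range spec.Template.Containers[0].Ports {
		if p.Name == portName {
			return true
		}
	}
	return false
}

// safeHasDefaultPort returns false for a nil spec or a spec without
// containers instead of indexing into an empty slice.
func safeHasDefaultPort(spec *ReplicaSpec, portName string) bool {
	if spec == nil || len(spec.Template.Containers) == 0 {
		return false
	}
	for _, p := range spec.Template.Containers[0].Ports {
		if p.Name == portName {
			return true
		}
	}
	return false
}

// panics reports whether calling f results in a panic.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	empty := &ReplicaSpec{} // equivalent of an empty master spec
	fmt.Println("unsafe panics:", panics(func() {
		unsafeHasDefaultPort(empty, "pytorchjob-port")
	}))
	fmt.Println("safe result:", safeHasDefaultPort(empty, "pytorchjob-port"))
}
```

With an empty spec, the unguarded version panics and the guarded version returns false. Whether the operator should skip defaulting in this case or reject the job during validation is a separate design question that this sketch does not settle.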

kuizhiqing commented 1 year ago

Yes, this is confusing: the master spec currently has to be set even in collective training mode. We may fix this soon.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

johnugeorge commented 1 year ago

/good-first-issue

google-oss-prow[bot] commented 1 year ago

@johnugeorge: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/kubeflow/training-operator/issues/1842):

> /good-first-issue

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 10 months ago

/lifecycle frozen

sandipanpanda commented 8 months ago

I am interested in working on this good first issue. Can you explain the recommended solution? /assign