kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Training Operator in CrashLoopBackOff #1841

Closed srinandan closed 1 year ago

srinandan commented 1 year ago

I have reviewed issue #1717, and it did not help.

WHAT DID YOU DO:

Deployed Kubeflow 1.7.0 to a 1.25.8-gke.1000 GKE cluster.

EXPECTED:

The training operator runs without failure.

ACTUAL:

The training operator constantly restarts in CrashLoopBackOff.

Logs from the container:

I0627 04:11:15.507256       1 request.go:682] Waited for 1.032730652s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/networking.gke.io/v1beta2?timeout=32s
1.687839078964336e+09    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": ":8080"}
I0627 04:11:26.106159       1 request.go:682] Waited for 1.045492516s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/coordination.k8s.io/v1?timeout=32s
I0627 04:11:36.106626       1 request.go:682] Waited for 5.045732967s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/dlp.cnrm.cloud.google.com/v1beta1?timeout=32s
I0627 04:11:46.107104       1 request.go:682] Waited for 3.046650022s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/cert-manager.io/v1?timeout=32s
I0627 04:11:56.156930       1 request.go:682] Waited for 1.095646314s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/networking.k8s.io/v1?timeout=32s
I0627 04:12:06.206975       1 request.go:682] Waited for 5.146861559s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/redis.cnrm.cloud.google.com/v1beta1?timeout=32s
I0627 04:12:16.256286       1 request.go:682] Waited for 3.195246168s due to client-side throttling, not priority and fairness, request: GET:https://10.28.208.1:443/apis/logging.cnrm.cloud.google.com/v1beta1?timeout=32s

I also tried increasing the resources (to 1 GB of memory and 1 CPU), but it did not help.

ohtaro commented 1 year ago

I have the same problem.

johnugeorge commented 1 year ago

@srinandan We haven't seen this in our testing. Could this be something specific to your GKE setup? We test continuously in our CI and have not hit any issues. Can you allocate much higher resources and try again?

@ohtaro which is your K8s distribution?
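
For reference, resources can be raised with a kustomize patch like the one below (the file name and values are only illustrative, not taken from the Kubeflow manifests):

patch-training-operator-resources.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
      - name: training-operator
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi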

tenzen-y commented 1 year ago

@srinandan I suspect the kube-apiserver is limiting the number of API calls from the namespace in which the training operator is deployed, depending on the API Priority and Fairness (APF) configuration.

This limitation is not related to the training operator itself, so you should investigate which component in your cluster is sending so many requests to the kube-apiserver.

In the future, the training operator should provide a way to set a priority for its kube-client. cc @johnugeorge
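
If APF does turn out to be the limiting factor, one way to give a specific client higher priority is a FlowSchema that matches its service account. A minimal sketch (the service-account name, namespace, and priority level here are assumptions, not taken from this cluster):

flowschema-training-operator.yaml

apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: training-operator-high
spec:
  priorityLevelConfiguration:
    name: workload-high          # an existing, suggested priority level
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: training-operator  # assumed service-account name
        namespace: kubeflow
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]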

ohtaro commented 1 year ago

@ohtaro which is your K8s distribution?

I use a 1.24.12-gke.1000 GKE cluster with Kubeflow 1.7.0.

johnugeorge commented 1 year ago

It looks like it is related to the GKE environment. Can you check the deployment events to see whether it is an OOM kill? Is there anything else notable about your GKE environment?
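
For example, the last terminated state of the container shows whether it was OOM-killed (the pod name below is a placeholder):

$ kubectl -n kubeflow get pod <training-operator-pod-name> \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

This prints OOMKilled if the container was killed for exceeding its memory limit.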

ohtaro commented 1 year ago

Looking at the logs and events, OOM does not appear to be the cause.

resources

$ kubectl get deploy training-operator -o jsonpath='{.spec.template.spec.containers[].resources}'
{"requests":{"cpu":"1","memory":"2Gi"}}

logs

$ kubectl logs -f  training-operator-c5568476f-pklrg 
I0628 07:32:05.061752       1 request.go:682] Waited for 1.035932043s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/autoscaling/v2beta1?timeout=32s
1.6879375290677645e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": ":8080"}
I0628 07:32:15.111836       1 request.go:682] Waited for 5.945437309s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/serving.kserve.io/v1alpha1?timeout=32s
I0628 07:32:25.160946       1 request.go:682] Waited for 2.893194924s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/servicenetworking.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:32:35.161462       1 request.go:682] Waited for 6.341080517s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/cilium.io/v2?timeout=32s
I0628 07:32:45.161650       1 request.go:682] Waited for 3.243134713s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/extensions.istio.io/v1alpha1?timeout=32s
I0628 07:32:56.060789       1 request.go:682] Waited for 1.046218439s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s

describe

$ kubectl describe pod training-operator-c5568476f-pklrg
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  32m                   default-scheduler  Successfully assigned kubeflow/training-operator-c5568476f-pklrg to gke-shared-test-clus-shared-test-e2-s-ee06b6d8-7gk9
  Normal   Created    31m (x2 over 32m)     kubelet            Created container training-operator
  Normal   Started    31m (x2 over 32m)     kubelet            Started container training-operator
  Warning  Unhealthy  31m                   kubelet            Readiness probe failed: Get "http://10.2.0.151:8081/readyz": read tcp 10.2.0.129:41126->10.2.0.151:8081: read: connection reset by peer
  Normal   Pulled     30m (x3 over 32m)     kubelet            Container image "kubeflow/training-operator:v1-d0fb5c0" already present on machine
  Warning  Unhealthy  30m (x6 over 31m)     kubelet            Liveness probe failed: Get "http://10.2.0.151:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Normal   Killing    30m (x2 over 31m)     kubelet            Container training-operator failed liveness probe, will be restarted
  Warning  Unhealthy  30m                   kubelet            Readiness probe failed: Get "http://10.2.0.151:8081/readyz": read tcp 10.2.0.129:55492->10.2.0.151:8081: read: connection reset by peer
  Warning  BackOff    7m23s (x62 over 26m)  kubelet            Back-off restarting failed container
  Warning  Unhealthy  2m16s (x45 over 32m)  kubelet            Readiness probe failed: Get "http://10.2.0.151:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ohtaro commented 1 year ago

When I turned off the liveness and readiness probes, it started successfully.

$ kubectl logs -f training-operator-7df886f775-fbcz8
I0628 07:46:57.232885       1 request.go:682] Waited for 1.034094333s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/sql.cnrm.cloud.google.com/v1beta1?timeout=32s
1.6879384212421863e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": ":8080"}
I0628 07:47:07.233006       1 request.go:682] Waited for 5.892111591s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/memcache.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:47:17.233349       1 request.go:682] Waited for 2.795872533s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/dataproc.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:47:27.283351       1 request.go:682] Waited for 6.295453954s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/apiregistration.k8s.io/v1?timeout=32s
I0628 07:47:37.283418       1 request.go:682] Waited for 3.195678051s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/eventing.knative.dev/v1?timeout=32s
I0628 07:47:48.233334       1 request.go:682] Waited for 1.046508935s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/logging.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:47:58.283216       1 request.go:682] Waited for 4.545396598s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/networking.istio.io/v1beta1?timeout=32s
I0628 07:48:08.283348       1 request.go:682] Waited for 1.446039772s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/node.k8s.io/v1beta1?timeout=32s
I0628 07:48:18.332904       1 request.go:682] Waited for 4.945341862s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/gkehub.cnrm.cloud.google.com/v1beta1?timeout=32s
1.6879384998425412e+09  INFO    setup   starting manager
1.6879384998429692e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6879384998430748e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6879384998433008e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6879384998433428e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6879384998433518e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.687938499843358e+09   INFO    Starting Controller {"controller": "xgboostjob-controller"}
...
johnugeorge commented 1 year ago

You can set higher timeout values for the readiness and liveness probes in the deployment manifests.
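
For example, a patch along these lines relaxes the existing probes (the paths and port are the ones shown in the events above; the file name and timing values are only illustrative):

patch-training-operator-probes.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
      - name: training-operator
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8081
          initialDelaySeconds: 60
          timeoutSeconds: 10
          periodSeconds: 15
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8081
          initialDelaySeconds: 30
          timeoutSeconds: 10
          periodSeconds: 15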

ohtaro commented 1 year ago

Thank you!!

I added a startupProbe with a kustomize patch as follows.

example/patch-deploy-training-operator.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
      - name: training-operator
        startupProbe:
          failureThreshold: 60
          periodSeconds: 5
          httpGet:
            path: /healthz
            port: 8081
            scheme: HTTP

example/kustomization.yaml

...
patches:
- path: patch-deploy-training-operator.yaml
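
With this startupProbe, the kubelet allows up to failureThreshold × periodSeconds = 60 × 5s = 300s for /healthz to start responding, and the liveness and readiness probes only take over after the startup probe succeeds.
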
srinandan commented 1 year ago

FWIW, for me the problem eventually goes away on its own; I didn't have to make any changes. I have repeated the install a few times since, and each time the problem eventually resolves itself.

johnugeorge commented 1 year ago

It looks like the timeout values are borderline in your case.

hongbo-miao commented 1 year ago

I had the same issue with the Kubeflow Training Operator in a local k3s Kubernetes cluster created by Rancher Desktop.

Thanks to @johnugeorge and @ohtaro for the answers! 😃 Here is a summary of my final working version, to save time for anyone who runs into the same issue.

kubectl apply --kustomize=kubeflow-training-operator

kubeflow-training-operator/kustomization.yaml

resources:
  - github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0
patches:
  - path: training-operator-deployment-patch.yaml

kubeflow-training-operator/training-operator-deployment-patch.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
        - name: training-operator
          startupProbe:
            httpGet:
              path: /healthz
              port: 8081
            failureThreshold: 30
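
With the default periodSeconds of 10s, failureThreshold: 30 gives the container up to 300s to pass its first /healthz check before the regular probes start.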

Then it got deployed properly:

➜ kubectl logs training-operator-55b94b64f6-l2hzj -n kubeflow

I0714 05:41:03.499300       1 request.go:682] Waited for 1.041055501s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
1.6893132670509896e+09  INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": ":8080"}
I0714 05:41:14.295944       1 request.go:682] Waited for 1.048299708s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/registry.apicur.io/v1?timeout=32s
I0714 05:41:24.296220       1 request.go:682] Waited for 1.898473292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/admissionregistration.k8s.io/v1?timeout=32s
I0714 05:41:34.345574       1 request.go:682] Waited for 2.797829418s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/kafka.strimzi.io/v1beta1?timeout=32s
I0714 05:41:44.795474       1 request.go:682] Waited for 1.048790793s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
I0714 05:41:54.845438       1 request.go:682] Waited for 1.945571376s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1alpha1?timeout=32s
I0714 05:42:04.846740       1 request.go:682] Waited for 2.798720251s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/apm.k8s.elastic.co/v1beta1?timeout=32s
I0714 05:42:15.295114       1 request.go:682] Waited for 1.047853292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/node.k8s.io/v1?timeout=32s
1.689313340247459e+09   INFO    setup   starting manager
1.6893133402522147e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6893133402523057e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6893133402535763e+09  INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
1.6893133402535298e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.6893133402539213e+09  INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
1.6893133402539287e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402534788e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.689313340253958e+09   INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402535348e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6893133402540202e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402534842e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6893133402540386e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402540817e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6893133402540936e+09  INFO    Starting Controller {"controller": "pytorchjob-controller"}
1.6893133402541058e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.689313340254115e+09   INFO    Starting Controller {"controller": "tfjob-controller"}
1.6893133402534952e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6893133402541409e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.6893133402541444e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.6893133402542117e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6893133402542229e+09  INFO    Starting Controller {"controller": "mxjob-controller"}
1.6893133402542326e+09  INFO    Starting EventSource    {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
1.6893133402542348e+09  INFO    Starting Controller {"controller": "paddlejob-controller"}
1.6893133402542171e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6893133402542467e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.6893133402542505e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6893133402542531e+09  INFO    Starting Controller {"controller": "mpijob-controller"}
1.689313340254083e+09   INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6893133402546306e+09  INFO    Starting Controller {"controller": "xgboostjob-controller"}
1.6893133403579748e+09  INFO    Starting workers    {"controller": "paddlejob-controller", "worker count": 1}
1.6893133403599951e+09  INFO    Starting workers    {"controller": "xgboostjob-controller", "worker count": 1}
1.6893133403601058e+09  INFO    Starting workers    {"controller": "pytorchjob-controller", "worker count": 1}
1.6893133403601074e+09  INFO    Starting workers    {"controller": "mxjob-controller", "worker count": 1}
1.6893133403601336e+09  INFO    Starting workers    {"controller": "tfjob-controller", "worker count": 1}
1.6893133403601432e+09  INFO    Starting workers    {"controller": "mpijob-controller", "worker count": 1}
johnugeorge commented 1 year ago

Closing as the issue is resolved.