Closed: srinandan closed this issue 1 year ago.
I have the same problem.
@srinandan We haven't seen this in our testing. Is this something specific to your GKE setup? We test continuously in our CI and haven't hit any issues. Can you allocate much higher resources and try again?
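For example, a one-off patch along these lines would bump the requests (the cpu/memory values here are illustrative, not tested recommendations):
$ kubectl -n kubeflow patch deployment training-operator --type=json \
    -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources","value":{"requests":{"cpu":"2","memory":"4Gi"}}}]'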
@ohtaro, which K8s distribution are you using?
@srinandan I suspect that the kube-apiserver limits the number of API calls from the namespace in which the training operator is deployed, depending on the API Priority and Fairness (APF) configuration.
This limitation is not specific to the training operator, so you should investigate which component in your cluster is sending so many requests to the kube-apiserver.
In the future, the training operator should provide a way to specify a priority for its kube-client.
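For reference, the server-side APF priority can already be raised with a FlowSchema. Here is a minimal sketch, assuming the operator runs as the training-operator service account in the kubeflow namespace; note that the client-side throttling shown in the logs below is a client-go rate limit and would not be affected by this:
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2  # v1beta2 is the version served on k8s 1.24/1.25
kind: FlowSchema
metadata:
  name: training-operator-high  # hypothetical name
spec:
  priorityLevelConfiguration:
    name: workload-high  # one of the built-in priority levels
  matchingPrecedence: 500
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: training-operator  # assumed service account name
        namespace: kubeflow
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]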
> @johnugeorge: @ohtaro, which K8s distribution are you using?
I'm using a 1.24.12-gke.1000 GKE cluster with Kubeflow 1.7.0.
Looks like it is related to the GKE environment. Can you check the deployment events to see if it is OOM? Is there anything else noteworthy about the GKE setup?
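A quick way to check for an OOM kill, with <training-operator-pod> standing in for the actual pod name:
$ kubectl -n kubeflow get pod <training-operator-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
This prints OOMKilled if the last restart was due to memory.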
Looking at the logs and details, OOM did not appear to be the cause.
Resources:
$ kubectl get deploy training-operator -o jsonpath='{.spec.template.spec.containers[].resources}'
{"requests":{"cpu":"1","memory":"2Gi"}}
Logs:
$ kubectl logs -f training-operator-c5568476f-pklrg
I0628 07:32:05.061752 1 request.go:682] Waited for 1.035932043s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/autoscaling/v2beta1?timeout=32s
1.6879375290677645e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
I0628 07:32:15.111836 1 request.go:682] Waited for 5.945437309s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/serving.kserve.io/v1alpha1?timeout=32s
I0628 07:32:25.160946 1 request.go:682] Waited for 2.893194924s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/servicenetworking.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:32:35.161462 1 request.go:682] Waited for 6.341080517s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/cilium.io/v2?timeout=32s
I0628 07:32:45.161650 1 request.go:682] Waited for 3.243134713s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/extensions.istio.io/v1alpha1?timeout=32s
I0628 07:32:56.060789 1 request.go:682] Waited for 1.046218439s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/rbac.authorization.k8s.io/v1?timeout=32s
Describe:
$ kubectl describe pod training-operator-c5568476f-pklrg
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 32m default-scheduler Successfully assigned kubeflow/training-operator-c5568476f-pklrg to gke-shared-test-clus-shared-test-e2-s-ee06b6d8-7gk9
Normal Created 31m (x2 over 32m) kubelet Created container training-operator
Normal Started 31m (x2 over 32m) kubelet Started container training-operator
Warning Unhealthy 31m kubelet Readiness probe failed: Get "http://10.2.0.151:8081/readyz": read tcp 10.2.0.129:41126->10.2.0.151:8081: read: connection reset by peer
Normal Pulled 30m (x3 over 32m) kubelet Container image "kubeflow/training-operator:v1-d0fb5c0" already present on machine
Warning Unhealthy 30m (x6 over 31m) kubelet Liveness probe failed: Get "http://10.2.0.151:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Normal Killing 30m (x2 over 31m) kubelet Container training-operator failed liveness probe, will be restarted
Warning Unhealthy 30m kubelet Readiness probe failed: Get "http://10.2.0.151:8081/readyz": read tcp 10.2.0.129:55492->10.2.0.151:8081: read: connection reset by peer
Warning BackOff 7m23s (x62 over 26m) kubelet Back-off restarting failed container
Warning Unhealthy 2m16s (x45 over 32m) kubelet Readiness probe failed: Get "http://10.2.0.151:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
When I turned off the liveness and readiness probes, it started successfully.
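One way to remove them temporarily, as a debugging step rather than a permanent fix, is a JSON patch along these lines:
$ kubectl -n kubeflow patch deployment training-operator --type=json \
    -p='[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"},{"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"}]'
With the probes removed, the startup logs looked like this: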
$ kubectl logs -f training-operator-7df886f775-fbcz8
I0628 07:46:57.232885 1 request.go:682] Waited for 1.034094333s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/sql.cnrm.cloud.google.com/v1beta1?timeout=32s
1.6879384212421863e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
I0628 07:47:07.233006 1 request.go:682] Waited for 5.892111591s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/memcache.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:47:17.233349 1 request.go:682] Waited for 2.795872533s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/dataproc.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:47:27.283351 1 request.go:682] Waited for 6.295453954s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/apiregistration.k8s.io/v1?timeout=32s
I0628 07:47:37.283418 1 request.go:682] Waited for 3.195678051s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/eventing.knative.dev/v1?timeout=32s
I0628 07:47:48.233334 1 request.go:682] Waited for 1.046508935s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/logging.cnrm.cloud.google.com/v1beta1?timeout=32s
I0628 07:47:58.283216 1 request.go:682] Waited for 4.545396598s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/networking.istio.io/v1beta1?timeout=32s
I0628 07:48:08.283348 1 request.go:682] Waited for 1.446039772s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/node.k8s.io/v1beta1?timeout=32s
I0628 07:48:18.332904 1 request.go:682] Waited for 4.945341862s due to client-side throttling, not priority and fairness, request: GET:https://10.0.16.1:443/apis/gkehub.cnrm.cloud.google.com/v1beta1?timeout=32s
1.6879384998425412e+09 INFO setup starting manager
1.6879384998429692e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6879384998430748e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6879384998433008e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6879384998433428e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6879384998433518e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.687938499843358e+09 INFO Starting Controller {"controller": "xgboostjob-controller"}
...
You can set higher timeout values for the readiness and liveness checks in the deployment manifests.
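For example, a patch along these lines (the timeout and delay values are illustrative and should be tuned per cluster):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
      - name: training-operator
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8081
          initialDelaySeconds: 120  # wait out the startup discovery throttling
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8081
          initialDelaySeconds: 120
          timeoutSeconds: 10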
Thank you!!
I added a startupProbe with a kustomize patch as follows.
example/patch-deploy-training-operator.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
      - name: training-operator
        startupProbe:
          failureThreshold: 60
          periodSeconds: 5
          httpGet:
            path: /healthz
            port: 8081
            scheme: HTTP
example/kustomization.yaml
...
patches:
- path: patch-deploy-training-operator.yaml
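With that layout, the patch is applied with:
$ kubectl apply -k example/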
FWIW, for me the problem eventually goes away without any changes. I have repeated the install a few times since, and each time it eventually resolved itself.
It looks like the timeout values are right on the edge in your case.
I had the same issue with the Kubeflow Training Operator in a local k3s Kubernetes cluster created by Rancher Desktop.
Thanks for @johnugeorge's and @ohtaro's answers! 😃 Just summarizing my final working version here to save some time for anyone who hits the same issue!
$ kubectl apply --kustomize=kubeflow-training-operator
kubeflow-training-operator/kustomization.yaml
resources:
- github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0
patches:
- path: training-operator-deployment-patch.yaml
kubeflow-training-operator/training-operator-deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
      - name: training-operator
        startupProbe:
          httpGet:
            path: /healthz
            port: 8081
          failureThreshold: 30
Then it got deployed properly:
$ kubectl logs training-operator-55b94b64f6-l2hzj -n kubeflow
I0714 05:41:03.499300 1 request.go:682] Waited for 1.041055501s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
1.6893132670509896e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
I0714 05:41:14.295944 1 request.go:682] Waited for 1.048299708s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/registry.apicur.io/v1?timeout=32s
I0714 05:41:24.296220 1 request.go:682] Waited for 1.898473292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/admissionregistration.k8s.io/v1?timeout=32s
I0714 05:41:34.345574 1 request.go:682] Waited for 2.797829418s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/kafka.strimzi.io/v1beta1?timeout=32s
I0714 05:41:44.795474 1 request.go:682] Waited for 1.048790793s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
I0714 05:41:54.845438 1 request.go:682] Waited for 1.945571376s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1alpha1?timeout=32s
I0714 05:42:04.846740 1 request.go:682] Waited for 2.798720251s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/apm.k8s.elastic.co/v1beta1?timeout=32s
I0714 05:42:15.295114 1 request.go:682] Waited for 1.047853292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/node.k8s.io/v1?timeout=32s
1.689313340247459e+09 INFO setup starting manager
1.6893133402522147e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6893133402523057e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6893133402535763e+09 INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
1.6893133402535298e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.6893133402539213e+09 INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
1.6893133402539287e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402534788e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.689313340253958e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402535348e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6893133402540202e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402534842e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6893133402540386e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402540817e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6893133402540936e+09 INFO Starting Controller {"controller": "pytorchjob-controller"}
1.6893133402541058e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.689313340254115e+09 INFO Starting Controller {"controller": "tfjob-controller"}
1.6893133402534952e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6893133402541409e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.6893133402541444e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.6893133402542117e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6893133402542229e+09 INFO Starting Controller {"controller": "mxjob-controller"}
1.6893133402542326e+09 INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
1.6893133402542348e+09 INFO Starting Controller {"controller": "paddlejob-controller"}
1.6893133402542171e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6893133402542467e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.6893133402542505e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6893133402542531e+09 INFO Starting Controller {"controller": "mpijob-controller"}
1.689313340254083e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6893133402546306e+09 INFO Starting Controller {"controller": "xgboostjob-controller"}
1.6893133403579748e+09 INFO Starting workers {"controller": "paddlejob-controller", "worker count": 1}
1.6893133403599951e+09 INFO Starting workers {"controller": "xgboostjob-controller", "worker count": 1}
1.6893133403601058e+09 INFO Starting workers {"controller": "pytorchjob-controller", "worker count": 1}
1.6893133403601074e+09 INFO Starting workers {"controller": "mxjob-controller", "worker count": 1}
1.6893133403601336e+09 INFO Starting workers {"controller": "tfjob-controller", "worker count": 1}
1.6893133403601432e+09 INFO Starting workers {"controller": "mpijob-controller", "worker count": 1}
Closing, as the issue is resolved.
I have reviewed the issue here (#1717) and it did not help.
WHAT DID YOU DO:
Deployed Kubeflow 1.7.0 to a 1.25.8-gke.1000 GKE cluster.
EXPECTED:
The training operator runs without failure.
ACTUAL:
The training operator constantly restarts with CrashLoopBackOff.
Logs from the container:
I also tried increasing the resources (to 1 GB of memory and 1 CPU), but it did not help.