knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

[Intermittent - RHEL cluster] Knative activator pod is restarting continuously in CrashLoopBackOff with liveness and readiness probe failures #15171

Closed Subhankar-Adak closed 1 month ago

Subhankar-Adak commented 6 months ago

What version of Knative?

v1.11.0


Output of `git describe --dirty`

Expected Behavior

As part of the KServe deployment, we deploy Istio, Cert Manager, and Knative as dependencies. Intermittently, the Knative deployment step fails: the Knative activator pod does not come up properly and repeatedly enters CrashLoopBackOff, while the other pods in the Knative namespace run normally.

Versions of Dependencies:

Environment Details:

Activator pod description log:

Knative activator:

  Exit Code:    0
  Started:      Thu, 25 Apr 2024 12:43:47 +0000
  Finished:     Thu, 25 Apr 2024 12:46:50 +0000
Ready:          False
Restart Count:  4
Limits:
  cpu:     1
  memory:  600Mi
Requests:
  cpu:      300m
  memory:   60Mi
Liveness:   http-get http://:8012/ delay=15s timeout=1s period=10s #success=1 #failure=12
Readiness:  http-get http://:8012/ delay=0s timeout=1s period=5s #success=1 #failure=5
Environment:
  GOGC:                       500
  POD_NAME:                   activator-59dff6d45c-wqt8w (v1:metadata.name)
  POD_IP:                      (v1:status.podIP)
  SYSTEM_NAMESPACE:           knative-serving (v1:metadata.namespace)
  CONFIG_LOGGING_NAME:        config-logging
  CONFIG_OBSERVABILITY_NAME:  config-observability
  METRICS_DOMAIN:             knative.dev/internal/serving
Mounts:
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8nkbz (ro)

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-8nkbz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  Normal   Scheduled  12m                    default-scheduler  Successfully assigned knative-serving/activator-59dff6d45c-wqt8w to v16regressionnode00002
  Normal   Pulled     12m                    kubelet            Container image "gcr.io/knative-releases/knative.dev/serving/cmd/activator@sha256:6b98eed95dd6dcc3d957e673aea3d271b768225442504316d713c08524f44ebe" already present on machine
  Normal   Created    12m                    kubelet            Created container activator
  Normal   Started    12m                    kubelet            Started container activator
  Warning  Unhealthy  11m (x5 over 12m)      kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  2m20s (x137 over 12m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Activator pod logs:

[root@v16regressionnode00003 ~]# kubectl logs activator-7bcc758ddd-wk7cd -n knative-serving
2024/04/25 11:22:05 Registering 2 clients
2024/04/25 11:22:05 Registering 3 informer factories
2024/04/25 11:22:05 Registering 4 informers
{"severity":"INFO","timestamp":"2024-04-25T11:22:05.716400581Z","logger":"activator","caller":"activator/main.go:140","message":"Starting the knative activator","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd"}
{"severity":"INFO","timestamp":"2024-04-25T11:22:05.718542578Z","logger":"activator","caller":"activator/main.go:200","message":"Connecting to Autoscaler at ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd"}
{"severity":"INFO","timestamp":"2024-04-25T11:22:05.718768882Z","logger":"activator","caller":"websocket/connection.go:161","message":"Connecting to ws://autoscaler.knative-serving.svc.cluster.local:8080","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd"}
{"severity":"INFO","timestamp":"2024-04-25T11:22:05.719123778Z","logger":"activator","caller":"profiling/server.go:65","message":"Profiling enabled: false","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd"}
{"severity":"INFO","timestamp":"2024-04-25T11:22:05.7237912Z","logger":"activator","caller":"activator/request_log.go:45","message":"Updated the request log template.","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd","template":""}
{"severity":"WARNING","timestamp":"2024-04-25T11:22:06.685484891Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd"}
{"severity":"WARNING","timestamp":"2024-04-25T11:22:07.686801714Z","logger":"activator","caller":"handler/healthz_handler.go:36","message":"Healthcheck failed: connection has not yet been established","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd"}
{"severity":"ERROR","timestamp":"2024-04-25T11:22:08.719008181Z","logger":"activator","caller":"websocket/connection.go:144","message":"Websocket connection could not be established","commit":"f1617ef","knative.dev/controller":"activator","knative.dev/pod":"activator-7bcc758ddd-wk7cd","error":"dial tcp: lookup autoscaler.knative-serving.svc.cluster.local: i/o timeout","stacktrace":"knative.dev/pkg/websocket.NewDurableConnection.func1\n\tknative.dev/pkg@v0.0.0-20230718152110-aef227e72ead/websocket/connection.go:144\nknative.dev/pkg/websocket.(ManagedConnection).connect.func1\n\tknative.dev/pkg@v0.0.0-20230718152110-aef227e72ead/websocket/connection.go:225\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\tk8s.io/apimachinery@v0.26.5/pkg/util/wait/wait.go:222\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\tk8s.io/apimachinery@v0.26.5/pkg/util/wait/wait.go:235\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\tk8s.io/apimachinery@v0.26.5/pkg/util/wait/wait.go:228\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\tk8s.io/apimachinery@v0.26.5/pkg/util/wait/wait.go:423\nknative.dev/pkg/websocket.(ManagedConnection).connect\n\tknative.dev/pkg@v0.0.0-20230718152110-aef227e72ead/websocket/connection.go:222\nknative.dev/pkg/websocket.NewDurableConnection.func2\n\tknative.dev/pkg@v0.0.0-20230718152110-aef227e72ead/websocket/connection.go:162"}
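The probe failures trace back to the last log entry above: the DNS lookup for autoscaler.knative-serving.svc.cluster.local times out, so the activator never establishes its autoscaler websocket connection and its health endpoint keeps returning 500. A sketch of commands to check whether in-cluster DNS is the culprit (pod and service names are taken from the logs above; the busybox debug-pod approach is an assumption, any image with `nslookup` works):

```shell
# Try resolving the autoscaler Service from inside the cluster
# using a throwaway debug pod.
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup autoscaler.knative-serving.svc.cluster.local

# Confirm the autoscaler Service exists and has ready endpoints.
kubectl get svc,endpoints autoscaler -n knative-serving

# An intermittent lookup i/o timeout often points at the cluster
# DNS pods (CoreDNS) rather than at Knative itself.
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```

These commands require access to the affected cluster; if the nslookup from the debug pod also times out intermittently, the problem is cluster DNS rather than the activator.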

Actual Behavior

Steps to Reproduce the Problem

  1. Deploy Kubernetes v1.26.12 using Kubespray on RHEL 8.8 cluster.
  2. Deploy Istio v1.17.0.
  3. Deploy Cert Manager v1.13.
  4. Deploy Knative using the following steps:
skonto commented 6 months ago

Hi @Subhankar-Adak, could you provide the status of the pods in the knative-serving namespace? From your logs it seems the activator cannot connect to the autoscaler pod and therefore fails its health checks. Could you also provide the logs of the autoscaler pod?
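The information requested above can be collected with standard kubectl commands (a sketch; the namespace and deployment names are taken from the issue):

```shell
# Status of all pods in the knative-serving namespace.
kubectl get pods -n knative-serving -o wide

# Current autoscaler logs, plus the previous container's logs
# in case it also restarted.
kubectl logs -n knative-serving deployment/autoscaler
kubectl logs -n knative-serving deployment/autoscaler --previous
```

The `--previous` invocation fails harmlessly if the autoscaler container has not restarted.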

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with `/reopen`. Mark the issue as fresh by adding the comment `/remove-lifecycle stale`.