thorweijie opened 1 month ago
Hi @thorweijie!
After all the inference services were restarted, we noticed that the istio-proxy containers in the activator pods had high CPU usage and health checks were failing with response code 0.
Which health checks were failing? The activator ones?
We noticed that even though the activator was being bypassed, the activator pods kept performing health checks that failed with response code 0 until they were restarted. We would like to know whether the activator's health checks are cached, and whether their frequency can be configured.
Probing starts whenever endpoints are created or updated, and runs at a default interval of 200ms. If probing finishes successfully, you should see a message like this one (assuming activator debug logging is enabled):
{"severity":"DEBUG","timestamp":"2024-10-23T14:20:52.082125337Z","logger":"activator","caller":"net/revision_backends.go:348","message":"Done probing, got 1 healthy pods","commit":"0abee66","knative.dev/controller":"activator","knative.dev/pod":"activator-8675c9944c-mdfj9","knative.dev/key":"default/autoscale-go-00001"}
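To check which revisions have finished probing, one option is to scan the activator's debug output for the message above. The following is a minimal Go sketch, not part of Knative: `parseDoneProbing` and `activatorLog` are hypothetical names, and it assumes one JSON object per log line carrying the `message` and `knative.dev/key` fields seen in the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// activatorLog keeps only the fields of one activator JSON log line we need.
// (Hypothetical type; field names are taken from the example log line.)
type activatorLog struct {
	Message string `json:"message"`
	Key     string `json:"knative.dev/key"`
}

// doneProbing matches the debug message emitted when probing completes.
var doneProbing = regexp.MustCompile(`^Done probing, got (\d+) healthy pods$`)

// parseDoneProbing returns the revision key and healthy-pod count if line is a
// "Done probing" message; ok is false for any other line, including non-JSON.
func parseDoneProbing(line string) (key string, healthy int, ok bool) {
	var entry activatorLog
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return "", 0, false
	}
	m := doneProbing.FindStringSubmatch(entry.Message)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return "", 0, false
	}
	return entry.Key, n, true
}

func main() {
	line := `{"severity":"DEBUG","timestamp":"2024-10-23T14:20:52.082125337Z","logger":"activator","caller":"net/revision_backends.go:348","message":"Done probing, got 1 healthy pods","commit":"0abee66","knative.dev/controller":"activator","knative.dev/pod":"activator-8675c9944c-mdfj9","knative.dev/key":"default/autoscale-go-00001"}`
	if key, n, ok := parseDoneProbing(line); ok {
		// prints "default/autoscale-go-00001: probing done, 1 healthy pod(s)"
		fmt.Printf("%s: probing done, %d healthy pod(s)\n", key, n)
	}
}
```

Revisions whose key never appears in such a message may still be in the probing loop, though this sketch only reads logs and cannot confirm that on its own.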
Once all pods are ready (and stay ready), probing should stop. The idea is that the activator stays on standby to handle traffic, so each activator instance needs to know the ready targets in order to route traffic to them if needed. AFAIK there is no caching. Maybe @ReToCode or @dprotaso have more to say here.
We have a Kubernetes cluster with many inference services. After all the inference services were restarted, we noticed that the istio-proxy containers in the activator pods had high CPU usage and health checks were failing with response code 0, so we set the target burst capacity to 0 to bypass the activator and fix the issue. We noticed that even though the activator was being bypassed, the activator pods kept performing health checks that failed with response code 0 until they were restarted. We would like to know whether the activator's health checks are cached, and whether their frequency can be configured.