Describe the bug
oauth-proxy's CPU limit is far too low to answer liveness probes in a timely fashion once a certain number of routes have been created.
To Reproduce
Steps to reproduce the behavior:
Deploy about 80-120 inference models with routes (a sketch of such a deployment loop follows below).
A config that ALWAYS reproduces the problem: 6 namespaces, each with 2 model mesh pods, and 800 inference models per namespace:
MINIO_MODEL_COUNT=800
NS_COUNT=6
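For reference, a minimal reproduction sketch along these lines; the namespace names, model format, and storageUri are hypothetical placeholders rather than the exact values from the original setup, and it assumes KServe InferenceServices deployed in ModelMesh mode:

#!/bin/bash
# Hypothetical reproduction sketch -- NOT the exact script from this report.
NS_COUNT=6
MINIO_MODEL_COUNT=800
for ns in $(seq 1 "$NS_COUNT"); do
  # Placeholder namespace naming scheme
  oc new-project "modelmesh-scale-$ns" 2>/dev/null || true
  for i in $(seq 1 "$MINIO_MODEL_COUNT"); do
    oc apply -n "modelmesh-scale-$ns" -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-$i
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                      # placeholder model format
      storageUri: s3://models/model-$i     # hypothetical MinIO path
EOF
  done
done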
Expected behavior
All model mesh pods deployed, all inference service routes created, everything ready to serve inference requests.
Actual behavior
oauth liveness probes are missed:
Warning Unhealthy 5m50s (x664 over 23h) kubelet Liveness probe failed: Get "https://10.131.0.35:8443/oauth/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
In fact, all model mesh instances (pods) are unstable, due to the oauth-proxy container failing its liveness probes.
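The instability is easy to confirm with standard oc commands (the namespace and pod names below are placeholders matching the sketch above):

oc get pods -n modelmesh-scale-1                  # RESTARTS count keeps climbing
oc get events -n modelmesh-scale-1 --field-selector reason=Unhealthy
oc describe pod <modelmesh-serving-pod> -n modelmesh-scale-1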
Environment (please complete the following information):
ODH
Additional context
The CPU limit is probably set far too low (100m) for an SSL-terminated endpoint. Since flooding the endpoint might be enough to cause system instability and therefore a DoS, setting a much higher limit (or no CPU limit at all) would probably be better.
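As an illustration only, this is the kind of override being suggested. The deployment and container names here are assumptions, and since the oauth-proxy sidecar is injected by the operator, a manual change may be reconciled away; the real fix belongs in the controller's template:

# Raise the oauth-proxy CPU limit on one ModelMesh deployment (names assumed).
oc set resources deployment/modelmesh-serving-mlserver-1.x \
  -n modelmesh-scale-1 \
  --containers=oauth-proxy \
  --requests=cpu=100m --limits=cpu=1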