opendatahub-io / modelmesh-serving

Controller for ModelMesh

oauth-proxy's CPU limit is far too low #62

Open fcami opened 1 year ago

fcami commented 1 year ago

Describe the bug

oauth-proxy's CPU limit is far too low to answer liveness probes in a timely fashion once a certain number of routes have been created.

To Reproduce

Steps to reproduce the behavior:

Deploy about 80-120 inference models with routes. A config that ALWAYS reproduces the problem: 6 namespaces, each with 2 model mesh pods, and 800 inference models per namespace (MINIO_MODEL_COUNT=800, NS_COUNT=6).

Expected behavior

All model mesh pods deployed, all inference service routes created, everything ready to serve inference requests.

Actual behavior

oauth liveness probes are missed:

  Warning  Unhealthy  5m50s (x664 over 23h)  kubelet  Liveness probe failed: Get "https://10.131.0.35:8443/oauth/healthz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Leading to:

  Warning  BackOff    10m (x5667 over 22h)   kubelet  Back-off restarting failed container

And obviously:

modelmesh-serving-ovms-1.x-5bbbf88fdf-spxlw   4/5     CrashLoopBackOff   490 (3m27s ago)   23h

In fact, all model mesh instances (pods) are unstable, due to the oauth-proxy container failing its liveness probes.

Environment (please complete the following information):

ODH

Additional context

The CPU limit (100m) is probably set far too low for an SSL-terminating endpoint. Since flooding the endpoint might be enough to cause system instability and therefore a DoS, setting a much higher limit (or no CPU limit at all) would be better.
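For illustration only, a more generous resources stanza for the injected oauth-proxy sidecar might look like the sketch below. Where this stanza actually lives depends on how the controller injects the sidecar, and the request/limit values here are assumptions for discussion, not the project's defaults.

  # Hypothetical sketch: raised CPU limit for the oauth-proxy sidecar container.
  # Values are illustrative, not the project's actual defaults.
  containers:
    - name: oauth-proxy
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
        limits:
          cpu: "1"          # or omit the CPU limit entirely, as suggested above
          memory: 128Mi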

heyselbi commented 11 months ago

@fcami is this still an issue? What's the reason for a higher limit for oauth-proxy?

fcami commented 11 months ago

The reason is explained in the original post. I cannot test anymore, so please do what you will.