teamhephy / router

MIT License
4 stars 10 forks source link

Throttling on Kubernetes calls during configuration reload #59

Closed giannoul closed 3 years ago

giannoul commented 3 years ago

There are cases where the router config reload takes over a minute due to the fact that its requests are being throttled by k8s.

The router uses the cfg, err := rest.InClusterConfig(). As per InClusterConfig:

InClusterConfig returns a config object which uses the service account kubernetes gives to pods. It's intended for clients that expect to be running inside a pod running on kubernetes. It will return ErrNotInCluster if called from a process not running in a kubernetes environment.

The Config returned by the InClusterConfig has the default values for QPS and Burst (they are both equal to 0):

// QPS indicates the maximum QPS to the master from this client. // If it's zero, the created RESTClient will use DefaultQPS: 5 QPS float32

// Maximum burst for throttle. // If it's zero, the created RESTClient will use DefaultBurst: 10. Burst int

The above means that when router tries to fetch the certificates it will be throttled. In the case of a big number of certificates this leads to a significant delay during configuration reloads. In a specific case with ~200 certificates the delay is almost 120 seconds due to the throttling. The situation gets worse when multiple hephy routers run as part of the deployment, due to the fact that they use the same Service Account, so they all try to make requests to the control plane.

Cryptophobia commented 3 years ago

So the important thing here is to increase cfg.QPS so we can handle higher load without hitting the throttling limits. Burst is not as important until we run many routers, which makes cfg.Burst important. Is this the reason why you prefer to set cfg.Burst set to 0 when default in #60 ?

If it's zero, the created RESTClient will use DefaultBurst: 10.

giannoul commented 3 years ago

The main idea behind the issue and fix in #60 is to be able to set the cfg.QPS and cfg.Burst using environment variables. In many cases the default values of 0 for both of them (resulting to cfg.QPS = 5 and cfg.Burst = 10 as per the docs) will do just fine, but in some edge cases like e.g. many certificates they will lead to significant delays due to throttling.

The implementation in #60 will read the environment variables for cfg.QPS (named RATE_LIMIT_QPS) and cfg.Burst (named RATE_LIMIT_BURST) and do the appropriate settings. Any of cfg.QPS and cfg.Burst will default to 0 if the relevant environment variable is not set. This was done in order to ensure that the changes will take effect only if we decide to set the environment variable.

Cryptophobia commented 3 years ago

@giannoul , should we add some helm chart values in the values.yaml to be set on the deployment object? If they are not passed in via values, then we can default to the safe defaults of "" before we close this issue out.

https://github.com/teamhephy/router/blob/master/charts/router/values.yaml https://github.com/teamhephy/router/blob/master/charts/router/templates/router-deployment.yaml#L58

giannoul commented 3 years ago

@Cryptophobia I have created the PR https://github.com/teamhephy/router/pull/61 that should handle the above.

The logic is that we can set for example the values:

rate_limit:
   qps: "50.0"
   burst: "50"

in values.yaml and they will populate the deployment's environment variables. Since a non existing environment variable is set to "" by default by os.Getenv it will prevent the setting of relevant configuration variable, leading to the usage of the default values (the default will be 0).

The reason I chose qps and burst to be fields of the same basic variable rate_limit is that they are connected in the way that someone will set them. For example, if we only set qps the configuration will fail with message:

Failed to create client: Burst is required to be greater than 0 when RateLimiter is not set and QPS is set to greater than 0.

So, someone should keep in mind that those variables are related.

Cryptophobia commented 3 years ago

Great idea to link the settings under the same basic variable rate_limit. I agree that if they need to be set together, this is best practice!