dougbtv opened 6 months ago
One of the possible symptoms of not supplying your own timeout for a request is below; note the time and durations:
```
Warning FailedCreatePodSandBox 3m13s (x13677 over 3d17h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_samplepod-bridge-26tpw_default_17cdc8d1-8fd7-4e92-8740-5bc89f7ec65f_0(34f52fc40f083cd7ddf77bf04005c0a6f815e967b073f2125441d62dc37a8c33): error adding pod default_samplepod-bridge-26tpw to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:34f52fc40f083cd7ddf77bf04005c0a6f815e967b073f2125441d62dc37a8c33 Netns:/var/run/netns/ce3f83ba-b99e-41bc-ad37-5925193be0c3 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=samplepod-bridge-26tpw;K8S_POD_INFRA_CONTAINER_ID=34f52fc40f083cd7ddf77bf04005c0a6f815e967b073f2125441d62dc37a8c33;K8S_POD_UID=17cdc8d1-8fd7-4e92-8740-5bc89f7ec65f Path:
<snip>
ContainerID:"34f52fc40f083cd7ddf77bf04005c0a6f815e967b073f2125441d62dc37a8c33" Netns:"/var/run/netns/ce3f83ba-b99e-41bc-ad37-5925193be0c3" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=samplepod-bridge-26tpw;K8S_POD_INFRA_CONTAINER_ID=34f52fc40f083cd7ddf77bf04005c0a6f815e967b073f2125441d62dc37a8c33;K8S_POD_UID=17cdc8d1-8fd7-4e92-8740-5bc89f7ec65f" Path:"" ERRORED: error configuring pod [default/samplepod-bridge-26tpw] networking: [default/samplepod-bridge-26tpw/17cdc8d1-8fd7-4e92-8740-5bc89f7ec65f:bridge-whereabouts-10-2]: error adding container to network "bridge-whereabouts-10-2": error at storage engine: k8s get error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
```
Oddly, the lack of rate limiting leads to overall slower performance. I tried launching 2,500 pods with Whereabouts IP addresses:
QPS / Burst | P99 latency | Backoff encountered | Total runtime
---|---|---|---
20/20 | 699 sec | Yes | 20 min
10/10 | 509 sec | Yes | 16 min
5/5 | 483 sec | Yes | 20 min
1/1 | 3 sec | No | 45 min
~~An update to the above table is coming~~ Updated on Feb 8. I'm assuming the maxDelay option is the cause, though I'm not sure it is. The slowness might need to be addressed as a bug rather than an enhancement, depending on how many pods need to come back up.
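For anyone reproducing the table above: the knobs under test are the client-side limits on client-go's `rest.Config` (defaults are QPS=5, Burst=10). A minimal sketch of where they're set — the kubeconfig path and parameter values are illustrative:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newClient(qps float32, burst int) (*kubernetes.Clientset, error) {
	// Hypothetical kubeconfig path; in-cluster config works the same way.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kubeconfig")
	if err != nil {
		return nil, err
	}

	// The "QPS / Burst" column in the table corresponds to these two fields.
	// When the limiter is saturated, requests queue until the context deadline
	// expires, producing the "client rate limiter Wait returned an error" above.
	cfg.QPS = qps
	cfg.Burst = burst

	return kubernetes.NewForConfig(cfg)
}
```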
Right now this issue makes the project extremely limiting for our cluster... we need to spawn about 10,000 pods across 60 or so k8s workers (currently prototyping with 4,000 pods across 24), and the spawn rate is about 0.5 to 1 pods per second. A fix for this issue would be fantastic for our use case! I'm trying to figure out how to implement it myself :)
See: https://danielmangum.com/posts/controller-runtime-client-go-rate-limiting/#the-default-controller-rate-limiter
This could improve how we're rate limited by the API server at scale.
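The linked post describes the default controller rate limiter as the max of a per-item exponential backoff (5 ms base, capped at 1000 s) and a 10 qps / 100 burst token bucket. One shape the fix could take is building the same limiter with a lower cap so failing items aren't parked for minutes — a hedged sketch, where the 5-second maxDelay is just an example value:

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// customRateLimiter mirrors workqueue.DefaultControllerRateLimiter(), but
// lowers maxDelay so repeatedly failing items are retried after at most 5s
// instead of the default 1000s cap.
func customRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff: 5ms base, capped at 5s (default cap: 1000s).
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 5*time.Second),
		// Overall token bucket: 10 qps with a burst of 100, same as the default.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

func main() {
	// A workqueue driven by the custom limiter, as a controller would use it.
	q := workqueue.NewRateLimitingQueue(customRateLimiter())
	defer q.ShutDown()
	q.Add("some-work-item")
}
```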