Open gojoy opened 1 week ago
Are you running Typha? The allocate-tunnel-addrs program will use Typha if it's enabled; Typha is our fan-out proxy, and using it removes the load from the API server.
Yeah, agreed that Typha is likely the solution here. If there's a reason this is causing load even with Typha, that is most likely a bug in the allocateip.go code - I don't think we need or should have an env var to control whether this code is enabled, assuming the code is written efficiently.
Some users install Calico without Typha, so we still need the env var until we figure out the potential bug in allocateip.go. What do you think, @caseydavenport?
@cyclinder if there truly is a bug in allocateip.go that is causing excessive resource usage, I'd rather we just fix that - that way, everyone benefits and not just those who discover this new arcane environment variable.
In our scenario, the Kubernetes cluster uses Calico's BGP mode exclusively, so we want to stop starting the calico-node -allocate-tunnel-addrs process in order to reduce the pressure Calico puts on the backend.
@gojoy do you have evidence that allocate-tunnel-addrs was causing load in your cluster? If so, could you provide that evidence so we can use it to diagnose what might be going wrong?
Yes, we've noticed that each time reconcileTunnelAddrs runs, the backend service is called multiple times. When a large number of new nodes join the cluster, this increases the load on the backend. With tunnels enabled this is normal behaviour, but the key point is that when the cluster administrator knows tunnels will never be used in the cluster, these tunnel reconciles have no effect, so the question is whether there could be a variable to stop this process. That wouldn't necessarily conflict with enabling Typha.
One more point to add: when using the Kubernetes API Datastore, allocateip.go queries the Kubernetes apiserver with resourceVersion="", which means the apiserver's watch cache is not used and every request adds overhead.
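To make the point concrete, here's a minimal client-go sketch (not Calico's actual libcalico-go client, and the node name is made up): a Get with resourceVersion="" forces a quorum read through to etcd on every call, while resourceVersion="0" lets the apiserver answer from its watch cache.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// resourceVersion "" (the default): quorum read, goes through to etcd
	// on every call.
	fresh, err := cs.CoreV1().Nodes().Get(context.TODO(), "some-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// resourceVersion "0": may be served from the apiserver's watch cache,
	// avoiding the etcd round trip (at the cost of possibly stale data).
	cached, err := cs.CoreV1().Nodes().Get(context.TODO(), "some-node", metav1.GetOptions{ResourceVersion: "0"})
	if err != nil {
		panic(err)
	}

	fmt.Println(fresh.ResourceVersion, cached.ResourceVersion)
}
```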
@gojoy I'm still not convinced that an environment variable to control this is the correct approach.
Felix also watches Nodes(), and will have similar scale characteristics that require the use of Typha at even a moderate number of nodes. I think the correct solution here is to enable Typha in your cluster - it's exactly what it was designed and intended for!
that is most likely a bug in the allocateip.go code
There is an actual "bug" here, which is that we are performing a Get() directly against the API server for the node instead of using a cached version. Specifically these lines: https://github.com/projectcalico/calico/blob/07ad564f962be48c14c38abfaa159319770bda6b/node/pkg/allocateip/allocateip.go#L225-L235
I think there is a strong case to be made for improving that to be cache driven instead of generating API calls.
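As a rough sketch of what "cache driven" could look like - assuming a plain client-go shared informer rather than Calico's own syncer machinery, and with a made-up node name - repeated lookups are served from a local watch-driven cache instead of generating API calls:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A shared informer maintains a local, watch-driven cache of Nodes, so
	// repeated lookups never hit the API server directly.
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	nodeLister := factory.Core().V1().Nodes().Lister()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Served from the local cache instead of a direct Get() against the apiserver.
	node, err := nodeLister.Get("some-node")
	if err != nil {
		panic(err)
	}
	fmt.Println(node.Name)
}
```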
Expected Behavior
An environment variable is needed to control whether the calico-node -allocate-tunnel-addrs process is started when running the calico pod in Kubernetes.
Current Behavior
The calico-node pod always starts the -allocate-tunnel-addrs process, even when the cluster uses BGP mode only and no tunnels are needed.
Possible Solution
Add a check on a CALICO_DISABLE_TUNNEL env var in the rc.local script.
Steps to Reproduce (for bugs)
1.
2.
3.
4.
Context
In our scenario, the Kubernetes cluster uses Calico's BGP mode exclusively, so we want to stop starting the calico-node -allocate-tunnel-addrs process in order to reduce the pressure Calico puts on the backend.
Your Environment