Delay between Calico pick IP address and assign address to pod

hmlkao commented 2 years ago

Expected Behavior

When I scale up deployment from 1 to 100 replicas which are all scheduled to one node, it takes about 2-3 minutes then all of them obtain IP address. I expect that IPs will be assigned much faster.

Current Behavior

The time between the scheduling the pod to node and assigning of IP address to this pod increases according to the amount of pods scheduled to this node. In other words when I set 10 replicas addresses will be assigned much faster then I set 100 replicas.

Possible Solution

I'm not sure where exactly the delay occurs, see the Example below.

I've tried:

to allocate calico-node pods much more resources but without any difference in behaviour.
to set GOMAXPROCS as env variable for calico-node pods also without success.

Steps to Reproduce (for bugs)

Cordon all workers except one
Scale up deployment of any app from 1 to 100 replicas
Check time when the last scheduled pod obtain IP address

Context

When dev team release new app version and many of pods are scheduled to one of workers there is significant delay between pod is really Running. We are running K8s on powerful bare metal workers (about 100 vCPUs and 512 GB RAM) using Calico as CNI.

Your Environment

Calico v3.20.3
K8s v1.21.9
Ubuntu 20.04.3 LTS
HWE kernel 5.11.0-40-generic

Example timeline of issue

Test done on one "clean" worker in our lab environment without any other workload, logs were take from console, consul-connect-inject-init container within an app pod and calico-node pod

Full logs are in files consul-init.log and calico-node.log.

12:11:50 - deployment demo-ondra-appmain-dev scaled to 100 replicas
12:12:02 - one of pods (demo-ondra-appmain-dev-798749569d-zjdtl) created (in Pending state, scheduled to k8s-dc1-016)
12:12:13 - pod is in Init state
12:12:19 - calico-node pod first record about this pod
12:12:19 - calico-node chose IP address to be assigned to pod < YES! since this time is IP known
12:12:32 - connect-inject-init container start looking for service
12:13:22 - IP address assigned to pod
12:13:25 - connect-inject-init container is succesful > Init containers were completed > Running state

If you need any other details, let me know.

It would be great to know where delay occurs and what is bottle neck of this behavior.

caseydavenport commented 2 years ago

@hmlkao a few things to point out:

calico/node doesn't allocate IP addresses. This is performed by the Calico CNI plugin, so what you're seeing in the calico-node.log is when calico/node learns about the IP that has been allocated by the Calico CNI plugin (which logs to a different location).
There are a lot of moving parts in scaling up that many pods at once, including rate limits within Kubernetes itself. One issue that has been known for a long time is that even when IP addresses are allocated, the kubelet will batch updates to the API to reduce load, meaning IPs may not appear in the Kubernetes API for some time.

What makes you think that Calico is at fault here? It very well might be, but from what I can tell the pod goes into Init state at 11:12:13, and then calico/node learns about the IP 11:12:18, only 5 seconds later. That suggests that the process of launching a container, allocating an IP, configuring routes and sysctls, and communicating the IP to calico/node takes 5 seconds which is pretty reasonable.

12:13:22 - IP address assigned to pod

How are you measuring this? From the log, it appears the IP is assigned well before that (at 11:12:18)

If I had to guess, what you are seeing is that Calico is allocating the IP addresses immediately, but that kubelet is not reporting the IP back to the API until some time later due to batching / rate-limiting mentioned above. This is the original issue about that, which wasn't deemed serious enough to fix: https://github.com/kubernetes/kubernetes/issues/39113

hmlkao commented 2 years ago

@caseydavenport , thanks for your response and notes.

What I was tough is that the delay is created between Calico allocate an IP address from range and writing it to "somewhere" (I don't know how it is processed) on node. But as you mentioned the delay can be caused by kubelet before it report it back to API. However, pod is able to create connections to network after the IP address is reported which should be IMO independent from reporting IP address to API (I suppose that the connection should be available right after the interface is created). It's little bit confusing for me.

Maybe I could check interfaces created on node during a test (which I didn't). Or can you advise me what would I should look for?

How are you measuring this?

"Measured" as simply as I could by refreshing the kubectl in terminal

Thanks for the link to issue, I'll try some tests with modified kubelet parameters.

hmlkao commented 2 years ago

Unfortunately, changing of kubelet parameters (kube-api-qps, kube-api-burst, event-qps and event-burst) doesn't help. The delay is still the same.

caseydavenport commented 2 years ago

@hmlkao as discovered on the original issue, those parameters do help somewhat in extreme circumstances but they are not the only problem here. Those are API client ratelimiting parameters, whereas this is a kubelet specific piece of batching code which further limits this, which I do not believe to be configurable.

"Measured" as simply as I could by refreshing the kubectl in terminal

The vast majority of the time before this shows up in kubectl is likely to be outside of Calico's control. Based on the logs you provided Calico appears to be assigning an address and returning it to Kubernetes quickly, so I don't think this is a Calico issue, rather a problem elsehwere in the system.

pod is able to create connections to network after the IP address is reported which should be IMO independent from reporting IP address to API

I would expect the pod to have network access very shortly after the IP is assigned to the interface, which will happen before it is reported in the API. However, note that some network functions (like services) require the IP to appear in the API in order to function because the IP needs to be known on remote nodes as well, so not all network access will succeed until the IP does appear in the API.

Or can you advise me what would I should look for?

12:12:32 - connect-inject-init container start looking for service

Could you explain what this init container does? Is this an init container in the pod that is launching?

hmlkao commented 2 years ago

Sorry for the delay, I want to give you a comprehensive answer so I'd to read some more information.

Could you explain what this init container does? Is this an init container in the pod that is launching?

This init container is part of Consul Connect (service mesh implementation by Consul), container is injected to pod by Consul mutating webhook, this container is trying to open connection to Consul Agent daemonset on node (via HOST_IP and nodePort ), it is looking for associated service registered in Consul catalog to prepare config for Envoy sidecar running beside app container.

There is K8s deployment and pod of app: deployment.yaml.txt, pod.yaml.txt

I thought that it could be a problem with overloaded Consul Agent so I did a test with app without this init container and the behavior is similar. So it isn't caused by Consul at all.

I've done more tests with check when is container created on node and looked at tons of logs and I'm more and more convinced that the problem is outside of Calico scope. So thank you very much for cooperation, I'll try to investigate elsewhere.

caseydavenport commented 2 years ago

Ok cool, I'm going to close this for now but if you do find more evidence that this is related to Calico just shout and we can reopen.

projectcalico / calico