projectcalico / bird

Calico's fork of the BIRD protocol stack

CPU usage of bird spikes to 100% and stays there for a while #102

Closed dbfancier closed 1 year ago

dbfancier commented 2 years ago

Expected Behavior

Normally we expect bird's CPU usage to stay between 1% and 3%; anything higher takes CPU away from our training jobs.

Current Behavior

CPU usage of bird is usually around 30%, but occasionally spikes to 100% and stays there for a while.

We ran a CPU hot-spot analysis with perf and found that CPU time was concentrated in the functions if_find_by_name (about 84%) and if_find_by_index (about 12%).
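For intuition on why a lookup by name could dominate the profile: if each lookup walks the interface list from the front, its cost grows with the total number of nodes, whether or not those nodes still correspond to real devices. The following is a toy Python model, not BIRD's code; `Iface`, `find_by_name`, `build_list`, and the node counts are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Iface:
    name: str
    index: int


def find_by_name(ifaces, name):
    """Walk the list front to back; return (match or None, comparisons made)."""
    comparisons = 0
    for iface in ifaces:
        comparisons += 1
        if iface.name == name:
            return iface, comparisons
    return None, comparisons


def build_list(stale_count, live_count):
    """Toy list in which stale entries (index 0) sit ahead of the live ones."""
    stale = [Iface(f"cali-stale-{i}", 0) for i in range(stale_count)]
    live = [Iface(f"cali-live-{i}", i + 1) for i in range(live_count)]
    return stale + live


if __name__ == "__main__":
    for stale_count in (0, 30_000):
        ifaces = build_list(stale_count, live_count=100)
        _, cost = find_by_name(ifaces, "cali-live-99")
        print(f"stale={stale_count:>6} comparisons per lookup={cost}")
```

In this model a lookup that misses, or that hits an entry near the tail, pays for every stale node on every call, which would be consistent with the profile above if the real lookup is also a linear walk.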

So I sent SIGUSR1 to bird to get a dump. It shows that iface_list has 30,000 to 40,000 nodes. For most nodes the index field is 0, the flags include LINK-DOWN and SHUTDOWN, and the MTU is 0. The affected devices all have the same prefix, "cali", and are generated by the CNI plugin.

These devices no longer exist on the host, but remain in iface_list. Our scenario is offline training, so many pods are created and deleted every day.
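One way to confirm that dumped entries are stale is to compare their names against the devices the kernel currently exposes. Below is a minimal sketch, assuming a Linux host where `/sys/class/net` lists the live devices and assuming the interface names have already been extracted from the dump (the dump format is not parsed here). The function names and sample names are hypothetical.

```python
import os


def live_interface_names(sysfs_dir="/sys/class/net"):
    """Names of the network devices the Linux kernel currently exposes."""
    return set(os.listdir(sysfs_dir))


def stale_names(dumped_names, live_names):
    """Names present in the dump but absent from the host, in dump order."""
    return [name for name in dumped_names if name not in live_names]


if __name__ == "__main__":
    # Hypothetical names; in practice these would come from the SIGUSR1 dump.
    dumped = ["lo", "eth0", "cali0123456789a", "calideadbeef001"]
    if os.path.isdir("/sys/class/net"):
        stale = stale_names(dumped, live_interface_names())
        print(f"{len(stale)} of {len(dumped)} dumped interfaces are not on this host")
    else:
        print("/sys/class/net not found; this sketch assumes Linux")
```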

For now I rebuild the list with the drastic workaround of killing bird.

I wonder whether kif_scan() has a problem in how it maintains the interface list. We hope the community can help identify and fix the problem.
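To make concrete what pruning the list on each scan could look like, here is a generic mark-and-sweep sketch. It is an illustration under assumed names (`IfaceTable`, `scan`, `kernel_view`), not a description of what kif_scan() actually does or of how any fix is implemented.

```python
class IfaceTable:
    """Toy interface table keyed by name; values are kernel interface indexes."""

    def __init__(self):
        self.by_name = {}

    def scan(self, kernel_view):
        """Sync with one scan result (a dict of name -> index).

        Mark: add or refresh every interface the kernel reported.
        Sweep: drop every entry that was not reported in this scan.
        Returns the sorted names that were removed.
        """
        seen = set()
        for name, index in kernel_view.items():
            self.by_name[name] = index
            seen.add(name)
        removed = sorted(set(self.by_name) - seen)
        for name in removed:
            del self.by_name[name]
        return removed


if __name__ == "__main__":
    table = IfaceTable()
    table.scan({"eth0": 2, "caliaaa": 10, "calibbb": 11})
    # Two pods are deleted and one is created before the next scan.
    removed = table.scan({"eth0": 2, "caliccc": 12})
    print("removed:", removed)
    print("table size:", len(table.by_name))
```

With a sweep like this the table size tracks the number of devices on the host rather than the number of pods ever created, so lookup cost would stay bounded under heavy pod churn.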

Thanks a lot.

Possible Solution

kill -SIGKILL $(pidof bird)


splitice commented 2 years ago

Probably fixes #95

caseydavenport commented 2 years ago

This PR seems relevant, which aims to remove interfaces from BIRD when they go away: https://github.com/projectcalico/bird/pull/103

caseydavenport commented 1 year ago

This PR was merged to master recently: https://github.com/projectcalico/bird/pull/104

It looks like it has potential to fix this issue. We'll soak it and release it in v3.25 and hopefully we can close this then.

caseydavenport commented 1 year ago

Actually going to close this now as a duplicate of https://github.com/projectcalico/bird/issues/95