projectcalico / bird

Calico's fork of the BIRD protocol stack

bird cpu single core usage is always 100% #77

Closed jinnzy closed 4 years ago

jinnzy commented 4 years ago

Hello, we ran into this problem in our production environment; any way to solve it would be greatly appreciated.

BIRD's single-core CPU usage is constantly 100%. perf top shows that the if_find_by_name function is the biggest CPU consumer (see screenshot).

Expected Behavior

CPU usage reduced to within the normal range (see screenshot).

Current Behavior

(screenshot of current CPU usage)


jinnzy commented 4 years ago

Here is the relevant information I found:

https://trzepak.pl/viewtopic.php?t=63030
https://www.mail-archive.com/bird-users@network.cz/msg04492.html

fasaxc commented 4 years ago

Is your routing table particularly large? What do you get for ip route and ip addr?

jinnzy commented 4 years ago

> Is your routing table particularly large? What do you get for ip route and ip addr?

# ip addr |wc -l 
1074
# ip route  |wc -l 
122

Is this routing table too large? I modified this parameter and the CPU usage has now returned to normal:

scan time 10;       # Scan kernel routing table every 10 seconds
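For reference, a minimal sketch of where this option lives in a BIRD 1.x configuration; the surrounding block is illustrative, not Calico's exact generated template (both the kernel and the device protocols accept a scan time setting):

    protocol kernel {
      learn;              # learn routes added by other programs
      persist;            # keep routes installed after BIRD exits
      scan time 10;       # rescan the kernel routing table every 10 seconds
      import all;
      export all;
    }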

Thank you for your reply.

fasaxc commented 4 years ago

Are you running kube-proxy in IPVS mode? I believe it assigns every service VIP locally.
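A quick way to confirm which mode kube-proxy is running in, assuming the default metrics address 127.0.0.1:10249 and the default kube-ipvs0 dummy device name:

    curl -s http://127.0.0.1:10249/proxyMode   # prints "ipvs" or "iptables"
    ip link show kube-ipvs0                    # this dummy device only exists in IPVS mode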

aligthart commented 4 years ago

We have a similar issue (we observe this: https://github.com/projectcalico/calico/issues/2992): 3 masters on a kubespray cluster (kubespray 2.11.0 on Ubuntu 16.04). The cluster runs fine for a number of weeks, then starts failing. Note that this is our CI cluster, so there are a lot of 'helm install' and 'helm delete' commands.

$ ip addr |wc -l 
14376

That looks like a very big number to me. Anything to check further?

Currently we just reboot the masters one by one. Is there a cleaner fix/workaround?

fasaxc commented 4 years ago

@aligthart are you in kube-proxy IPVS mode? It adds every service IP to a local dummy device (due to a requirement of the IPVS stack).
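A rough way to check whether that dummy device accounts for the bulk of the 14k addresses above (again assuming the default kube-ipvs0 device name):

    ip addr show dev kube-ipvs0 | grep -c inet   # service VIPs bound to the IPVS dummy device
    ip addr | wc -l                              # total, for comparison with the count above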

aligthart commented 4 years ago

@fasaxc yes we are. Below is the kube-proxy ConfigMap:

  config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    bindAddress: 0.0.0.0
    clientConnection:
      acceptContentTypes: ""
      burst: 10
      contentType: application/vnd.kubernetes.protobuf
      kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
      qps: 5
    clusterCIDR: xxxxxxxxxxx
    configSyncPeriod: 15m0s
    conntrack:
      maxPerCore: 32768
      min: 131072
      tcpCloseWaitTimeout: 1h0m0s
      tcpEstablishedTimeout: 24h0m0s
    enableProfiling: false
    healthzBindAddress: 0.0.0.0:10256
    hostnameOverride: xxxxxxxxx
    iptables:
      masqueradeAll: false
      masqueradeBit: 14
      minSyncPeriod: 0s
      syncPeriod: 30s
    ipvs:
      excludeCIDRs: []
      minSyncPeriod: 0s
      scheduler: rr
      strictARP: false
      syncPeriod: 30s
    kind: KubeProxyConfiguration
    metricsBindAddress: 127.0.0.1:10249
    mode: ipvs
    nodePortAddresses: []
    oomScoreAdj: -999
    portRange: ""
    resourceContainer: /kube-proxy
    udpIdleTimeout: 250ms
    winkernel:
      enableDSR: false
      networkName: ""
      sourceVip: ""

Should we switch to iptables? Is anything else wrong in our proxy setup?

fasaxc commented 4 years ago

Switching to iptables mode should help with BIRD CPU, but presumably you're using IPVS mode for a reason (high numbers of services?)
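If you do switch, a rough sketch for a cluster where kube-proxy is configured through the kube-system/kube-proxy ConfigMap; kubespray manages kube-proxy itself, so treat the steps below as illustrative rather than the exact procedure:

    kubectl -n kube-system edit configmap kube-proxy        # change "mode: ipvs" to "mode: iptables"
    kubectl -n kube-system rollout restart daemonset kube-proxy
    # Leftover IPVS state and the dummy device may need cleaning up on each node (or just reboot):
    #   ipvsadm --clear
    #   ip link delete kube-ipvs0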

aligthart commented 4 years ago

@fasaxc Thanks for the quick replies. Our cluster is relatively small, so we do not benefit from the performance gains of IPVS yet. We will try iptables and monitor the situation again. Note that after a reboot of our masters the 'ip addr' line count was about 100.

spikecurtis commented 4 years ago

@jinnzy are you using IPVS? We fixed an issue related to it in https://github.com/projectcalico/confd/pull/314

If so, I think this issue can be closed.

jinnzy commented 4 years ago

> @jinnzy are you using IPVS? We fixed an issue related to it in projectcalico/confd#314
>
> If so, I think this issue can be closed.

Yes, thank you very much.

wd commented 1 year ago

We met the same issue today. We don't have too many nodes/routes. We had two pods with the CPU/health issue; I fixed it by restarting the service.
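For anyone hitting this on a standard Calico install, restarting the BIRD side usually means restarting the calico-node pods; a sketch assuming the default DaemonSet name and namespace:

    kubectl -n kube-system rollout restart daemonset calico-node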

fasaxc commented 1 year ago

@wd this is a very old issue; if you can reproduce it on up-to-date Calico, please open a new issue.