projectcalico / bird

Calico's fork of the BIRD protocol stack

BIRD crashes a few seconds after startup #46

Closed aguerra closed 6 years ago

aguerra commented 7 years ago

Expected Behavior

It should keep running.

Current Behavior

It crashes.

Possible Solution

Steps to Reproduce (for bugs)

  1. Turn off all nodes and masters of a k8s 1.5.7 cluster (created with kops 1.5.3)
  2. Turn on all nodes and masters.
  3. After the cluster is up again, some nodes have an unstable network because BIRD keeps crashing.
  4. After a while some nodes settle down, others don't.

Context

By turning the cluster off at night we can save money.

Your Environment

Backtrace:

GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bird...done.

warning: exec file is newer than core file.
[New LWP 8379]
Core was generated by `bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg'.
Program terminated with signal SIGILL, Illegal instruction.
#0  __fortify_FD_SET (__s=<optimized out>, __f=<optimized out>) at /usr/include/fortify/sys/select.h:44
44      /usr/include/fortify/sys/select.h: No such file or directory.
(gdb) bt
#0  __fortify_FD_SET (__s=<optimized out>, __f=<optimized out>) at /usr/include/fortify/sys/select.h:44
#1  io_loop () at io.c:2087
#2  0x0000000000400aac in main (argc=<optimized out>, argv=<optimized out>) at main.c:833

aguerra commented 7 years ago

I think I have a clue: bird.cfg contains a lot of old IPs that aren't used anymore, and I think BIRD is opening more than 1024 file descriptors (the select limit).
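
For context on the 1024 figure: select() tracks descriptors in a fixed-size bitmap (FD_SETSIZE bits, 1024 on typical Linux builds), so a descriptor numbered 1024 or higher has no slot in it. The backtrace above ends in __fortify_FD_SET, which is consistent with a hardened FD_SET aborting on such a descriptor, although that reading is an inference and is not confirmed in this thread. A toy model of the bitmap, as a sketch only; the names FdSet, add, and contains are hypothetical and are not BIRD code:

```python
FD_SETSIZE = 1024  # assumed value; the select limit cited in the comment above


class FdSet:
    """Toy model of a fixed-size fd_set bitmap (illustrative, not BIRD code)."""

    def __init__(self) -> None:
        # One bit per possible descriptor number, 0 .. FD_SETSIZE - 1.
        self.bits = bytearray(FD_SETSIZE // 8)

    def add(self, fd: int) -> None:
        # A hardened FD_SET would abort the process at this point;
        # this model raises an exception instead.
        if fd < 0 or fd >= FD_SETSIZE:
            raise ValueError(f"fd {fd} does not fit in {FD_SETSIZE} bits")
        self.bits[fd // 8] |= 1 << (fd % 8)

    def contains(self, fd: int) -> bool:
        in_range = 0 <= fd < FD_SETSIZE
        return in_range and bool(self.bits[fd // 8] & (1 << (fd % 8)))
```

Descriptor 1023 is the last one that fits; adding 1024 fails no matter how many descriptors the process is otherwise allowed to open.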

robbrockbank commented 7 years ago

@aguerra: excellent sleuthing! It sounds as though there is a bunch of node configuration that needs tidying up. I suspect that if you're constantly spinning nodes up and down without explicitly deleting the node configuration for each torn-down node, you'll end up leaking node resources. These resources are used to build the full BGP mesh, so, as you suggest, BIRD will try to peer with each of the old nodes.
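
A rough way to see how leaked node resources add up, as a sketch under stated assumptions: each registered node (stale or live) contributes one peer to every other node's mesh, each peer is assumed to cost one socket, and `other_fds` is a made-up allowance for everything else the process holds open. None of these numbers come from BIRD itself.

```python
SELECT_LIMIT = 1024  # the select limit cited earlier in the thread


def peers_in_full_mesh(registered_nodes: int) -> int:
    """Each node peers with every other registered node, stale or not."""
    return max(registered_nodes - 1, 0)


def may_exceed_select_limit(registered_nodes: int, other_fds: int = 16) -> bool:
    """Rough check: assumes one socket per peer plus `other_fds` extras."""
    return peers_in_full_mesh(registered_nodes) + other_fds > SELECT_LIMIT


if __name__ == "__main__":
    # Hypothetical node counts, chosen only to show both outcomes.
    for count in (50, 1100):
        print(count, may_exceed_select_limit(count))
```

Under these assumptions a cluster that only ever runs a few dozen nodes can still cross the limit once enough stale entries accumulate.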

Could you use calicoctl to query the node configuration:

calicoctl get nodes

If this returns a bunch of stale nodes, you'll need to delete them. You can also do this through calicoctl:

calicoctl delete node <name of node>

Longer term, I think we either need to implement deletion of nodes in the Calico controller (based on when the node is deleted in Kubernetes), or perhaps introduce some form of TTL for the per-node data.

For now though I think the only option will be to explicitly delete the nodes.
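
With many stale entries, the manual cleanup can be prepared as a dry run. The sketch below only builds the `calicoctl delete node <name>` command lines from two lists of names; it does not call calicoctl or parse its output. The function name and the example node names are hypothetical, and gathering the two lists (from `calicoctl get nodes` and from the nodes currently in the cluster) is left to the operator.

```python
import shlex


def stale_node_commands(calico_nodes, live_nodes):
    """Return a delete command for each Calico node that is no longer live."""
    live = set(live_nodes)
    stale = sorted(set(calico_nodes) - live)
    # shlex.quote guards against names with shell-special characters.
    return [f"calicoctl delete node {shlex.quote(name)}" for name in stale]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    registered = ["ip-10-0-0-1", "ip-10-0-0-2", "ip-10-0-0-3"]
    alive = ["ip-10-0-0-2"]
    for command in stale_node_commands(registered, alive):
        print(command)  # review each line before running it
```

Printing the commands instead of executing them keeps the step reviewable, which matters because deleting a node that is still live would remove it from the mesh.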

aguerra commented 7 years ago

@robbrockbank Thanks, you've been very helpful. I can see all the stale nodes.

fasaxc commented 7 years ago

@robbrockbank Should we up the fd limit as a temporary stop-gap?

robbrockbank commented 7 years ago

@fasaxc - upping the limit would certainly be a good stop-gap.

fasaxc commented 6 years ago

There's a work in progress on a "node (cleanup) controller" here: https://github.com/projectcalico/kube-controllers/pull/176

ozdanborne commented 6 years ago

Closing as the node controller has been merged which cleans up stale nodes.