aguerra closed this issue 6 years ago
I think I have a clue: bird.cfg has a lot of old IPs that aren't used anymore, and I think it's opening more than 1024 file descriptors (the select limit).
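One rough way to check this theory is to count the BGP peer stanzas in the generated config and compare the number against 1024. The sketch below assumes that bird.cfg declares one `protocol bgp` block per mesh peer; the function name `count_bgp_peers` and the example path are illustrative and not taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: count the BGP peer stanzas in a BIRD config file.
# Assumption: each peer appears as a line starting with "protocol bgp ".
# Lines such as "template bgp ..." are deliberately not counted.
count_bgp_peers() {
    grep -c '^[[:space:]]*protocol bgp ' "$1"
}

# Example call (the path is an assumption; adjust to wherever bird.cfg lives):
# count_bgp_peers /etc/calico/confd/config/bird.cfg
```

A count close to or above 1024 would be consistent with the file descriptor theory, although each peer does not necessarily map to exactly one descriptor.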
@aguerra: excellent sleuthing! It sounds as though there is a bunch of node configuration that needs tidying up. I suspect that if you're constantly spinning nodes up and down but not explicitly deleting the node configuration for each torn-down node, you'll end up leaking node resources. These resources are used to create the full BGP mesh, so, as you suggest, BIRD will try to peer with each of the old nodes.
Could you use calicoctl to query the node configuration:
calicoctl get nodes
If this returns a bunch of stale nodes, you'll need to delete them. You can also do this through calicoctl:
calicoctl delete node <name of node>
Longer term, I think we need to either implement deletion of nodes in the Calico controller (triggered when the node is deleted in Kubernetes) or introduce some form of TTL for the per-node data.
For now though I think the only option will be to explicitly delete the nodes.
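The manual cleanup described above could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: it assumes Calico node names match Kubernetes node names, and that `calicoctl get nodes` prints a header row followed by one name per line. The helper `stale_nodes` and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print names listed in the first file but not in the second.
# Usage: stale_nodes CALICO_NAMES_FILE K8S_NAMES_FILE
stale_nodes() {
    calico_sorted=$(mktemp)
    k8s_sorted=$(mktemp)
    # comm requires sorted input; -u also drops duplicate names.
    sort -u "$1" > "$calico_sorted"
    sort -u "$2" > "$k8s_sorted"
    # -23 keeps only the lines unique to the first file.
    comm -23 "$calico_sorted" "$k8s_sorted"
    rm -f "$calico_sorted" "$k8s_sorted"
}

# Possible usage against a live cluster (review the list before deleting anything):
# calicoctl get nodes | tail -n +2 | awk '{print $1}' > calico-nodes.txt
# kubectl get nodes -o name | sed 's|^node/||' > k8s-nodes.txt
# stale_nodes calico-nodes.txt k8s-nodes.txt | while read -r name; do
#     calicoctl delete node "$name"
# done
```

Printing the stale list first and deleting in a separate step keeps the destructive part easy to inspect, which seems prudent given that the name-matching assumption may not hold in every cluster.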
@robbrockbank Thanks, you've been very helpful. I can see all the stale nodes.
@robbrockbank Should we up the fd limit as a temporary stop-gap?
@fasaxc - upping the limit would certainly be a good stop-gap.
There's work in progress on a "node (cleanup) controller" here: https://github.com/projectcalico/kube-controllers/pull/176
Closing, as the node controller, which cleans up stale nodes, has been merged.
Expected Behavior
It should keep running.
Current Behavior
It crashes.
Possible Solution
Steps to Reproduce (for bugs)
Context
By turning the cluster off at night we can save money.
Your Environment
Backtrace: