osrg / gobgp

BGP implemented in the Go Programming Language
https://osrg.github.io/gobgp/
Apache License 2.0
3.64k stars 694 forks source link

GoBGP crashed on Raspberry Pi #1575

Closed yupi closed 6 years ago

yupi commented 6 years ago

I set up my rpi as a route reflector using https://github.com/osrg/gobgp/releases/download/v1.27/gobgp_1.27_linux_armv6.tar.gz binary with a peer on my macbook. In the morning after i awaked my macbook i found GoBGP crashed after the peer goes down. Here is the log and the trace:

pi@raspberrypi:~/projects/vxlan $ sudo ./gobgpd -t yaml -f rr-config.yaml                                            
{"level":"info","msg":"gobgpd started","time":"2018-01-27T23:06:42Z"}                                         
{"Topic":"Config","level":"info","msg":"Finished reading the config file","time":"2018-01-27T23:06:42Z"}                
{"level":"info","msg":"PeerGroup rr-client is added","time":"2018-01-27T23:06:42Z"}                                     
{"Name":"rr-client","Topic":"Peer","level":"info","msg":"Add a peer group configuration","time":"2018-01-27T23:06:42Z"} 
{"level":"info","msg":"Dynamic Neighbor 192.168.88.0/24 is added to PeerGroup rr-client","time":"2018-01-27T23:06:42Z"} 
{"Key":"192.168.88.175","State":"BGP_FSM_OPENCONFIRM","Topic":"Peer","level":"info","msg":"Peer Up","time":"2018-01-27T23:07:14Z"}                                                                                                              
{"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"warning","msg":"hold timer expired","time":"2018-01-27T23:27:14Z"}                                                                                                
{"Key":"192.168.88.175","Reason":"hold-timer-expired","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"info","msg":"Peer Down","time":"2018-01-27T23:27:14Z"}                                                                              
{"Code":4,"Data":null,"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Subcode":0,"Topic":"Peer","level":"warning":
"msg":"sent notification","time":"2018-01-27T23:27:14Z"}                                                                
{"Key":"192.168.88.175","State":"BGP_FSM_OPENCONFIRM","Topic":"Peer","level":"info","msg":"Peer Up","time":"2018-01-27T2
3:35:32Z"}                                                                                                              
{"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"warning","msg":"hold timer expired","time"
:"2018-01-27T23:37:02Z"}                                                                                                
{"Key":"192.168.88.175","Reason":"hold-timer-expired","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"info","msg":
"Peer Down","time":"2018-01-27T23:37:02Z"}                                                                              
{"Code":4,"Data":null,"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Subcode":0,"Topic":"Peer","level":"warning":
"msg":"sent notification","time":"2018-01-27T23:37:02Z"}                                                                
{"Topic":"Peer","level":"warning","msg":"Cant't find the neighbor 192.168.88.175","time":"2018-01-27T23:37:02Z"}        
{"Key":"192.168.88.175","State":"BGP_FSM_OPENCONFIRM","Topic":"Peer","level":"info","msg":"Peer Up","time":"2018-01-27T23:44:41Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"warning","msg":"hold timer expired","time"
:"2018-01-27T23:46:11Z"}                                                                                                
{"Key":"192.168.88.175","Reason":"hold-timer-expired","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"info","msg":"Peer Down","time":"2018-01-27T23:46:11Z"}
{"Code":4,"Data":null,"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Subcode":0,"Topic":"Peer","level":"warning",
"msg":"sent notification","time":"2018-01-27T23:46:11Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_OPENCONFIRM","Topic":"Peer","level":"info","msg":"Peer Up","time":"2018-01-27T23:53:52Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"warning","msg":"hold timer expired","time":"2018-01-27T23:55:22Z"}
{"Key":"192.168.88.175","Reason":"hold-timer-expired","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"info","msg":"Peer Down","time":"2018-01-27T23:55:22Z"}
{"Code":4,"Data":null,"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Subcode":0,"Topic":"Peer","level":"warning","msg":"sent notification","time":"2018-01-27T23:55:22Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_OPENCONFIRM","Topic":"Peer","level":"info","msg":"Peer Up","time":"2018-01-28T00:01:41Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"warning","msg":"hold timer expired","time":"2018-01-28T00:59:11Z"}
{"Key":"192.168.88.175","Reason":"hold-timer-expired","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"info","msg":"Peer Down","time":"2018-01-28T00:59:11Z"}
{"Code":4,"Data":null,"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Subcode":0,"Topic":"Peer","level":"warning","msg":"sent notification","time":"2018-01-28T00:59:11Z"}
{"Topic":"Peer","level":"warning","msg":"Cant't find the neighbor 192.168.88.175","time":"2018-01-28T00:59:11Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_OPENCONFIRM","Topic":"Peer","level":"info","msg":"Peer Up","time":"2018-01-28T01:06:12Z"}
{"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"warning","msg":"hold timer expired","time":"2018-01-28T01:36:42Z"}
{"Key":"192.168.88.175","Reason":"hold-timer-expired","State":"BGP_FSM_ESTABLISHED","Topic":"Peer","level":"info","msg":"Peer Down","time":"2018-01-28T01:36:42Z"}
{"Code":4,"Data":null,"Key":"192.168.88.175","State":"BGP_FSM_ESTABLISHED","Subcode":0,"Topic":"Peer","level":"warning","msg":"sent notification","time":"2018-01-28T01:36:42Z"}
panic: close of closed channel

goroutine 172 [running]:
github.com/osrg/gobgp/vendor/github.com/eapache/channels.(*InfiniteChannel).Close(...)
        /Users/fujita/Devel/go/src/github.com/osrg/gobgp/vendor/github.com/eapache/channels/infinite_channel.go:40
github.com/osrg/gobgp/server.cleanInfiniteChannel(0x128be800)
        /Users/fujita/Devel/go/src/github.com/osrg/gobgp/server/util.go:30 +0x20
github.com/osrg/gobgp/server.(*Peer).stopFSM(0x12a7aa80, 0x1, 0x1)
        /Users/fujita/Devel/go/src/github.com/osrg/gobgp/server/peer.go:671 +0x53c
created by github.com/osrg/gobgp/server.(*BgpServer).handleFSMMessage
        /Users/fujita/Devel/go/src/github.com/osrg/gobgp/server/server.go:854 +0xea0

My RR config is

global:
  config:
    as: 65000
    router-id: 192.168.88.177
    local-address-list:
      - 192.168.88.177
peer-groups:
  - config:
      peer-group-name: rr-client
      peer-as: 65000
    afi-safis:
      - config:
          afi-safi-name: l2vpn-evpn
    route-reflector:
      config:
        route-reflector-client: true
        route-reflector-cluster-id: 192.168.88.177
dynamic-neighbors:
  - config:
      peer-group: rr-client
      prefix: 192.168.88.0/24
iwaseyusuke commented 6 years ago

@yupi Hi, Hmmm... this issue seems to be related to the implementation of Dynamic Neighbor, but I guess it should be a rare case (timing seems to be severe). Is this issue occurred often?

iwaseyusuke commented 6 years ago

The cause of this issue is Peer.outgoing *channels.InfiniteChannel is doubly closed when restarting a Peer instance for the Dynamic Neighbor, but this channel should re-instantiated right after closed like; https://github.com/osrg/gobgp/blob/38223f2f512cce3d05c9ae8f29ade983aa723cef/server/server.go#L921-L928

Just a workaround (not graceful way in Golang), the following fixes this issue?

diff --git a/server/util.go b/server/util.go
index b122fbd..f401b23 100644
--- a/server/util.go
+++ b/server/util.go
@@ -22,11 +22,21 @@ import (
        "syscall"

        "github.com/eapache/channels"
+       log "github.com/sirupsen/logrus"

        "github.com/osrg/gobgp/packet/bgp"
 )

 func cleanInfiniteChannel(ch *channels.InfiniteChannel) {
+       // Avoid panic when InfiniteChannel is already closed
+       defer func() {
+               if err := recover(); err != nil {
+                       log.WithFields(log.Fields{
+                               "Topic": "Server",
+                       }).Warn("failed to close InfiniteChannel: ", err)
+               }
+       }()
+
        ch.Close()
        // drain all remaining items
        for range ch.Out() {
fujita commented 6 years ago

recover() is the last resort so need to fix in a different way.

fujita commented 6 years ago

@yupi The bug is a race related with the dynamic neighbor feature. We fixed it. I've just released the new version including the fix so please try: https://github.com/osrg/gobgp/releases/tag/v1.28

yupi commented 6 years ago

Thanks, will do tests today.