osrg / gobgp

BGP implemented in the Go Programming Language
https://osrg.github.io/gobgp/
Apache License 2.0

Keepalives are not sent until RIB is exchanged between peers #1723

Open dotwaffle opened 6 years ago

dotwaffle commented 6 years ago

I have a TABLE_DUMP_V2 MRTDump file I am testing with, and I'm seeing neighbors send notifications before the peer has sent any prefixes. Here are my configs:

1.toml 2.toml

1. I start "2" first with `sudo gobgpd -f 2.toml -p` -- this allows the gRPC client to work.
2. I then start "1" with `gobgpd -f 1.toml -p`.
3. I inject my prefixes: `time gobgp mrt inject global bird-mrt-routes.dump` (takes 49s).

Everything looks good:

% gobgp neighbor 127.0.0.1
BGP neighbor is 127.0.0.1, remote AS 65001
  BGP version 4, remote router ID 192.0.2.1
  BGP state = established, up for 00:01:35
  BGP OutQ = 0, Flops = 0
  Hold time is 90, keepalive interval is 30 seconds
  Configured hold time is 90, keepalive interval is 30 seconds

  Neighbor capabilities:
    multiprotocol:
        ipv4-unicast:   advertised and received
    route-refresh:  advertised and received
    4-octet-as: advertised and received
  Message statistics:
                         Sent       Rcvd
    Opens:                  1          1
    Notifications:          0          0
    Updates:           689972          0
    Keepalives:             4          4
    Route Refresh:          0          0
    Discarded:              0          0
    Total:             689977          5
  Route statistics:
    Advertised:        689972
    Received:               0
    Accepted:               0

I close everything and start again, doing:

1. I start "2" first with `sudo gobgpd -f 2.toml -p`.
2. I inject my prefixes: `time gobgp mrt inject global bird-mrt-routes.dump` (takes 24s).
3. I now start "1" with `gobgpd -f 1.toml -p`.

The BGP peer comes up, but 90s later I see in "1":

INFO[0013] Peer Up                                       Key=127.0.0.1 State=BGP_FSM_OPENCONFIRM Topic=Peer
WARN[0103] hold timer expired                            Key=127.0.0.1 State=BGP_FSM_ESTABLISHED Topic=Peer
INFO[0103] Peer Down                                     Key=127.0.0.1 Reason=hold-timer-expired State=BGP_FSM_ESTABLISHED Topic=Peer
WARN[0103] sent notification                             Code=4 Data="[]" Key=127.0.0.1 State=BGP_FSM_ESTABLISHED Subcode=0 Topic=Peer

While in this state, "2" thinks the session is still established:

% gobgp neighbor 127.0.0.1
BGP neighbor is 127.0.0.1, remote AS 65001
  BGP version 4, remote router ID 192.0.2.1
  BGP state = established, up for 00:01:36
  BGP OutQ = 0, Flops = 0
  Hold time is 90, keepalive interval is 30 seconds
  Configured hold time is 90, keepalive interval is 30 seconds

  Neighbor capabilities:
    multiprotocol:
        ipv4-unicast:   advertised and received
    route-refresh:  advertised and received
    4-octet-as: advertised and received
  Message statistics:
                         Sent       Rcvd
    Opens:                  1          1
    Notifications:          0          0
    Updates:                0          0
    Keepalives:             1          4
    Route Refresh:          0          0
    Discarded:              0          0
    Total:                  2          6
  Route statistics:
    Advertised:        689972
    Received:               0
    Accepted:               0

As soon as the CPU load dies down (presumably still compiling an adj-RIB-out), the following is seen on "2":

WARN[0377] received notification                         Code=4 Data="[]" Key=127.0.0.1 Subcode=0 Topic=Peer
INFO[0377] Peer Down                                     Key=127.0.0.1 Reason="notification-received code 4(hold timer expired) subcode 0(undefined)" State=BGP_FSM_ESTABLISHED Topic=Peer

I wondered if this was because I was having both gobgpd daemons running on the same host, but I see the same problem when I run it across the internet with two Linux boxes.

The global table still has all the prefixes:

% gobgp global rib summary
Table ipv4-unicast
Destination: 689999, Path: 689999

All I can think of is that RFC 4271 defines a KEEPALIVE as acknowledging the OPEN message, while it also doubles as an End-of-RIB marker -- and since the RIB hasn't been fully exchanged yet and not a single UPDATE message has been sent, something inside gobgpd apparently doesn't want to send any keepalives, even though the hold timer is at risk of expiring.

I can confirm that increasing the hold timer to 300 means the hold timer does not expire, but that is very much curing the symptom rather than attacking the root cause.

Any assistance you're able to provide would be most welcome. It's happening with (yesterday's) master and the latest tagged release too.

fujita commented 6 years ago

Thanks for the report. gobgpd is not designed to withhold keepalives until it has sent all the routes in the RIB, but unintentionally it might work that way. There is a tx queue, and the messages in it are sent in order. gobgpd puts a keepalive message into the queue every keepalive interval. When the session is established, all the routes in the RIB are put into the queue, and sending them all might take longer than the keepalive interval. I'll investigate and work on a fix.

dotwaffle commented 6 years ago

Thank you, @fujita.

The problem isn't that it hasn't finished sending all of the routes -- when the peer comes up, the peer sending the routes sits at 100% CPU for a long time before sending even one route. Sending just one route would produce an UPDATE message, and an UPDATE performs the same function as a KEEPALIVE in keeping the session alive. The problem seems to be the long pause between the session becoming established and the first UPDATE message being sent, presumably while gobgpd is evaluating the current RIB for what is eligible to send -- but nothing is sent until that work is complete.

I think the problem is here: https://github.com/osrg/gobgp/blob/master/server/fsm.go#L1528

Basically, it looks like the entire outgoing RIB is being calculated here (continuously growing the slice) and only the finished list of paths is returned. While that happens, the select is blocked -- the ticker is never picked up for keepalive sending -- because the loop is stuck in the "out" case while table.CreateUpdateMsgFromPaths() processes m.Paths (the full table), doing all the route packing and causing the high CPU load. Running pprof seems to agree: most of the runtime is spent in runtime.cmpbody, runtime.memmove, or the following calls:

     1.43s  5.31% 80.96%     25.61s 95.03%  github.com/osrg/gobgp/table.(*packerV4).add
         0     0% 97.85%     25.65s 95.18%  github.com/osrg/gobgp/server.(*FSMHandler).(github.com/osrg/gobgp/server.sendMessageloop)-fm
         0     0% 97.85%     25.65s 95.18%  github.com/osrg/gobgp/server.(*FSMHandler).sendMessageloop
         0     0% 97.85%     25.65s 95.18%  github.com/osrg/gobgp/table.CreateUpdateMsgFromPaths
         0     0% 97.85%     25.65s 95.18%  github.com/osrg/gobgp/vendor/gopkg.in/tomb%2ev2.(*Tomb).run

I haven't tested this, but I think a "quick fix" would be either to put the keepalive code into its own goroutine (so it can't be blocked), or to move CreateUpdateMsgFromPaths() outside of the select and stream its output into a channel feeding that select, in a similar mechanism to the h.outgoing.Out() channel.

I freely admit having so many paths in a table is very rare, but this initial blocking of keepalives is definitely a bug.

Thank you once more!

fujita commented 6 years ago

Thanks for the details!

Can you try the latest master? I fixed a bug, so CreateUpdateMsgFromPaths() should finish quickly (less than 3 seconds with full routes on my laptop). If send() takes a long time, you'd hit the same issue, but this workaround should work in most cases, I think.

As you pointed out, the current logic that could block keepalives surely needs to be fixed. I'll think about the proper fix.

dotwaffle commented 6 years ago

@fujita The latest master is considerably faster, thank you -- not just in computing the Adj-RIB-Out, but seemingly also in exchanging all the messages and reaching a steady state!

I saw you also merged a change to fix "--only-best" for MRTDump injection, thank you very much! It means I can test with RIPE RIS dumps now and not just data from one of my production systems ;)