This is with 008c961ecd943739c5db63fcd931904804a45aa5.
I looked at this a little bit on Friday. I can reproduce it pretty easily on a VM by adding two addresses, 10.0.0.1/32 and 10.0.0.2/32, onto lo like this:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet 10.0.0.1/32 scope global lo
valid_lft forever preferred_lft forever
inet 10.0.0.2/32 scope global lo
valid_lft forever preferred_lft forever
Then, I run two instances of gobgp. The first has this config.
[global.config]
as = 4200000001
router-id = "10.0.0.1"
local-address-list = ["10.0.0.1"]
[[neighbors]]
[neighbors.config]
neighbor-address = "10.0.0.2"
peer-as = 4200000002
[neighbors.transport.config]
local-address = "10.0.0.1"
[neighbors.ebgp-multihop.config]
enabled = true
multihop-ttl = 16
[neighbors.graceful-restart.config]
enabled = true
notification-enabled = true
long-lived-enabled = true
restart-time = 120
[[neighbors.afi-safis]]
[neighbors.afi-safis.config]
afi-safi-name = "ipv4-unicast"
[neighbors.afi-safis.mp-graceful-restart.config]
enabled = true
[neighbors.afi-safis.long-lived-graceful-restart.config]
enabled = true
restart-time = 43200
The other instance has essentially the same config but with the IPs and ASNs swapped.
There is contention around the peer.fsm.lock semaphore. Each time I reproduce, I see a goroutine stuck trying to obtain a write lock.
goroutine 27 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc4202aaa80)
/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*RWMutex).Lock(0xc4202aaa78)
/usr/local/go/src/sync/rwmutex.go:98 +0x6e
github.com/osrg/gobgp/pkg/server.(*FSM).bgpMessageStateUpdate(0xc4202aaa20, 0xc420f50002)
/root/go/src/github.com/osrg/gobgp/pkg/server/fsm.go:220 +0x47
github.com/osrg/gobgp/pkg/server.(*FSMHandler).sendMessageloop.func1(0xc420f20750, 0x1, 0x1)
/root/go/src/github.com/osrg/gobgp/pkg/server/fsm.go:1617 +0xad7
github.com/osrg/gobgp/pkg/server.(*FSMHandler).sendMessageloop(0xc4201ed360, 0xc4202aaa78, 0xc4200ae160)
/root/go/src/github.com/osrg/gobgp/pkg/server/fsm.go:1685 +0x10f
github.com/osrg/gobgp/pkg/server.(*FSMHandler).(github.com/osrg/gobgp/pkg/server.sendMessageloop)-fm(0x9a7000, 0x2)
/root/go/src/github.com/osrg/gobgp/pkg/server/fsm.go:1724 +0x2a
github.com/osrg/gobgp/vendor/gopkg.in/tomb%2ev2.(*Tomb).run(0xc4201ed360, 0xc4200a1aa0)
/root/go/src/github.com/osrg/gobgp/vendor/gopkg.in/tomb.v2/tomb.go:163 +0x2b
created by github.com/osrg/gobgp/vendor/gopkg.in/tomb%2ev2.(*Tomb).Go
/root/go/src/github.com/osrg/gobgp/vendor/gopkg.in/tomb.v2/tomb.go:159 +0xb9
I haven't been able to tell yet which goroutine is holding the lock and causing this deadlock. I think I've been able to rule out another write lock, because I instrumented all of the calls that acquire a write lock on this semaphore and I don't see anything acquiring it without releasing it. So, it must be one of the many read locks being acquired and then not released. I looked at other goroutines that are waiting on locks, but nothing seemed like an obvious deadlock culprit. I'll look for some better ways to instrument the code and figure out what is going on.
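For anyone wanting to do something similar, a sketch along these lines (a hypothetical wrapper, not what was actually used here and not gobgp code) shows the general idea: wrap the mutex so every write-lock acquire and release is logged with a tag, which makes an unreleased write lock easy to spot.

```go
// Hypothetical instrumentation sketch (not gobgp code): wrap sync.RWMutex so
// every write-lock acquire/release is logged, making an unbalanced write
// lock stand out in the output.
package main

import (
	"log"
	"sync"
)

type tracedRWMutex struct {
	mu sync.RWMutex
}

func (t *tracedRWMutex) Lock(tag string) {
	log.Printf("%s: waiting for write lock", tag)
	t.mu.Lock()
	log.Printf("%s: acquired write lock", tag)
}

func (t *tracedRWMutex) Unlock(tag string) {
	t.mu.Unlock()
	log.Printf("%s: released write lock", tag)
}

func main() {
	var m tracedRWMutex
	m.Lock("bgpMessageStateUpdate")
	// ... critical section ...
	m.Unlock("bgpMessageStateUpdate")
}
```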
I seem to notice a goroutine like the following each time I reproduce the problem. Since this seems to happen when graceful restart is enabled, I think it might have something to do with it. I'll look through the stack trace to see if I can find something:
goroutine 29 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc4202aaa84)
/usr/local/go/src/runtime/sema.go:56 +0x39
sync.(*RWMutex).RLock(0xc4202aaa78)
/usr/local/go/src/sync/rwmutex.go:50 +0x49
github.com/osrg/gobgp/pkg/server.(*Peer).llgrFamilies(0xc4202c6300, 0xc3b908, 0xc4202aaa78, 0x970d03, 0xc420b2c0f0, 0xc420b2c101, 0x1)
/root/go/src/github.com/osrg/gobgp/pkg/server/peer.go:272 +0x4d
github.com/osrg/gobgp/pkg/server.(*Peer).isLLGREnabledFamily(0xc4202c6300, 0x10001, 0x0)
/root/go/src/github.com/osrg/gobgp/pkg/server/peer.go:289 +0x93
github.com/osrg/gobgp/pkg/server.(*BgpServer).filterpath(0xc42022c000, 0xc4202c6300, 0xc42128d900, 0x0, 0x0)
/root/go/src/github.com/osrg/gobgp/pkg/server/server.go:589 +0x3b3
github.com/osrg/gobgp/pkg/server.(*BgpServer).processOutgoingPaths(0xc42022c000, 0xc4202c6300, 0xc42000ebe0, 0x1, 0x1, 0xc42000ebe8, 0x1, 0x1, 0x1, 0xc42000ebe8, ...)
/root/go/src/github.com/osrg/gobgp/pkg/server/server.go:877 +0x166
github.com/osrg/gobgp/pkg/server.(*BgpServer).propagateUpdateToNeighbors(0xc42022c000, 0x0, 0xc420b2e050, 0xc42000ebc8, 0x1, 0x1, 0xc420b2e001)
/root/go/src/github.com/osrg/gobgp/pkg/server/server.go:1127 +0x32d
github.com/osrg/gobgp/pkg/server.(*BgpServer).propagateUpdate(0xc42022c000, 0x0, 0xc420acc000, 0x2710, 0x2710)
/root/go/src/github.com/osrg/gobgp/pkg/server/server.go:1026 +0x258
github.com/osrg/gobgp/pkg/server.(*BgpServer).addPathList(0xc42022c000, 0x0, 0x0, 0xc420acc000, 0x2710, 0x2710, 0x2710, 0x0)
/root/go/src/github.com/osrg/gobgp/pkg/server/server.go:1823 +0xd5
github.com/osrg/gobgp/pkg/server.(*Server).AddPathStream(0xc42000c1e0, 0xcb4300, 0xc4200a1b80, 0xcb8cc0, 0xc42000c1e0)
/root/go/src/github.com/osrg/gobgp/pkg/server/grpc_server.go:792 +0xf2
github.com/osrg/gobgp/api._GobgpApi_AddPathStream_Handler(0xc00520, 0xc42000c1e0, 0xcb33a0, 0xc420206120, 0x0, 0x0)
/root/go/src/github.com/osrg/gobgp/api/gobgp.pb.go:7104 +0xb2
github.com/osrg/gobgp/vendor/google.golang.org/grpc.(*Server).processStreamingRPC(0xc42022e000, 0xcb4000, 0xc4202198c0, 0xc42015e500, 0xc420166b10, 0x121a620, 0x0, 0x0, 0x0)
/root/go/src/github.com/osrg/gobgp/vendor/google.golang.org/grpc/server.go:892 +0x8a5
github.com/osrg/gobgp/vendor/google.golang.org/grpc.(*Server).handleStream(0xc42022e000, 0xcb4000, 0xc4202198c0, 0xc42015e500, 0x0)
/root/go/src/github.com/osrg/gobgp/vendor/google.golang.org/grpc/server.go:986 +0x12b1
github.com/osrg/gobgp/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc420272e10, 0xc42022e000, 0xcb4000, 0xc4202198c0, 0xc42015e500)
/root/go/src/github.com/osrg/gobgp/vendor/google.golang.org/grpc/server.go:551 +0x9f
created by github.com/osrg/gobgp/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
/root/go/src/github.com/osrg/gobgp/vendor/google.golang.org/grpc/server.go:549 +0xa1
I think I found a smoking gun in the last trace I pasted above. First, isLLGREnabledFamily grabs a read lock on the semaphore, with a defer statement to release it on exiting the method. It then calls llgrFamilies, which attempts to grab a read lock of its own.
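Roughly paraphrased (simplified types and made-up details, not the actual peer.go code), the pattern looks like this:

```go
// Simplified paraphrase of the pattern described above, not the real gobgp
// code: the outer method holds fsm.lock for reading via defer, and the inner
// method takes a second read lock on the same mutex.
package main

import "sync"

type fsm struct {
	lock sync.RWMutex
	// ... pConf and other shared state ...
}

type peer struct {
	fsm *fsm
}

func (p *peer) llgrFamilies() []string {
	p.fsm.lock.RLock() // nested read lock on the same mutex
	defer p.fsm.lock.RUnlock()
	return []string{"ipv4-unicast"} // placeholder for the real family list
}

func (p *peer) isLLGREnabledFamily(family string) bool {
	p.fsm.lock.RLock() // outer read lock, held until the method returns
	defer p.fsm.lock.RUnlock()
	for _, f := range p.llgrFamilies() { // the recursive RLock happens in here
		if f == family {
			return true
		}
	}
	return false
}

func main() {
	p := &peer{fsm: &fsm{}}
	_ = p.isLLGREnabledFamily("ipv4-unicast")
}
```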
The above PR is a naive change that shows this is in fact the issue. With it, I am able to get through announcing over 1 million routes without any deadlock. I'll work on the PR and get it into shape to merge.
To make it more explicit what is happening here, I quote the docs for the sync package:
RLock locks rw for reading.
It should not be used for recursive read locking; a blocked Lock call excludes new readers from acquiring the lock. See the documentation on the RWMutex type.
This deadlock happens when the bgpMessageStateUpdate thread blocks for a write lock in the middle of the recursive read lock in the isLLGREnabledFamily thread.
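A tiny standalone program (not gobgp code, just the general RWMutex behavior) reproduces the same interleaving: a reader takes its first RLock, a writer queues up on Lock, and the reader's second RLock then blocks behind the pending writer, so neither side can make progress.

```go
// Minimal standalone demonstration of the deadlock, assuming nothing about
// gobgp: a nested RLock blocks once a writer is queued on the same RWMutex.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	// "Reader": stands in for isLLGREnabledFamily calling llgrFamilies.
	go func() {
		mu.RLock()
		defer mu.RUnlock()
		time.Sleep(100 * time.Millisecond) // let the writer queue up
		mu.RLock()                         // blocks: a writer is now waiting
		defer mu.RUnlock()
		fmt.Println("reader: never reached")
	}()

	// "Writer": stands in for bgpMessageStateUpdate taking the write lock.
	go func() {
		time.Sleep(50 * time.Millisecond) // arrive between the two RLocks
		mu.Lock()                         // blocks until the first RLock is released
		defer mu.Unlock()
		fmt.Println("writer: never reached")
	}()

	time.Sleep(time.Second)
	fmt.Println("main: reader and writer are both still stuck")
}
```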
I think #1821 is in better shape now. I consider it a good enough fix for the issue in the near term.
I think a better long-term way to handle this would be to rearchitect the sharing of the fsm, fsm.pConf, and whatever else is synchronized by this lock. IMO, a single goroutine should be responsible for managing updates to the fsm and should push updated copies of it to other goroutines through channels; those goroutines would pick up the updates by selecting on the channel.
My observation is that updates to the fsm do not happen that frequently in the grand scheme of things. Also, I don't see any harm in another goroutine only picking up an update to the fsm after it finishes its current operations. In my opinion, it would be more consistent for it to wait and get the new state by selecting on the updates channel.
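To make that shape concrete, here is a rough sketch with made-up names (a sketch of the idea, not a patch against gobgp): one owner goroutine is the only writer of the FSM state and pushes copies out over a channel, so readers never take a lock at all.

```go
// Rough sketch of the proposed direction, with hypothetical types and names:
// a single owner goroutine mutates the FSM state and publishes copies to
// other goroutines over a channel, removing the shared RWMutex entirely.
package main

import "fmt"

// fsmState stands in for the data currently shared under fsm.lock.
type fsmState struct {
	state    string
	msgCount int
}

func main() {
	events := make(chan string)    // inputs that mutate the FSM
	updates := make(chan fsmState) // copies pushed to interested goroutines

	// Owner goroutine: the only writer. It applies each event and then
	// pushes a copy of the new state, so readers never need a lock.
	go func() {
		cur := fsmState{state: "idle"}
		for ev := range events {
			cur.msgCount++
			cur.state = ev
			updates <- cur // send a copy of the updated state
		}
		close(updates)
	}()

	// Something feeding events in, e.g. received BGP messages.
	go func() {
		for _, ev := range []string{"opensent", "openconfirm", "established"} {
			events <- ev
		}
		close(events)
	}()

	// A consumer picks up new copies whenever it selects/receives on the
	// updates channel, between its other work.
	for snapshot := range updates {
		fmt.Printf("state=%s messages=%d\n", snapshot.state, snapshot.msgCount)
	}
}
```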
Again, thanks a lot for fixing this. Let's discuss the long-term solution in the other issue, as I commented on the PR.
The switch I'm working with can handle 512k host routes in its FIB, so I was trying to push that many with gobgp. Without GR enabled, I can easily push that many routes. However, when I started using GR or LLGR, gobgpd would hang.
I tried three methods:
I observed the first one hanging before it hit 100 routes. The second seems to hang after pushing around 700-800 routes to the peer. The last one actually gets pretty far; I've seen it get over 300,000 before hanging. Of course, it is about three times slower than the other methods.
When gobgpd is hung, it does not respond to API requests or to a SIGTERM; a SIGINT will kill it.
Below is the script I used to generate the routes. You'll see all three methods in the add_paths function; some of them are just commented out.