ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Deadlock/overload in gateway during spiking goroutine count #8381

Open petar opened 2 years ago

petar commented 2 years ago

@Stebalien reported the deadlock below. This is caused either by a bug or by lack of graceful degradation handling where needed.

goroutine 13827044 [semacquire, 19 minutes]:
sync.runtime_SemacquireMutex(0xc0020fd8a4, 0xc064b9d800, 0x1)
    /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0020fd8a0)
    /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
    /usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc0020fd8a0)
    /usr/local/go/src/sync/rwmutex.go:98 +0x90
github.com/ipfs/go-bitswap/network.(*connectEventManager).Disconnected(0xc0020fd890, 0xc042f9e9f0, 0x22)
    pkg/mod/github.com/ipfs/go-bitswap@v0.3.4/network/connecteventmanager.go:49 +0x47
github.com/ipfs/go-bitswap/network.(*netNotifiee).Disconnected(0xc000c7e000, 0x23b4b40, 0xc000312a80, 0x23ae8e0, 0xc0539f1290)
    pkg/mod/github.com/ipfs/go-bitswap@v0.3.4/network/ipfs_impl.go:442 +0x98
github.com/libp2p/go-libp2p-swarm.(*Conn).doClose.func1.1(0x23a07a0, 0xc000c7e000)
    pkg/mod/github.com/libp2p/go-libp2p-swarm@v0.5.0/swarm_conn.go:85 +0x5b
github.com/libp2p/go-libp2p-swarm.(*Swarm).notifyAll.func1(0xc0636baa20, 0xc0558265d0, 0x23a07a0, 0xc000c7e000)
    pkg/mod/github.com/libp2p/go-libp2p-swarm@v0.5.0/swarm.go:559 +0x66
created by github.com/libp2p/go-libp2p-swarm.(*Swarm).notifyAll
    pkg/mod/github.com/libp2p/go-libp2p-swarm@v0.5.0/swarm.go:557 +0x118

The node was up for 28 minutes, and this has been stuck for 19. Unfortunately, I can't find the goroutine that's actually blocking this because the stack trace was truncated.
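
For context, the shape of the blockage in the trace above is a libp2p notification callback contending on a shared mutex: if another goroutine holds that mutex for a long time, every disconnect notification parks in semacquire. A minimal, self-contained sketch of the pattern (hypothetical names, not the actual go-bitswap code):

```go
package main

import (
	"sync"
	"time"
)

type connectEventManager struct {
	lk    sync.RWMutex
	state map[string]int
}

// Disconnected is called from the swarm's notification goroutines. If lk is
// held elsewhere, every disconnect notification blocks here, as in the dump.
func (c *connectEventManager) Disconnected(peer string) {
	c.lk.Lock()
	defer c.lk.Unlock()
	delete(c.state, peer)
}

// slowPass simulates a long critical section that starves Disconnected.
func (c *connectEventManager) slowPass() {
	c.lk.Lock()
	defer c.lk.Unlock()
	time.Sleep(10 * time.Second) // stand-in for expensive per-peer work
}

func main() {
	m := &connectEventManager{state: map[string]int{}}
	go m.slowPass()
	time.Sleep(100 * time.Millisecond)
	// These goroutines park in semacquire until slowPass releases the lock.
	for i := 0; i < 100; i++ {
		go m.Disconnected("peer")
	}
	time.Sleep(time.Second)
}
```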

Debugging fixtures for investigating this problem are in https://github.com/ipfs/go-ipfs/pull/8380.

BigLep commented 2 years ago

Options for next steps:

  1. Post the goroutine dump.
  2. Post the pprof image.
  3. Substitute the in-code lock with a tracing lock and deploy to one box (see the sketch after this list).
  4. See if we can simulate load locally to reproduce the issue. This would then let us verify whether any code changes fix it.
  5. Pull in extra eyes on the code.
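
For option 3, the tracing lock could be a thin wrapper around sync.Mutex that reports slow acquisitions; a hedged sketch (hypothetical helper, not an existing package):

```go
package tracelock

import (
	"log"
	"sync"
	"time"
)

// Mutex wraps sync.Mutex and logs when acquiring the lock takes too long.
type Mutex struct {
	inner sync.Mutex
	// Warn is the wait duration above which an acquisition is reported.
	Warn time.Duration
}

func (m *Mutex) Lock() {
	start := time.Now()
	m.inner.Lock()
	if wait := time.Since(start); wait > m.Warn {
		log.Printf("slow lock acquisition: waited %s", wait)
	}
}

func (m *Mutex) Unlock() {
	m.inner.Unlock()
}
```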

Various tools that may be useful:

  1. @whyrusleeping stackparse https://github.com/whyrusleeping/stackparse/
  2. https://github.com/linuxerwang/goroutine-inspect
  3. https://github.com/sasha-s/go-deadlock

Stebalien commented 2 years ago

I'm seeing logging backed up:

goroutine 98252740 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0xc00000e10c, 0xc2ab0fee00, 0x1)
    /usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00000e108)
    /usr/local/go/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
    /usr/local/go/src/sync/mutex.go:81
go.uber.org/zap/zapcore.(*lockedWriteSyncer).Write(0xc00000e108, 0xc5f0da6000, 0xd9, 0x400, 0x2fd0ea0, 0x1f9d0c4, 0x6)
    pkg/mod/go.uber.org/zap@v1.19.0/zapcore/write_syncer.go:65 +0xe8
go.uber.org/zap/zapcore.(*ioCore).Write(0xc00018af30, 0x2, 0xc043a0aa2f8580d1, 0x7ef9a4ece61, 0x2fd0ea0, 0x1f9d0c4, 0x6, 0x20a7078, 0x2c, 0x1, ...)
    pkg/mod/go.uber.org/zap@v1.19.0/zapcore/core.go:90 +0x10c
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc5f0fcae40, 0xc4b621c080, 0x1, 0x2)
    pkg/mod/go.uber.org/zap@v1.19.0/zapcore/entry.go:220 +0x12d
go.uber.org/zap.(*SugaredLogger).log(0xc000132108, 0x2, 0x20a7078, 0x2c, 0x0, 0x0, 0x0, 0xc5c9c5b750, 0x2, 0x2)
    pkg/mod/go.uber.org/zap@v1.19.0/sugar.go:227 +0x111
go.uber.org/zap.(*SugaredLogger).Errorw(...)
    pkg/mod/go.uber.org/zap@v1.19.0/sugar.go:191
github.com/ipfs/go-bitswap/internal/decision.(*Engine).ReceiveFrom(0xc000358d20, 0xc002671fb0, 0x26, 0xc317ef2240, 0x1, 0x1)
    pkg/mod/github.com/ipfs/go-bitswap@v0.4.0/internal/decision/engine.go:691 +0xbee
github.com/ipfs/go-bitswap.(*Bitswap).receiveBlocksFrom(0xc001812000, 0x23be9f0, 0xc000040078, 0xc002671fb0, 0x26, 0xc317ef21f0, 0x1, 0x1, 0x3002258, 0x0, ...)
    pkg/mod/github.com/ipfs/go-bitswap@v0.4.0/bitswap.go:487 +0x2e5
github.com/ipfs/go-bitswap.(*Bitswap).ReceiveMessage(0xc001812000, 0x23be9f0, 0xc000040078, 0xc002671fb0, 0x26, 0x23dfff8, 0xc25b8ee5a0)
    pkg/mod/github.com/ipfs/go-bitswap@v0.4.0/bitswap.go:547 +0x3c5
github.com/ipfs/go-bitswap/network.(*impl).handleNewStream(0xc0023d4f00, 0x23da598, 0xc5f746a7c0)
    pkg/mod/github.com/ipfs/go-bitswap@v0.4.0/network/ipfs_impl.go:423 +0x394
github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1(0xc11cd16750, 0x13, 0x7fbda451a078, 0xc5f746a7c0, 0x1, 0x0)
    pkg/mod/github.com/libp2p/go-libp2p@v0.15.0-rc.1/p2p/host/basic/basic_host.go:529 +0xa2
created by github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).newStreamHandler
    pkg/mod/github.com/libp2p/go-libp2p@v0.15.0-rc.1/p2p/host/basic/basic_host.go:372 +0x63a

This appears to be backing everything else up a bit. However, I believe we observed the same problem without these log statements, so we can't immediately blame them. I'm going to drop these to "debug" so we can continue debugging.
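
For reference, "dropping these to debug" amounts to lowering the severity at the call site so the messages are filtered out at the default log level. A minimal sketch with go-log (the logger name and call site are illustrative, not the exact go-bitswap code):

```go
package engine

import logging "github.com/ipfs/go-log/v2"

var log = logging.Logger("bitswap")

func reportBlockedSend(peer string) {
	// Before: emitted at error level on every occurrence, contending on the
	// shared zap write-syncer lock seen in the trace above.
	// log.Errorw("send blocked", "peer", peer)

	// After: debug level, dropped unless the subsystem is explicitly raised,
	// e.g. GOLOG_LOG_LEVEL="error,bitswap=debug".
	log.Debugw("send blocked", "peer", peer)
}
```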

Stebalien commented 2 years ago

Let's try again with https://github.com/ipfs/go-bitswap/pull/526.

guseggert commented 2 years ago

We've been unable to confirm that logging is the bottleneck here. We deployed the change that reduced Bitswap logging verbosity. We also suspected that logging to the serial console was creating the bottleneck, since journald writes it synchronously, but that was disabled and the issue remains. Also, I sometimes take stack dumps while a host is sick and don't see any goroutines blocked on logging (which doesn't rule it out entirely). The volume of logs is not very high, e.g. in dc12-b3 over the past hour we average ~48 log events per second into journald. So while we haven't ruled out logging entirely, we haven't found much additional evidence that supports it as the bottleneck.

I've got two more things to try here: turn off peerlogs entirely on a host (this change is already queued up for deployment), and stress test journald just to sanity check its limits. After that I think we'll have enough data to conclude that logging is not the bottleneck.

BigLep commented 2 years ago

@guseggert : sounds good - thanks for the update. I assume we'll talk about what's next after we rule out logging.

Also, do we have the pprof image of the dump? I assume it doesn't point in any particular direction.

guseggert commented 2 years ago

Yes, I have quite a few at this point :) Here's the original we took: stack.gz
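
For anyone reproducing these: the dumps here are full goroutine stacks. Assuming the default API address (127.0.0.1:5001), a small sketch that pulls one from a running daemon's pprof endpoint:

```go
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// debug=2 returns the full, unabridged stack for every goroutine.
	resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("stack.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```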

guseggert commented 2 years ago

Coming back from our off-site. The logging was a red herring; we fixed a number of related issues (still important fixes in themselves), but they didn't improve the situation.

During these goroutine spikes the hosts are CPU-bound and not IO-bound. CPU profiles show that the CPU is busy iterating and copying maps (Bitswap wantlists). Here are the most frequent stack traces:

    local.runtime.mapiternext
    github.com/ipfs/go-bitswap/message.(*impl).Wantlist
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendIfReady
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue

    local.runtime.mapiternext
    local.runtime.mapiterinit
    github.com/ipfs/go-bitswap/message.(*impl).Wantlist
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendIfReady
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue

    local.runtime.mapiternext
    github.com/ipfs/go-bitswap/wantlist.(*Wantlist).Entries
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).extractOutgoingMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendMessage
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).sendIfReady
    github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue

@Stebalien has a PR (https://github.com/ipfs/go-bitswap/pull/530) that reuses cached wantlists in certain scenarios. I deployed it to a box, but it did not help; the CPU still spins on a combination of map iteration and GC.

Side note: mtail is showing up a lot in CPU profiles too; it uses an entire core to process the nginx access logs. I don't think it's the main culprit, but it's something to keep an eye on.

I'm currently trying to make some optimizations to Bitswap so that it doesn't need to iterate and copy maps as much, which will take me some time since I am still ramping up on the codebase.
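
The general direction, sketched (illustrative only, not the actual go-bitswap change): cache the flattened wantlist and invalidate it on mutation, so each outgoing message doesn't re-walk and re-copy the map.

```go
package wantcache

import "sync"

type Entry struct {
	Cid      string // stand-in for cid.Cid
	Priority int32
}

type Wantlist struct {
	mu     sync.Mutex
	set    map[string]Entry
	cached []Entry // invalidated on every mutation
}

func New() *Wantlist {
	return &Wantlist{set: make(map[string]Entry)}
}

func (w *Wantlist) Add(e Entry) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.set[e.Cid] = e
	w.cached = nil
}

// Entries returns a shared snapshot; callers must treat it as read-only.
// The map is only walked when the cache is stale, not on every send.
func (w *Wantlist) Entries() []Entry {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.cached == nil {
		w.cached = make([]Entry, 0, len(w.set))
		for _, e := range w.set {
			w.cached = append(w.cached, e)
		}
	}
	return w.cached
}
```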

Ultimately, though, we are missing the right kind of backpressure to prevent these situations where we take on more work than we can handle. There will always be room for optimizations, but the service should protect itself from falling over. In this case it looks like we're hitting processing limits related to the number of requests, wantlist sizes, and number of peers, so we can perhaps apply backpressure at that level. I also have a long list of metrics we can add to Bitswap to help future investigations; some of Bitswap's internals and mechanisms need more instrumentation.
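
One possible shape for that backpressure, as a hedged sketch (not an existing kubo mechanism): bound concurrent message handling with a weighted semaphore and shed load once the bound is reached.

```go
package backpressure

import (
	"errors"

	"golang.org/x/sync/semaphore"
)

var ErrOverloaded = errors.New("bitswap: overloaded, dropping message")

// Limiter bounds how many incoming messages are processed concurrently.
type Limiter struct {
	sem *semaphore.Weighted
}

func NewLimiter(maxConcurrent int64) *Limiter {
	return &Limiter{sem: semaphore.NewWeighted(maxConcurrent)}
}

// Handle runs fn only if capacity is available; otherwise it fails fast,
// keeping goroutine counts bounded instead of letting them spike.
func (l *Limiter) Handle(fn func()) error {
	if !l.sem.TryAcquire(1) {
		return ErrOverloaded
	}
	defer l.sem.Release(1)
	fn()
	return nil
}
```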

Stebalien commented 2 years ago

@guseggert can we broadcast fewer wants? I.e., one broadcast want per session? I agree we need better backpressure, but we should be able to cut down the work by several orders of magnitude.

Can you also upload the new profiles?

guseggert commented 2 years ago

I haven't been able to focus on this much, but I do have some newer info. Bitswap is CPU-bound under certain workloads, and our public gateways do hit this bottleneck often, but I don't think it's the main contributor to the goroutine pileup: looking through past metrics from hosts that have experienced this, the two are not strongly correlated.

This morning I pushed out a change to Bitswap switching all mutexes to https://github.com/sasha-s/go-deadlock, configured to report goroutines stuck acquiring a lock for more than 1 minute. The library also tracks the last goroutine to acquire a stuck lock, and every log entry pointed to connectEventManager.MarkUnresponsive:

POTENTIAL DEADLOCK:
Previous place where the lock was grabbed
goroutine 39080090 lock 0xc002991150
pkg/mod/github.com/ipfs/go-bitswap@v0.4.1-0.20211003175403-043c0703cec0/network/connecteventmanager.go:67 network.(*connectEventManager).MarkUnresponsive ??? <<<<<

Runtime behavior was quite different, though: goroutines immediately started piling up at a different rate than normal, with no recovery, so I rolled back the change. Either I misconfigured it, or it has some runtime impact that causes Bitswap to behave differently. Reading through the MarkUnresponsive code paths, I cannot find any deadlocks or work that could get stuck (take that with a grain of salt; I am still relatively new to the codebase). I'm suspicious of the result because the runtime behavior changed, but it could still be pointing in the right direction.
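
For reference, the go-deadlock experiment is roughly wired up like this (a sketch; the exact kubo/bitswap integration is not shown here):

```go
package bsnet

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

// Drop-in replacement for sync.RWMutex while debugging; go-deadlock records
// the last goroutine to acquire each lock and reports long waits.
type connectEventManager struct {
	lk deadlock.RWMutex
}

func init() {
	// Only report goroutines that wait on a lock for more than a minute,
	// matching the ">1 min" configuration described above.
	deadlock.Opts.DeadlockTimeout = time.Minute
}
```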

For now I'm pausing my investigation into this goroutine issue and focusing instead on adding tracing, metrics, and synthetic monitors.

On the CPU side, here's a CPU profile showing the bottleneck on a host at the 95% CPU utilization plateau:

cpuprofile-1.txt

We should set scaling policies to keep the CPU from going that high; standard thresholds should work, and I think 80% would be a good target. In many regions we are routinely hitting the 95% plateau, which is likely impacting latency.

BigLep commented 2 years ago

@guseggert : will attach the large goroutine pprof.

Stebalien commented 2 years ago

@guseggert I've written up the idea I proposed in our discussion in https://github.com/ipfs/go-libipfs/issues/81 (well, a slight variant on it). We likely won't get to it for a while, but it's something to keep on the radar.

BigLep commented 2 years ago

@guseggert : can you please attach the large goroutine pprof, and then we'll park this for the time being.

BigLep commented 2 years ago

@guseggert : what are the next steps for this?

iand commented 1 year ago

I think we encountered this bug in Thunderdome. I wrote some notes on it on Notion.

I took a full profile, which I am investigating. You can get it at bafybeihlgdmnkkaawrge3yzqbocbkzf4mvbcgzxibq4tpwufklr4deqcyu

Elevated persistent goroutines and heap (yellow top line):

Snapshot_2022-09-07_16-28-57

Snapshot_2022-09-07_16-29-15

Also elevated number of peers:

Snapshot_2022-09-07_16-29-26

jtsmedley commented 1 year ago

I believe we encountered this issue yesterday; it brought our cluster to a crawl. Server 4 was actively selected by the cluster allocator during this time. We are running with a concurrency of 10 pins on all nodes, and our cluster has 17 nodes. Oddly, the issue also caused a large amount of read bandwidth, but that bandwidth appeared to go nowhere, as the network chart was nearly flat.

Cluster: v1.0.4 Kubo: v0.15.0

Server 4

(Grafana screenshots: 2022-10-27-09-30-grafana filebase com)

Server 5

(Grafana screenshots: 2022-10-27-09-32-grafana filebase com)