status-im / infra-eth-cluster

Infrastructure for Status-go fleets
https://github.com/status-im/status-go

Research status-go not accepting new connections #39

Closed jakubgs closed 3 years ago

jakubgs commented 3 years ago

We've seen an issue where after some time in operation a status-go node stops accepting new connections:

 > node-canary -log=WARN -staticnode='enode://e87f1d8093d304c3a9d6f1165b85d6b374f1c0cc907d39c0879eb67f0a39d779be7a85cbd52920b6f53a94da43099c58837034afa6a7be4b099bfcd79ad13999@35.238.106.101:443'    
Error detected while waiting for static peer to be added package=status-go/cmd/node-canary error="wait for peer: timeout"

This example is node-08.gc-us-central1-a.eth.prod. We can also see its p2p_peers_absolute metric flatline:

[screenshot: p2p_peers_absolute metric flatlining]

jakubgs commented 3 years ago

Based on alerts from tonight it seems like this is mostly affecting Google Cloud US hosts:

[screenshot: alert overview, mostly Google Cloud US hosts]

But that might be just a coincidence.

jakubgs commented 3 years ago

Just to be sure I built and ran the node-canary locally on the host affected by this issue:

admin@node-08.gc-us-central1-a.eth.prod:~ % node-canary -log=WARN -staticnode='enode://e87f1d8093d304c3a9d6f1165b85d6b374f1c0cc907d39c0879eb67f0a39d779be7a85cbd52920b6f53a94da43099c58837034afa6a7be4b099bfcd79ad13999@35.238.106.101:443'
ERROR[01-22|08:30:09.197] Error detected while waiting for static peer to be added package=status-go/cmd/node-canary error="wait for peer: timeout"

Same result. So I'd say it's not related to firewall rules or networking.

jakubgs commented 3 years ago

The number of file descriptors looks fine:

 % sudo ls /proc/$(pgrep statusd)/fd | wc -l 
105

As does the number of sockets:

 % sudo cat /proc/$(pgrep statusd)/net/sockstat
sockets: used 218
TCP: inuse 8 orphan 0 tw 2 alloc 511 mem 280
UDP: inuse 1 mem 4
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

jakubgs commented 3 years ago

The state of connections within the container doesn't look weird:

bash-4.4# netstat -pnt | awk '{a[$6]++}END{for(x in a){printf "%5d - %s\n", a[x], x}}' | sort
    1 - TIME_WAIT
   29 - ESTABLISHED
  110 - CLOSE_WAIT

jakubgs commented 3 years ago

When I used strace on the process I could see a lot of these:

[pid  1304] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)

I counted about 50000 in 10 seconds.

jakubgs commented 3 years ago

Essentially what the strace shows is that the process is using a futex to handle connections:

[pid  1295] futex(0x279b490, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid  1287] <... futex resumed>)        = 0
[pid  1295] <... futex resumed>)        = 1
[pid  1289] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  1287] sched_yield()               = 0
[pid  1295] futex(0x279f540, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=36878} <unfinished ...>
[pid  1289] futex(0x279f5c0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=58797} <unfinished ...>
[pid  1287] futex(0x279b390, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1295] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  1289] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  1287] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid  1295] futex(0x279f540, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49659} <unfinished ...>
[pid  1289] futex(0x279f5c0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=21047} <unfinished ...>
[pid  1295] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid  1289] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
...

As compared to a healthy process:

[pid 81832] futex(0xc00005f9c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 81823] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 82254] read(98,  <unfinished ...>
[pid 81824] <... futex resumed>)        = 0
[pid 81770] <... futex resumed>)        = 1
[pid 82254] <... read resumed>"=\304\200x0\367\351\322\344\203\232\267\201\244\316\225\341P\374\233,\177e#\301\33\355\nM\324\350\206", 32) = 32
[pid 81823] <... nanosleep resumed>NULL) = 0
[pid 81824] futex(0xc00005f9c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 82254] read(98,  <unfinished ...>
[pid 81823] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 81824] <... futex resumed>)        = 1
[pid 82254] <... read resumed>"o\235\201\351\310\30\225L7\304\334-\201\2\373_\317\235\227\344\312\276XpYF\275s\\\357\327\215"..., 48) = 48
[pid 81832] <... futex resumed>)        = 0
[pid 81770] futex(0x279f5c0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=348719} <unfinished ...>
[pid 82254] read(98,  <unfinished ...>
[pid 81832] epoll_pwait(4,  <unfinished ...>
[pid 81824] futex(0x279f540, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=227266} <unfinished ...>
[pid 82254] <... read resumed>"\\yA\22\251\344\224\32\324\251\223\206&s\201\252", 16) = 16
[pid 81832] <... epoll_pwait resumed>[{EPOLLOUT, {u32=596904736, u64=140050890492704}}], 128, 0, NULL, 0) = 1
[pid 81823] <... nanosleep resumed>NULL) = 0
[pid 82254] write(98, "\"\325\4\2054\211,\300\242\302\0324\375l\271xB3\3711?\307u\346\2\331\1\236A\240[\222", 32 <unfinished ...>
[pid 81823] nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
[pid 81832] read(98,  <unfinished ...>
...

Here we can see epoll_pwait, read, and write being used to communicate with other peers.

jakubgs commented 3 years ago

We can see a pretty sharp drop in envelopes received at 17:08:00 UTC:

[screenshot: drop in envelopes received at 17:08:00 UTC]

jakubgs commented 3 years ago

We can also see it pretty clearly in a drop of packets received and sent:

[screenshot: drop in packets received and sent]

jakubgs commented 3 years ago

IRQs drop off too:

[screenshot: IRQ drop-off]

jakubgs commented 3 years ago

There is no visible spike or drop in container memory usage:

[screenshot: container memory usage]

But there is a drop in CPU usage for one core:

[screenshot: CPU usage drop on one core]

jakubgs commented 3 years ago

We can see a spike in TCP connections aborted due to timeouts and the other end closing them:

[screenshot: spike in aborted TCP connections]

But for some reason we also see a rise in connections:

[screenshot: rise in active connections]

This metric is called netdata_netfilter_conntrack_sockets_active_connections_average, and according to its description:

Netfilter Connection Tracker performance metrics. The connection tracker keeps track of all connections of the machine, inbound and outbound. It works by keeping a database with all open connections, tracking network and address translation and connection expectations.

jakubgs commented 3 years ago

We can also see a spike in allocated TCP sockets:

[screenshot: spike in allocated TCP sockets]

And a drop-off in orphan sockets.

jakubgs commented 3 years ago

I enabled the pprof port on node-08.gc-us-central1-a.eth.prod so we can investigate further when it hits the issue again.
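
For reference, this is roughly how a pprof endpoint gets exposed in a Go process (a minimal sketch; the listen address is an assumption, status-go wires this up through its own flags):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Hypothetical listen address; status-go configures this itself.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    select {} // the rest of the node would run here
}

A full goroutine dump can then be fetched with curl http://localhost:6060/debug/pprof/goroutine?debug=2, which is the kind of dump analyzed below.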

cammellos commented 3 years ago

The first thing I looked for was deadlocks. Searching for Lock shows us which goroutines are stuck:

sync.(*RWMutex).Lock(0xc000176348)
#011/usr/local/go/src/sync/rwmutex.go:103 +0x88
github.com/status-im/status-go/peers.(*TopicPool).processFoundNode(0xc000176300, 0xc0004ef080, 0xc004c16620, 0x0, 0x0)
#011/go/src/github.com/status-im/status-go/peers/topicpool.go:482 +0x5c

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/peers/topicpool.go#L482

Log line: 965

sync.(*Mutex).Lock(...)
#011/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc000176348)
#011/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/status-im/status-go/peers.(*TopicPool).limitFastMode.func1(0xc000176300, 0x6fc23ac00, 0xc004fa9e00)
#011/go/src/github.com/status-im/status-go/peers/topicpool.go:291 +0x13b

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/peers/topicpool.go#L291

Log line: 23659

Both are waiting on the same lock, which indicates that there might be a third goroutine holding it.

So we look for goroutines stuck on TopicPool code.

goroutine 194 [select, 4 minutes]:
github.com/ethereum/go-ethereum/p2p.(*Server).AddPeer(0xc0004ef080, 0xc001e42070)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:327 +0xbd
github.com/status-im/status-go/peers.(*TopicPool).addServerPeer(0xc000176300, 0xc0004ef080, 0xc001ad3bc0)
#011/go/src/github.com/status-im/status-go/peers/topicpool.go:541 +0x7b
github.com/status-im/status-go/peers.(*TopicPool).AddPeerFromTable(0xc000176300, 0xc0004ef080, 0x0)

There are a few, but this one looks suspicious: we know that we can't really add new peers, so a call stuck on AddPeer would make sense.

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/peers/topicpool.go#L541

And it's being called from:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/peers/topicpool.go#L395

Which takes the lock. So it seems the deadlock is due to the p2p server being stuck, and that's the next thing to analyze.
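
To illustrate the pattern being uncovered here (a toy sketch, not the actual status-go code): a goroutine that blocks on a channel operation while holding a mutex stalls every other goroutine waiting for that mutex:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mu sync.Mutex
    disc := make(chan struct{}) // nobody ever reads from this channel

    // Goroutine A: takes the lock, then blocks forever on the channel send,
    // so the deferred Unlock never runs.
    go func() {
        mu.Lock()
        defer mu.Unlock()
        disc <- struct{}{}
    }()

    time.Sleep(100 * time.Millisecond)

    // Goroutine B (main here): waits for the lock and never gets it, just like
    // processFoundNode and limitFastMode waiting on the TopicPool lock above.
    fmt.Println("waiting for the lock...")
    mu.Lock()
    fmt.Println("never reached")
    // In this toy case the runtime reports "all goroutines are asleep - deadlock!";
    // in a real process with other live goroutines it just hangs silently.
}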

Let's check what the server is doing. We know the main loop of the server is probably stuck, so we look for it:

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/p2p/server.go#L649

and that's the server loop:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/p2p/server.go#L699

and we find it in the logs:

goroutine 113 [select, 4 minutes]:
github.com/ethereum/go-ethereum/p2p.(*Peer).Disconnect(0xc0032b98c0, 0x0)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/peer.go:164 +0xb7
github.com/ethereum/go-ethereum/p2p.(*Server).run(0xc0004ef080, 0x1bf4f40, 0xc0000c2c40)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:722 +0x17cc
created by github.com/ethereum/go-ethereum/p2p.(*Server).Start
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:505 +0x4aa

It's stuck on Disconnect called on a peer.

Server code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/p2p/server.go#L722

Peer code:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/p2p/peer.go#L165

From the peer code we see that it is publishing on a channel. That means the channel is full or not being processed, so the send blocks.

Who is processing w.disc, and why is it stuck? Let's look for it.

That's the processing loop, and it's in the peer's run() function:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/p2p/peer.go#L237

Ok, now it gets tricky. There are many Peer objects, and we have one goroutine for each, so we need to find the right one.

We have the address of the object from the previous stack trace:

github.com/ethereum/go-ethereum/p2p.(*Peer).Disconnect(0xc0032b98c0, 0x0)

And the address is this one:

0xc0032b98c0

We'll look for the address in the logs, and we find it:

goroutine 9357 [chan receive, 4 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0004ef208, 0x14db580, 0xc003d67680, 0xc00046d760)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/event/feed.go:133 +0x110
github.com/ethereum/go-ethereum/p2p.(*Server).runPeer(0xc0004ef080, 0xc0032b98c0)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:1053 +0x1e9

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/p2p/server.go#L1053

Ok, so it's not the peer that is stuck. We are not actually running the peer loop, as the goroutine handling this is stuck at Feed.Send, which should not really be blocking. So that's where we go next. Let's pull the specific feed address:

github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0004ef208, 0x14db580, 0xc003d67680, 0xc00046d760)

Therefore the address is:

0xc0004ef208

Ok, many, many goroutines are stuck on Send; let's have a look at what Send does:

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/event/feed.go#L133

All of them are stuck on sendLock, so the send lock is held. We need to find a stack trace that is not stuck there, since that goroutine is the one holding the lock.
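
For context, Feed's sendLock is essentially a one-slot channel used as a lock: the token starts in the channel, Send takes it out, and it gets put back when the send completes. A simplified sketch of that pattern (not the exact Feed code):

package sketch

// chanLock is a one-element channel used as a mutex: acquiring means taking
// the token out of the channel, releasing means putting it back.
type chanLock chan struct{}

func newChanLock() chanLock {
    l := make(chanLock, 1)
    l <- struct{}{} // the lock starts unlocked: the token is available
    return l
}

func (l chanLock) acquire() { <-l }             // blocks while another goroutine holds the token
func (l chanLock) release() { l <- struct{}{} } // hand the token back

So every goroutine blocked on sendLock in the dump is simply waiting for whoever took the token to give it back.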

Ok, this took a bit of time but:

goroutine 5758 [select, 4 minutes]:
reflect.rselect(0xc006111a00, 0x2, 0x2, 0x17e9ff1, 0xe)
#011/usr/local/go/src/runtime/select.go:542 +0x38a
reflect.Select(0xc001a71080, 0x2, 0x3, 0x14db580, 0xc000475c80, 0x16, 0x1, 0x0)
#011/usr/local/go/src/reflect/value.go:2229 +0x170
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0004ef208, 0x14db580, 0xc000475c80, 0x27d9880)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/event/feed.go:170 +0x594
github.com/ethereum/go-ethereum/p2p.(*Server).runPeer(0xc0004ef080, 0xc0037d56e0)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:1064 +0x3f5
created by github.com/ethereum/go-ethereum/p2p.(*Server).run

Here's the goroutine that is hanging:

Code: https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/event/feed.go#L170

Ok, this looks really weird: it's using the reflect package to Select over a set of cases. Unfortunately the trail ends here, as we don't know what the contents of that cases array are; they are not printed in the stack trace. We will have a look at the producers that are adding select cases and work out where those might be blocking.
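
For context, that code builds one reflect.SelectCase per subscriber channel and calls reflect.Select, which blocks until any one of the cases can proceed. A minimal standalone illustration of that mechanism (not the Feed code itself):

package main

import (
    "fmt"
    "reflect"
)

func main() {
    // One send case per "subscriber" channel; only the first can ever proceed here.
    subs := []chan int{make(chan int, 1), make(chan int)}
    cases := make([]reflect.SelectCase, len(subs))
    for i, ch := range subs {
        cases[i] = reflect.SelectCase{
            Dir:  reflect.SelectSend,
            Chan: reflect.ValueOf(ch),
            Send: reflect.ValueOf(42),
        }
    }

    // Blocks until some case can fire. If no subscriber ever receives,
    // this call never returns -- which is exactly what the stack trace shows.
    chosen, _, _ := reflect.Select(cases)
    fmt.Println("delivered to subscriber", chosen)
}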

To be continued.....

Notes (same content): https://notes.status.im/mailserver-stuck-error-logs

statusd_dump.log

cammellos commented 3 years ago

Ok, back where we left off.

We need to look at how things get added to sendCases. It's a bit indirect: first they get added to m.inbox, and on Send they are eventually moved into the set of cases being selected on.

Here they get added:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/event/feed.go#L91

which is the Subscribe function call:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/event/feed.go#L73

And eventually they make their way to sendCases:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/vendor/github.com/ethereum/go-ethereum/event/feed.go#L137

So the code that hangs is code that at some point calls Subscribe and then does not read from that channel. This looks like extremely flimsy code, but what do I know.
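
For reference, the intended usage pattern looks roughly like this: every Subscribe must be paired with a consumer that keeps draining the channel until it unsubscribes, otherwise Send eventually blocks on that subscriber. A sketch using go-ethereum's event package (the event type and buffer size are made up):

package main

import (
    "fmt"

    "github.com/ethereum/go-ethereum/event"
)

type envelopeEvent struct{ Topic string } // stand-in for the real event type

func main() {
    var feed event.Feed

    events := make(chan envelopeEvent, 100) // the buffer only absorbs bursts
    sub := feed.Subscribe(events)
    defer sub.Unsubscribe()

    // The consumer must keep reading until Unsubscribe, or Feed.Send will
    // eventually block on this subscriber once the buffer fills up.
    go func() {
        for {
            select {
            case ev := <-events:
                fmt.Println("got event:", ev.Topic)
            case <-sub.Err():
                return
            }
        }
    }()

    feed.Send(envelopeEvent{Topic: "example"})
}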

Ok, these seem to be the places where we use event.Feed:

api/nimbus_backend.go
api/geth_backend.go
vendor/github.com/ethereum/go-ethereum/p2p/peer.go
vendor/github.com/ethereum/go-ethereum/p2p/server.go
vendor/github.com/ethereum/go-ethereum/p2p/metrics.go
vendor/github.com/ethereum/go-ethereum/p2p/message.go
vendor/github.com/ethereum/go-ethereum/core/blockchain.go
vendor/github.com/ethereum/go-ethereum/core/tx_pool.go
vendor/github.com/ethereum/go-ethereum/accounts/manager.go
vendor/github.com/ethereum/go-ethereum/accounts/scwallet/hub.go
vendor/github.com/ethereum/go-ethereum/accounts/keystore/keystore.go
vendor/github.com/ethereum/go-ethereum/accounts/usbwallet/hub.go
vendor/github.com/ethereum/go-ethereum/light/txpool.go
vendor/github.com/ethereum/go-ethereum/light/lightchain.go
vendor/github.com/ethereum/go-ethereum/eth/downloader/peer.go
whisper/whisper.go
waku/waku.go
services/wallet/service.go
services/wallet/transmitter.go
services/wallet/reactor.go
services/wallet/service_test.go
services/wallet/commands_test.go
services/wallet/commands.go
services/accounts/service.go
services/accounts/accounts.go
services/local-notifications/core_test.go
services/local-notifications/core.go

This will take a while to figure out, but the plan is to look at those places, find the channel, find the consumer, and check in the logs whether anything is stuck consuming that channel. Fun times ahead.

cammellos commented 3 years ago

Ok, a lot of these are either in the go-ethereum codebase, where I'll assume the bugs are not, since more people would have reported the issue, or in services that are not started on normal nodes. So I will start investigating our own code first, namely whisper/waku.

Here is where it gets called:

https://github.com/status-im/status-go/blob/31f2196cb2fcb2783b27531082b8bd0a87172a64/waku/waku.go#L536

Here we subscribe to the events; let's look for a consumer of that channel.

Many :(

eth-node/bridge/geth/waku.go
eth-node/bridge/geth/whisper.go
eth-node/bridge/nimbus/whisper.go
eth-node/types/waku.go
eth-node/types/whisper.go
whisper/whisper.go
waku/waku.go
cmd/node-canary/main.go
services/shhext/api_nimbus.go
services/shhext/service.go
services/shhext/api_geth.go
services/shhext/service_nimbus.go
services/wakuext/api.go
services/ext/mailrequests.go
services/ext/mailservers/connmanager.go
services/ext/mailservers/connmonitor.go
protocol/transport/whisper/envelopes.go
protocol/transport/waku/envelopes.go
protocol/transport/waku/waku_service.go

Oh hello friend:

https://github.com/status-im/status-go/blob/f73fb9c3f640d610ead693d95246b837b5c16990/eth-node/bridge/geth/waku.go#L65

events := make(chan wakucommon.EnvelopeEvent, 100) // must be buffered to prevent blocking whisper

Someone has been here already; it's clearly not this subscription, but yes, you are damn right it should be buffered.

Mmh, friend is wrong. The buffer could still fill up and it would block! That's a problem with the producer: it should never block, so the line above might still be causing the issue.
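
To make that concrete, a toy example: a buffer only absorbs bursts, and once it is full a plain send still blocks forever if nobody is reading:

package main

import "fmt"

func main() {
    events := make(chan int, 2) // small buffer for demonstration; nobody reads it

    events <- 1
    events <- 2
    fmt.Println("buffer full")

    // The third send blocks forever, exactly like a stuck Feed subscriber.
    // (In this toy case the runtime reports "all goroutines are asleep - deadlock!".)
    events <- 3
}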

:facepalm:

cammellos commented 3 years ago

https://gobyexample.com/non-blocking-channel-operations That's what the producer should do, or at least time out on the write; you can't block forever just because a consumer is stuck, come on.
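
In the spirit of that link, a sketch of what a defensive producer can do: a select with a default case makes the send non-blocking, and time.After bounds the wait (names and timeout are illustrative; dropping versus counting the event is a policy choice):

package main

import (
    "fmt"
    "time"
)

// trySend drops the event instead of hanging when the consumer is stuck.
func trySend(events chan<- string, ev string) bool {
    select {
    case events <- ev:
        return true
    default:
        return false // consumer stuck or slow: drop (or count/log) the event
    }
}

// sendWithTimeout gives the consumer a bounded amount of time instead of blocking forever.
func sendWithTimeout(events chan<- string, ev string, d time.Duration) bool {
    select {
    case events <- ev:
        return true
    case <-time.After(d):
        return false
    }
}

func main() {
    events := make(chan string) // unbuffered and unread, to exercise both failure paths
    fmt.Println("non-blocking send delivered:", trySend(events, "envelope"))
    fmt.Println("timed-out send delivered:", sendWithTimeout(events, "envelope", 500*time.Millisecond))
}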

cammellos commented 3 years ago

https://github.com/status-im/status-go/pull/2132 fixes some of the issues identified: we leak a lot of goroutines. It might not be the cause of this issue (though never say never, as it touches that part of the code), but it needs fixing in any case.

jakubgs commented 3 years ago

We thought it was fixed, but we still see some hosts becoming unreachable: https://media.discordapp.net/attachments/626045898878877696/804351174668058624/unknown.png

Clearly it reached the stage where it either doesn't accept new peers or doesn't drop old ones: [screenshot: peer count flatline]

jakubgs commented 3 years ago

What's interesting is the jump in goroutines after the flatline starts:

[screenshot: jump in goroutines after the flatline]

jakubgs commented 3 years ago

We can also see a big spike in container CPU usage at 13:48:00 when the peer flatline starts:

[screenshot: container CPU spike at 13:48:00]

jakubgs commented 3 years ago

And also a drop to zero for container memory page faults:

[screenshot: container memory page faults dropping to zero]

jakubgs commented 3 years ago

Jump in allocated TCP sockets:

[screenshot: jump in allocated TCP sockets]

jakubgs commented 3 years ago

I've enabled DEBUG logs on all 01 waku nodes in eth.prod because INFO level gives me nothing useful.

jakubgs commented 3 years ago

Stack trace from stuck node-08.gc-us-central1-a.eth.prod: statusd_stack.log

cammellos commented 3 years ago

node-06.gc-us-central1-a.eth.prod got stuck again; from the stack trace we should hopefully be able to pinpoint which process is not reading from the Go channel.

I have attached a list of processes that subscribed to the channel; there aren't many, so we should be able to identify it: subscribers-2.txt

jakubgs commented 3 years ago

Canaries have been green ever since three changes were merged:

All three of those combined have resulted in stable status-go nodes:

[screenshot: stable peer counts]

And it's been green for over two days:

[screenshot: canaries green for over two days]

So I consider this resolved.