Based on alerts from tonight it seems like this is mostly affecting Google Cloud US hosts:
But that might be just a coincidence.
Just to be sure, I built and ran node-canary locally on the host affected by this issue:
admin@node-08.gc-us-central1-a.eth.prod:~ % node-canary -log=WARN -staticnode='enode://e87f1d8093d304c3a9d6f1165b85d6b374f1c0cc907d39c0879eb67f0a39d779be7a85cbd52920b6f53a94da43099c58837034afa6a7be4b099bfcd79ad13999@35.238.106.101:443'
ERROR[01-22|08:30:09.197] Error detected while waiting for static peer to be added package=status-go/cmd/node-canary error="wait for peer: timeout"
Same result. So I'd say it's not related to firewall rules or networking.
The number of file descriptors looks fine:
% sudo ls /proc/$(pgrep statusd)/fd | wc -l
105
As does the number of sockets:
% sudo cat /proc/$(pgrep statusd)/net/sockstat
sockets: used 218
TCP: inuse 8 orphan 0 tw 2 alloc 511 mem 280
UDP: inuse 1 mem 4
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
The state of connections within the container doesn't look weird:
bash-4.4# netstat -pnt | awk '{a[$6]++}END{for(x in a){printf "%5d - %s\n", a[x], x}}' | sort
1 - TIME_WAIT
29 - ESTABLISHED
110 - CLOSE_WAIT
When I ran strace on the process, I could see a lot of these:
[pid 1304] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
I counted about 50000 in 10 seconds.
Essentially, what the strace shows is that the process is using futexes to handle connections:
[pid 1295] futex(0x279b490, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1287] <... futex resumed>) = 0
[pid 1295] <... futex resumed>) = 1
[pid 1289] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1287] sched_yield() = 0
[pid 1295] futex(0x279f540, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=36878} <unfinished ...>
[pid 1289] futex(0x279f5c0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=58797} <unfinished ...>
[pid 1287] futex(0x279b390, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1295] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1289] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1287] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 1295] futex(0x279f540, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=49659} <unfinished ...>
[pid 1289] futex(0x279f5c0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=21047} <unfinished ...>
[pid 1295] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 1289] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
...
As compared to a healthy process:
[pid 81832] futex(0xc00005f9c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 81823] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 82254] read(98, <unfinished ...>
[pid 81824] <... futex resumed>) = 0
[pid 81770] <... futex resumed>) = 1
[pid 82254] <... read resumed>"=\304\200x0\367\351\322\344\203\232\267\201\244\316\225\341P\374\233,\177e#\301\33\355\nM\324\350\206", 32) = 32
[pid 81823] <... nanosleep resumed>NULL) = 0
[pid 81824] futex(0xc00005f9c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 82254] read(98, <unfinished ...>
[pid 81823] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 81824] <... futex resumed>) = 1
[pid 82254] <... read resumed>"o\235\201\351\310\30\225L7\304\334-\201\2\373_\317\235\227\344\312\276XpYF\275s\\\357\327\215"..., 48) = 48
[pid 81832] <... futex resumed>) = 0
[pid 81770] futex(0x279f5c0, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=348719} <unfinished ...>
[pid 82254] read(98, <unfinished ...>
[pid 81832] epoll_pwait(4, <unfinished ...>
[pid 81824] futex(0x279f540, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=227266} <unfinished ...>
[pid 82254] <... read resumed>"\\yA\22\251\344\224\32\324\251\223\206&s\201\252", 16) = 16
[pid 81832] <... epoll_pwait resumed>[{EPOLLOUT, {u32=596904736, u64=140050890492704}}], 128, 0, NULL, 0) = 1
[pid 81823] <... nanosleep resumed>NULL) = 0
[pid 82254] write(98, "\"\325\4\2054\211,\300\242\302\0324\375l\271xB3\3711?\307u\346\2\331\1\236A\240[\222", 32 <unfinished ...>
[pid 81823] nanosleep({tv_sec=0, tv_nsec=20000}, <unfinished ...>
[pid 81832] read(98, <unfinished ...>
...
Here we can see the use of epoll_pwait, read, and write when communicating with other peers.
We can see a pretty sharp drop in envelopes received at 17:08:00 UTC:
We can also see it pretty clearly in a drop of packets received and sent:
IRQs drop off too:
There is no visible spike or drop in container memory usage:
But there is a drop in CPU usage for one core:
We can see a spike in TCP connections aborted due to timeouts and the other end closing them:
But for some reason we also see a rise in connections:
This metric is called netdata_netfilter_conntrack_sockets_active_connections_average, and according to its description it is:
Netfilter Connection Tracker performance metrics. The connection tracker keeps track of all connections of the machine, inbound and outbound. It works by keeping a database with all open connections, tracking network and address translation and connection expectations.
We can also see a spike in allocated TCP sockets:
And a drop off in orphan sockets.
I enabled the pprof port on node-08.gc-us-central1-a.eth.prod so we can investigate further when the issue occurs again.
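For reference, exposing pprof in a Go program only takes the net/http/pprof import; a minimal sketch (not status-go's actual wiring) looks like this, and a full goroutine dump is then available at /debug/pprof/goroutine?debug=2:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiling endpoints on localhost only. Goroutine dumps like the
	// ones below can then be fetched from:
	//   http://localhost:6060/debug/pprof/goroutine?debug=2
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the rest of the node
}
```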
The first thing I looked for was deadlocks. Searching for Lock in the goroutine dump shows us which goroutines are stuck:
sync.(*RWMutex).Lock(0xc000176348)
#011/usr/local/go/src/sync/rwmutex.go:103 +0x88
github.com/status-im/status-go/peers.(*TopicPool).processFoundNode(0xc000176300, 0xc0004ef080, 0xc004c16620, 0x0, 0x0)
#011/go/src/github.com/status-im/status-go/peers/topicpool.go:482 +0x5c
Logs line: 965
sync.(*Mutex).Lock(...)
#011/usr/local/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc000176348)
#011/usr/local/go/src/sync/rwmutex.go:98 +0x97
github.com/status-im/status-go/peers.(*TopicPool).limitFastMode.func1(0xc000176300, 0x6fc23ac00, 0xc004fa9e00)
#011/go/src/github.com/status-im/status-go/peers/topicpool.go:291 +0x13b
Logs line: 23659
Both are waiting on the same lock, which indicates that there might be a third goroutine holding it.
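A toy illustration of that pattern (not the actual TopicPool code): one goroutine takes the lock and then blocks forever on something else, so every later Lock() call piles up behind it.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	stuck := make(chan struct{}) // never closed; models the blocked p2p call

	// The "third" goroutine: takes the lock, then blocks forever while holding it.
	go func() {
		mu.Lock()
		defer mu.Unlock()
		<-stuck
	}()

	time.Sleep(100 * time.Millisecond)

	// These two show up in a goroutine dump as sync.(*RWMutex).Lock, like above.
	for i := 0; i < 2; i++ {
		go func(n int) {
			mu.Lock()
			defer mu.Unlock()
			fmt.Println("waiter", n, "got the lock") // never printed
		}(i)
	}

	time.Sleep(time.Second)
	fmt.Println("both waiters are still parked on the same lock")
}
```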
So we look for goroutines stuck in TopicPool code.
goroutine 194 [select, 4 minutes]:
github.com/ethereum/go-ethereum/p2p.(*Server).AddPeer(0xc0004ef080, 0xc001e42070)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:327 +0xbd
github.com/status-im/status-go/peers.(*TopicPool).addServerPeer(0xc000176300, 0xc0004ef080, 0xc001ad3bc0)
#011/go/src/github.com/status-im/status-go/peers/topicpool.go:541 +0x7b
github.com/status-im/status-go/peers.(*TopicPool).AddPeerFromTable(0xc000176300, 0xc0004ef080, 0x0)
There are a few, but this one looks suspicious: we know we can't really add new peers, so a call stuck on AddPeer would make sense.
And it is called from code that takes the lock. So it seems the deadlock is due to the p2p server being stuck; that's the next thing to analyze.
Let's check what the server is doing. We know the main loop of the server is probably stuck, so we look for it:
and that's the server loop:
and we find it in the logs:
goroutine 113 [select, 4 minutes]:
github.com/ethereum/go-ethereum/p2p.(*Peer).Disconnect(0xc0032b98c0, 0x0)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/peer.go:164 +0xb7
github.com/ethereum/go-ethereum/p2p.(*Server).run(0xc0004ef080, 0x1bf4f40, 0xc0000c2c40)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:722 +0x17cc
created by github.com/ethereum/go-ethereum/p2p.(*Server).Start
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:505 +0x4aa
It is stuck on Disconnect called on a peer.
Peer code:
From the peer code we see that it is publishing on a channel. That means the channel is full or not being read, so it blocks.
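As a simplified model of why that send blocks (the disc field and Disconnect mirror the p2p peer code, but this is a sketch, not the real implementation):

```go
package main

import (
	"fmt"
	"time"
)

// peer is a toy stand-in for p2p.Peer: Disconnect pushes the reason onto a
// channel that the peer's run loop is supposed to drain.
type peer struct {
	disc chan string
}

func (p *peer) Disconnect(reason string) {
	p.disc <- reason // unbuffered: only returns once the run loop receives it
}

func main() {
	p := &peer{disc: make(chan string)}

	done := make(chan struct{})
	go func() {
		p.Disconnect("too many peers") // nobody is reading p.disc
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("Disconnect returned")
	case <-time.After(time.Second):
		fmt.Println("Disconnect is blocked, like the goroutine in the stack trace")
	}
}
```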
Who is processing w.disc, and why is it stuck? We'll look for it.
That's the processing loop, and it's in the run() function of the peer:
Ok, now it's tricky. There are many peer objects, so we will have one goroutine for each. We need to find the right one.
We have the address of the object from the previous stack trace:
github.com/ethereum/go-ethereum/p2p.(*Peer).Disconnect(0xc0032b98c0, 0x0)
And the address is this one:
0xc0032b98c0
We'll look for the address in the logs, and we find it:
goroutine 9357 [chan receive, 4 minutes]:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0004ef208, 0x14db580, 0xc003d67680, 0xc00046d760)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/event/feed.go:133 +0x110
github.com/ethereum/go-ethereum/p2p.(*Server).runPeer(0xc0004ef080, 0xc0032b98c0)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:1053 +0x1e9
Ok, so it's not the peer that is stuck. We are not actually running the peer loop, as the goroutine handling this is stuck at Feed.Send, which should not really be blocking.
So that's where we go next.
Let's pull the specific feed address:
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0004ef208, 0x14db580, 0xc003d67680, 0xc00046d760)
therefore the address is:
0xc0004ef208
Ok, many, many goroutines stuck on Send, so let's have a look at what Send does:
All stuck on sendLock, so the send lock is held.
We need to find another stack trace that is not stuck there, as that means someone is holding the lock:
Ok, this took a bit of time but:
goroutine 5758 [select, 4 minutes]:
reflect.rselect(0xc006111a00, 0x2, 0x2, 0x17e9ff1, 0xe)
#011/usr/local/go/src/runtime/select.go:542 +0x38a
reflect.Select(0xc001a71080, 0x2, 0x3, 0x14db580, 0xc000475c80, 0x16, 0x1, 0x0)
#011/usr/local/go/src/reflect/value.go:2229 +0x170
github.com/ethereum/go-ethereum/event.(*Feed).Send(0xc0004ef208, 0x14db580, 0xc000475c80, 0x27d9880)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/event/feed.go:170 +0x594
github.com/ethereum/go-ethereum/p2p.(*Server).runPeer(0xc0004ef080, 0xc0037d56e0)
#011/go/src/github.com/status-im/status-go/vendor/github.com/ethereum/go-ethereum/p2p/server.go:1064 +0x3f5
created by github.com/ethereum/go-ethereum/p2p.(*Server).run
Here's our goroutine that is hanging.
Ok, this looks really weird: it is using the reflect package to Select on a bunch of cases.
Unfortunately this stack trace ends here, and we don't know what the contents of that cases array are, as they are not printed in the stack trace.
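For context, this is roughly what a reflect-based select looks like (an illustrative sketch, not the actual feed.go code); the dynamically built cases slice is exactly the part a stack trace cannot show us:

```go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	ready := make(chan string, 1) // has buffer space, so a send can proceed
	wedged := make(chan string)   // unbuffered with no reader, a send would block

	// Build the select cases at runtime, the way Feed.Send builds its send cases.
	ev := reflect.ValueOf("event")
	cases := []reflect.SelectCase{
		{Dir: reflect.SelectSend, Chan: reflect.ValueOf(ready), Send: ev},
		{Dir: reflect.SelectSend, Chan: reflect.ValueOf(wedged), Send: ev},
	}

	// reflect.Select blocks until one case can proceed. If every channel in
	// cases is full or unread, it blocks forever, which is what we see here.
	chosen, _, _ := reflect.Select(cases)
	fmt.Println("delivered via case", chosen)
}
```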
We will have a look at the producers that are adding select statements and understand where those might be blocking.
To be continued.....
Notes (same content): https://notes.status.im/mailserver-stuck-error-logs
Ok, back where we left off.
We need to look at how things get added to sendCases.
We can see that it's a bit indirect: first they get added to the feed's inbox, and on Send they are eventually moved into sendCases.
Here they get added, which is in the Subscribe function call:
And eventually they make their way to sendCases:
So the code that hangs is some code that at some point calls Subscribe and then never reads from that channel.
This looks like extremely flimsy code, but what do I know.
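A minimal sketch of how that wedges everything (using the vendored event package; EnvelopeEvent here is just a stand-in type): one subscriber that stops reading blocks Feed.Send, and with it every other subscriber.

```go
package main

import (
	"fmt"
	"time"

	"github.com/ethereum/go-ethereum/event"
)

// EnvelopeEvent is a stand-in for the real event type carried on the feed.
type EnvelopeEvent struct{ Hash string }

func main() {
	var feed event.Feed

	// A well-behaved subscriber: buffered and continuously drained.
	good := make(chan EnvelopeEvent, 100)
	feed.Subscribe(good)
	go func() {
		for range good {
		}
	}()

	// A "flimsy" subscriber: it subscribes but never reads.
	bad := make(chan EnvelopeEvent, 1)
	feed.Subscribe(bad)

	// The first Send fills bad's buffer; the second blocks forever, because
	// Feed.Send only returns once every subscriber has received the value.
	feed.Send(EnvelopeEvent{Hash: "0x01"})

	done := make(chan struct{})
	go func() {
		feed.Send(EnvelopeEvent{Hash: "0x02"})
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second Send delivered to everyone")
	case <-time.After(time.Second):
		fmt.Println("second Send is stuck, just like runPeer in the stack trace")
	}
}
```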
Ok, these seem to be the places where we use event.Feed:
api/nimbus_backend.go
api/geth_backend.go
vendor/github.com/ethereum/go-ethereum/p2p/peer.go
vendor/github.com/ethereum/go-ethereum/p2p/server.go
vendor/github.com/ethereum/go-ethereum/p2p/metrics.go
vendor/github.com/ethereum/go-ethereum/p2p/message.go
vendor/github.com/ethereum/go-ethereum/core/blockchain.go
vendor/github.com/ethereum/go-ethereum/core/tx_pool.go
vendor/github.com/ethereum/go-ethereum/accounts/manager.go
vendor/github.com/ethereum/go-ethereum/accounts/scwallet/hub.go
vendor/github.com/ethereum/go-ethereum/accounts/keystore/keystore.go
vendor/github.com/ethereum/go-ethereum/accounts/usbwallet/hub.go
vendor/github.com/ethereum/go-ethereum/light/txpool.go
vendor/github.com/ethereum/go-ethereum/light/lightchain.go
vendor/github.com/ethereum/go-ethereum/eth/downloader/peer.go
whisper/whisper.go
waku/waku.go
services/wallet/service.go
services/wallet/transmitter.go
services/wallet/reactor.go
services/wallet/service_test.go
services/wallet/commands_test.go
services/wallet/commands.go
services/accounts/service.go
services/accounts/accounts.go
services/local-notifications/core_test.go
services/local-notifications/core.go
This will take a while to figure out, but the plan is to look at those places, find the channel, find the consumer, and check on the logs if there's anything stuck consuming that channel. Fun time ahead.
Ok, a lot of these are either in the go-ethereum codebase, where I'll assume the bug isn't, since more people would have reported the issue, or in services that are not started on normal nodes. So I will start investigating from our own code, namely whisper/waku.
Here's where it gets called:
And here is where we subscribe to the events; let's look for a consumer of that channel.
Many :(
eth-node/bridge/geth/waku.go
eth-node/bridge/geth/whisper.go
eth-node/bridge/nimbus/whisper.go
eth-node/types/waku.go
eth-node/types/whisper.go
whisper/whisper.go
waku/waku.go
cmd/node-canary/main.go
services/shhext/api_nimbus.go
services/shhext/service.go
services/shhext/api_geth.go
services/shhext/service_nimbus.go
services/wakuext/api.go
services/ext/mailrequests.go
services/ext/mailservers/connmanager.go
services/ext/mailservers/connmonitor.go
protocol/transport/whisper/envelopes.go
protocol/transport/waku/envelopes.go
protocol/transport/waku/waku_service.go
Oh hello friend:
events := make(chan wakucommon.EnvelopeEvent, 100) // must be buffered to prevent blocking whisper
Someone has been here already. Clearly it's not this subscription, but yes, you are damn right it should be buffered.
Mmh, friend is wrong. The buffer could still fill up and it would still block! That's a problem with the producer: it should never block, so the line above might still be causing the issue.
:facepalm:
https://gobyexample.com/non-blocking-channel-operations That's what the producer should do, or at least time out on the write; you can't block forever just because a consumer is stuck, come on.
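Something along these lines on the producer side (a sketch with hypothetical helper names, not status-go code) turns a stuck consumer into a dropped-event problem instead of a wedged-server problem:

```go
package main

import (
	"log"
	"time"
)

// Event is a stand-in for wakucommon.EnvelopeEvent.
type Event struct{ Topic string }

// trySend drops the event if the consumer's buffer is full (non-blocking send).
func trySend(ch chan<- Event, ev Event) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

// sendWithTimeout gives the consumer a bounded amount of time before giving up,
// so a stuck consumer can never wedge the producer forever.
func sendWithTimeout(ch chan<- Event, ev Event, d time.Duration) bool {
	select {
	case ch <- ev:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	events := make(chan Event, 1) // tiny buffer to demonstrate the full case
	events <- Event{Topic: "filler"}

	if !trySend(events, Event{Topic: "dropped"}) {
		log.Println("buffer full, event dropped instead of blocking")
	}
	if !sendWithTimeout(events, Event{Topic: "late"}, 100*time.Millisecond) {
		log.Println("consumer did not drain in time, giving up")
	}
}
```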
https://github.com/status-im/status-go/pull/2132 fixes some of the issues identified: we leak a lot of goroutines. It might not be the cause of this issue (though never say never, as it touches that part of the code), but it needs fixing in any case.
We thought it was fixed, but we still see some hosts becoming unreachable: https://media.discordapp.net/attachments/626045898878877696/804351174668058624/unknown.png
Clearly it has reached the timed-out stage where it doesn't accept new peers or drop old ones:
What's interesting is the jump in goroutines after the flatline starts:
We can also see a big spike in container CPU usage at 13:48:00, when the peer flatline starts:
And also a drop to zero for container memory page faults:
Jump in allocated TCP sockets:
I've enabled DEBUG logs on all 01 waku nodes in eth.prod because the INFO level gives me nothing useful.
Stack trace from stuck node-08.gc-us-central1-a.eth.prod: statusd_stack.log
node-06.gc-us-central1-a.eth.prod got stuck again. From the stack trace we should hopefully be able to pinpoint which goroutine is not reading from the Go channel.
I have attached a list of the subscribers to the channel; there are not many, so we should be able to identify the culprit. subscribers-2.txt
Canaries have been green ever since three changes were merged:
peerpool
All three of those combined have resulted in stable status-go nodes:
And it's been green for over two days:
So I consider this resolved.
We've seen an issue where, after some time in operation, a status-go node stops accepting new connections. This example is node-08.gc-us-central1-a.eth.prod. We can also see its p2p_peers_absolute metric flatline: