input-output-hk / jormungandr

privacy voting blockchain node
https://input-output-hk.github.io/jormungandr/
Apache License 2.0

Repeated pattern of CPU consumption increase followed by desync event #1543

Open johnalotoski opened 4 years ago

johnalotoski commented 4 years ago

Describe the bug

With commit https://github.com/input-output-hk/jormungandr/commit/b45d23555ad0a986aa6accfbf3ac25609a5f5407, the TCP ACCEPT queue overflow is no longer a problem and there is no longer a significant number of sockets in the CLOSE_WAIT state. However, this build suffers from a repeated pattern of CPU consumption increase followed by a desync event, as seen in the following image:

image

Providing a more powerful machine (more physical cores) still yields the same pattern. Basically, the serverX threads each pull an increasing amount of CPU from their logical core, then RECV (the Recv-Q column) backs up as seen in netstat -tn, and jormungandr desyncs/disconnects. This doesn't happen instantly: it takes a few minutes with maxConns at 256, and about 30 minutes with maxConns at ~70.
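For anyone who wants to watch for the backlog rather than eyeball it, here is a minimal sketch that filters `netstat -tn` output for sockets with a large receive queue. It assumes the usual Linux net-tools column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the `Backlogged` struct and `backlogged_sockets` function are hypothetical names for illustration and are not part of jormungandr.

```rust
/// One TCP row from `netstat -tn` whose receive queue is backed up.
/// (Hypothetical helper type, not part of jormungandr.)
#[derive(Debug, PartialEq)]
pub struct Backlogged {
    pub recv_q: u64,
    pub local: String,
    pub remote: String,
    pub state: String,
}

/// Parse `netstat -tn` output and return the TCP rows whose Recv-Q is at
/// least `min_recv_q` bytes.
/// Assumed column layout (Linux net-tools):
/// Proto Recv-Q Send-Q Local-Address Foreign-Address State
pub fn backlogged_sockets(netstat_output: &str, min_recv_q: u64) -> Vec<Backlogged> {
    netstat_output
        .lines()
        .filter_map(|line| {
            let cols: Vec<&str> = line.split_whitespace().collect();
            // Skip the banner, the header row, and anything that is not TCP.
            if cols.len() < 6 || !cols[0].starts_with("tcp") {
                return None;
            }
            // Rows with a non-numeric Recv-Q are ignored rather than treated as errors.
            let recv_q: u64 = cols[1].parse().ok()?;
            if recv_q < min_recv_q {
                return None;
            }
            Some(Backlogged {
                recv_q,
                local: cols[3].to_string(),
                remote: cols[4].to_string(),
                state: cols[5].to_string(),
            })
        })
        .collect()
}
```

Feeding it the captured output of `netstat -tn` every few seconds and logging the number of rows returned should show the queue growing in the minutes before a desync, if the pattern described above holds.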

jcli 0.8.5 (nix-build-b45d235, debug, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]
jormungandr 0.8.5 (nix-build-b45d235, debug, linux [x86_64]) - [rustc 1.38.0 (625451e37 2019-09-23)]

I can provide a gdb backtrace of threads from a debug build when CPU consumption is high if that would help.

michaeljfazio commented 4 years ago

This is the single most important bug to fix currently in Jormungandr. I believe it is most likely a result of unbounded task queues being fed to the tokio async runtime, or possibly a build-up of tasks that enter an infinite loop and starve all others of CPU time. Either way, core services end up starved and the node loses sync.
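To illustrate the unbounded-queue hypothesis (this is not jormungandr's code, and it uses std channels as a stand-in for tokio's queues), here is a minimal sketch comparing a producer that outpaces its consumer on an unbounded queue versus a bounded one. The function names are hypothetical.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Push `n` messages into an unbounded queue that nobody drains meanwhile.
/// Every send succeeds, so the pending backlog grows to `n`.
pub fn unbounded_backlog(n: usize) -> usize {
    let (tx, rx) = channel::<usize>();
    for i in 0..n {
        tx.send(i).expect("receiver is still alive");
    }
    drop(tx); // close the channel so the drain below terminates
    rx.iter().count() // number of messages that were left pending
}

/// Same producer against a bounded queue with `capacity` slots.
/// Returns (accepted, rejected): once the queue is full, `try_send`
/// fails immediately and the producer can shed load instead of queueing.
pub fn bounded_backlog(n: usize, capacity: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<usize>(capacity);
    let mut rejected = 0;
    for i in 0..n {
        match tx.try_send(i) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => rejected += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    drop(tx);
    (rx.iter().count(), rejected)
}
```

With the unbounded queue the backlog equals whatever the peers push in, which would be consistent with CPU and memory climbing as connections are added. With the bounded queue the backlog is capped at `capacity` and the excess is rejected at the edge. Whether this is what actually happens inside the node is still only a guess.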

As a consequence, node operators are continually restarting, which is almost certainly putting additional strain on bootstrap nodes.