bkchr closed this issue 1 year ago
I'm the reporter of the issue; please find the two graphs attached. CPU load itself is roughly the same:
Thread usage seems to spike to around 1k threads when the polkadot node is in the active set and producing blocks.
@Nexus2k Thanks for the graphs! Sorry for the dumb question, but what tool do you use to collect the statistics?
Do you have the same thread count graph for v0.9.33 to better understand the difference?
@dmitry-markin that should be Grafana.
See Tuesday, when it started to grow. We released on Tuesday, so that should be the point when he upgraded.
@dmitry-markin I use https://checkmk.com/ to monitor the few dozen servers I run for Substrate blockchains.
Here's the graph for the last 8 days; the only thing that changed, to my knowledge, was the polkadot release upgrade:
The node that produces this graph is somewhat permanently active in Kusama and became active in Polkadot on Tuesday around 16:30 CET. The release upgrade happened at 14:30 CET due to automatic upgrading whenever a new docker image is published.
The node that produces this graph is somewhat permanently active in Kusama and became active in Polkadot ...
Do you mean you run two nodes on the same server, and the number of threads is the total number of threads in the system?
I don't have access to any validator nodes, but would expect the number of polkadot threads to be on the order of 50-100 (I just measured 42 during syncing on v0.9.36 and 54 on v0.9.33).
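(Side note for anyone reproducing the measurement: on Linux, each thread of a process appears as an entry under /proc/<pid>/task, so counting those entries gives the same number as the NLWP column from ps. A minimal Rust sketch, with a placeholder PID you would replace with the actual polkadot PID:)

use std::fs;

// Count the threads of a process by listing /proc/<pid>/task.
// This matches the NLWP column reported by `ps -O nlwp`.
fn thread_count(pid: u32) -> std::io::Result<usize> {
    Ok(fs::read_dir(format!("/proc/{pid}/task"))?.count())
}

fn main() -> std::io::Result<()> {
    let pid = 3442; // placeholder PID; substitute the polkadot PID from `ps`
    println!("process {pid} has {} threads", thread_count(pid)?);
    Ok(())
}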
Correct, it's a rather beefy machine and can handle multiple chains in parallel. And yes, it's the overall system thread count, which was rather flat before the upgrade, as you can see. I've rebooted the system for good measure, just to make sure it's not a problem with any other component.
Here's a thread graph of another system that also shows anomalies after the upgrade:
That one was not active during the upgrade but became active in Kusama at 19:28 CET.
@Nexus2k Could you get the number of polkadot threads during the anomaly, something like ps -e -O nlwp | grep polkadot?
# ps -e -O nlwp | grep polkadot
3442 84 S ? 01:36:12 /usr/bin/polkadot --name=🍁 HIGH/STAKE 🥩 | HEL1-KSM -lsync=warn,afg=warn,babe=warn --chain=kusama --validator --rpc-methods=Unsafe --rpc-external --rpc-cors=all --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<REDACTED>/tcp/30334 --prometheus-external --no-mdns --pruning=1000 --telemetry-url=wss://telemetry.polkadot.io/submit/ 1 --telemetry-url=wss://telemetry-backend.w3f.community/submit/ 1
3767 85 S ? 01:18:52 /usr/bin/polkadot --name=🍁 HIGH/STAKE 🥩 | HEL1-DOT -lsync=warn,afg=warn,babe=warn --chain=polkadot --validator --rpc-methods=Unsafe --rpc-external --rpc-cors=all --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<REDACTED>/tcp/30333 --prometheus-external --no-mdns --pruning=1000 --telemetry-url=wss://telemetry.polkadot.io/submit/ 1 --telemetry-url=wss://telemetry-backend.w3f.community/submit/ 1
4326 44 S ? 00:02:11 /usr/bin/polkadot prepare-worker /tmp/pvf-host-prepareK3WuNH4EJm
4351 362 S ? 00:00:29 /usr/bin/polkadot execute-worker /tmp/pvf-host-executeBFOEmXjdbI
4352 562 S ? 00:02:26 /usr/bin/polkadot execute-worker /tmp/pvf-host-executetYaDNMqDdJ
It looks like the high thread count is caused not by networking code, but by the PVF subsystem, which I'm unfamiliar with. @bkchr do you know who would be better to assign the issue to?
Ahh it just aggregates all the threads of all running polkadot processes.
There is currently some rework going on around the PVF worker which should solve this. I'm going to close this, as it's nothing problematic.
However, thank you @dmitry-markin for looking into it that fast, and ty @Nexus2k for your help here :)
Ping @m-cat as you are the one rewriting the pvf worker currently. As I said above, I don't think that we need to take any special action here.
Thanks for the ping @bkchr. I wasn't aware that the thread count was so high, so it's good that we will be closing them properly in https://github.com/paritytech/polkadot/pull/6419.
For background, most of these threads are sleeping, which is why the CPU usage doesn't change. We don't currently kill them because killing threads is unsafe. But the PR above changes it so that we signal the thread to finish on its own, when the PVF job finishes.
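(For readers unfamiliar with the pattern, here is a minimal sketch of cooperative shutdown, i.e. signalling a thread to finish on its own instead of killing it. This is illustrative only, assuming a plain std channel as the shutdown signal; it is not the actual code from the PR:)

use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

fn main() {
    // Channel used purely as a shutdown signal for the worker thread.
    let (shutdown_tx, shutdown_rx) = channel::<()>();

    let worker = thread::spawn(move || loop {
        // Do a slice of work, then check whether we were asked to stop.
        match shutdown_rx.recv_timeout(Duration::from_millis(100)) {
            // Signal received (or sender dropped): exit the loop cleanly.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            // No signal yet: keep working.
            Err(RecvTimeoutError::Timeout) => { /* ...job-like work... */ }
        }
    });

    // When the job finishes, signal the thread and join it so it does not
    // linger as a sleeping thread in the process.
    let _ = shutdown_tx.send(());
    worker.join().unwrap();
}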
People report that since switching to Polkadot 0.9.36 from 0.9.33 (the last node release) they see much higher thread usage. They report that the CPU usage didn't increase.
I think this may be related to the switch to tokio in libp2p. If that is confirmed, this issue can be closed. In general, this is just an investigation to find the underlying change behind this behavior.