bkchr closed this issue 1 year ago
I'm the reporter of the issue; please find the two graphs attached. CPU load itself is roughly the same:
Thread usage seems to spike to around 1k threads when the polkadot node is in the active set and producing blocks.
@Nexus2k Thanks for the graphs! Sorry for the dumb question, but what tool do you use to collect the statistics?
Do you have the same thread count graph for v0.9.33 to better understand the difference?
@dmitry-markin that should be Grafana.
See Tuesday, when it started to grow. We released on Tuesday, so that should be the point when he upgraded.
@dmitry-markin I use https://checkmk.com/ to monitor the few dozen servers I run for Substrate blockchains.
Here's the graph for the last 8 days; the only thing that changed, to my knowledge, was the polkadot release upgrade:
The node that produces this graph is somewhat permanently active in Kusama and became active in Polkadot on Tuesday around 16:30 CET. The release upgrade happened at 14:30 CET due to automatic upgrading whenever a new docker image is published.
The node that produces this graph is somewhat permanently active in Kusama and became active in Polkadot ...
Do you mean you run two nodes on the same server, and the number of threads is the total number of threads in the system?
I don't have access to any validator nodes, but would expect the number of polkadot threads to be on the order of 50-100 (I just measured 42 during syncing on v0.9.36 and 54 on v0.9.33).
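(Side note for anyone reproducing the measurement: on Linux, each thread of a process appears as an entry under /proc/<pid>/task, so counting those entries gives the same number as the NLWP column from ps. A minimal Rust sketch, with a placeholder PID you would replace with the actual polkadot PID:)

use std::fs;

// Count the threads of a process by listing /proc/<pid>/task.
// This matches the NLWP column reported by `ps -O nlwp`.
fn thread_count(pid: u32) -> std::io::Result<usize> {
    Ok(fs::read_dir(format!("/proc/{pid}/task"))?.count())
}

fn main() -> std::io::Result<()> {
    let pid = 3442; // placeholder PID; substitute the polkadot PID from `ps`
    println!("process {pid} has {} threads", thread_count(pid)?);
    Ok(())
}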
Correct, it's a rather beefy machine and can handle multiple chains in parallel. And yes, it's the overall system thread count, which was rather flat before the upgrade, as you can see. I've rebooted the system for good measure, just to make sure it's not a problem with any other component.
Here's a thread graph of another system that also shows anomalies after the upgrade:
That one was not active during the upgrade but became active in Kusama at 19:28 CET.
@Nexus2k Could you get the number of polkadot threads during the anomaly, something like ps -e -O nlwp | grep polkadot?
# ps -e -O nlwp | grep polkadot
3442 84 S ? 01:36:12 /usr/bin/polkadot --name=🍁 HIGH/STAKE 🥩 | HEL1-KSM -lsync=warn,afg=warn,babe=warn --chain=kusama --validator --rpc-methods=Unsafe --rpc-external --rpc-cors=all --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<REDACTED>/tcp/30334 --prometheus-external --no-mdns --pruning=1000 --telemetry-url=wss://telemetry.polkadot.io/submit/ 1 --telemetry-url=wss://telemetry-backend.w3f.community/submit/ 1
3767 85 S ? 01:18:52 /usr/bin/polkadot --name=🍁 HIGH/STAKE 🥩 | HEL1-DOT -lsync=warn,afg=warn,babe=warn --chain=polkadot --validator --rpc-methods=Unsafe --rpc-external --rpc-cors=all --listen-addr=/ip4/0.0.0.0/tcp/30333 --public-addr=/ip4/<REDACTED>/tcp/30333 --prometheus-external --no-mdns --pruning=1000 --telemetry-url=wss://telemetry.polkadot.io/submit/ 1 --telemetry-url=wss://telemetry-backend.w3f.community/submit/ 1
4326 44 S ? 00:02:11 /usr/bin/polkadot prepare-worker /tmp/pvf-host-prepareK3WuNH4EJm
4351 362 S ? 00:00:29 /usr/bin/polkadot execute-worker /tmp/pvf-host-executeBFOEmXjdbI
4352 562 S ? 00:02:26 /usr/bin/polkadot execute-worker /tmp/pvf-host-executetYaDNMqDdJ
It looks like the high thread count is caused not by networking code, but by the PVF subsystem, which I'm unfamiliar with. @bkchr do you know who would be better to assign the issue to?
Ahh it just aggregates all the threads of all running polkadot processes.
There is currently some rework going on around the PVF worker which should solve this. I'm going to close this, as it's nothing problematic.
However, thank you @dmitry-markin for looking into it that fast, and ty @Nexus2k for your help here :)
Ping @m-cat as you are the one rewriting the pvf worker currently. As I said above, I don't think that we need to take any special action here.
Thanks for the ping @bkchr. I wasn't aware that the thread count was so high, so it's good that we will be closing them properly in https://github.com/paritytech/polkadot/pull/6419.
For background, most of these threads are sleeping, which is why the CPU usage doesn't change. We don't currently kill them because killing threads is unsafe. But the PR above changes it so that we signal the thread to finish on its own, when the PVF job finishes.
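(For readers unfamiliar with the pattern, here is a minimal sketch of cooperative shutdown, i.e. signalling a thread to finish on its own instead of killing it. This is illustrative only, assuming a plain std channel as the shutdown signal; it is not the actual code from the PR:)

use std::sync::mpsc::{channel, RecvTimeoutError};
use std::thread;
use std::time::Duration;

fn main() {
    // Channel used purely as a shutdown signal for the worker thread.
    let (shutdown_tx, shutdown_rx) = channel::<()>();

    let worker = thread::spawn(move || loop {
        // Do a slice of work, then check whether we were asked to stop.
        match shutdown_rx.recv_timeout(Duration::from_millis(100)) {
            // Signal received (or sender dropped): exit the loop cleanly.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            // No signal yet: keep working.
            Err(RecvTimeoutError::Timeout) => { /* ...job-like work... */ }
        }
    });

    // When the job finishes, signal the thread and join it so it does not
    // linger as a sleeping thread in the process.
    let _ = shutdown_tx.send(());
    worker.join().unwrap();
}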
People report that since switching to Polkadot 0.9.36 from 0.9.33 (the last node release) they see much higher thread usage. They report that the CPU usage didn't increase.
I think this may be related to the switch to tokio in libp2p. If that is confirmed, this issue can be closed. In general, this is just an investigation to find the underlying change behind this behavior.