iron-fish / ironfish

A novel cryptocurrency focused on privacy and accessibility.
https://ironfish.network
Mozilla Public License 2.0

Improve the calculation for the number of workers in the worker pool #5041

Closed · andiflabs closed this 3 months ago

andiflabs commented 3 months ago

Summary

This commit changes the default value for nodeWorkersMax to the number of CPU cores, minus one, capped at 16. Before this change, nodeWorkersMax was hardcoded to 6.
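For reference, a minimal sketch of the new default calculation (the function name is illustrative, not the actual Iron Fish code; physical core detection is left as an input because Node's standard APIs report logical CPUs):

```typescript
// Sketch of the new nodeWorkersMax default (illustrative, not the actual
// implementation). `physicalCores` must come from a platform-specific
// lookup, since os.cpus() reports logical CPUs, not physical cores.
function defaultNodeWorkersMax(physicalCores: number): number {
  // Leave one core for the main node process, and never exceed 16 workers.
  return Math.min(Math.max(physicalCores - 1, 1), 16)
}
```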

As a consequence of this change, most users with an empty configuration will observe a larger number of workers spawned. Users who run on systems with SMT enabled may instead see a smaller number of workers, since the default is now based on physical cores rather than logical CPUs.

For example: on a system with 2 cores, with SMT enabled (4 logical CPUs in total), and an empty configuration: before this change, the number of workers would have been 3 (logical CPUs minus one, below the old cap of 6); after this change, it will be 1 (physical cores minus one).

The rationale behind using the number of CPU cores rather than the number of logical CPUs is that the worker pool should mostly be used for CPU-bound and CPU-intensive tasks. Such workloads gain little from SMT, because sibling logical CPUs on the same core share execution units, so spawning one worker per logical CPU tends to oversubscribe the machine.

This commit also changes the default value for nodeWorkers to be based on the number of processing units available to the process, rather than the number of logical CPUs on the system. The two can differ when the process is restricted to a subset of CPUs, for example through container limits or CPU affinity.
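In Node, the number of processing units available to the process can be read with os.availableParallelism() (available since Node 18.14); a short sketch of how it differs from counting logical CPUs:

```typescript
import os from 'node:os'

// Logical CPUs visible on the machine.
const logicalCpus = os.cpus().length

// Processing units actually available to this process; unlike
// os.cpus().length, this honors CPU affinity restrictions.
const available = os.availableParallelism()

console.log({ logicalCpus, available })
```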

Some examples to understand the impact of this change:
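As a hedged sketch (the helper names are illustrative, and the old-default formula of logical CPUs minus one, capped at 6, is inferred from the description above):

```typescript
// Illustrative comparison of old vs. new defaults (assumptions: old default
// = min(logicalCpus - 1, 6); new default = min(max(cores - 1, 1), 16)).
function oldDefault(logicalCpus: number): number {
  return Math.min(Math.max(logicalCpus - 1, 1), 6)
}
function newDefault(physicalCores: number): number {
  return Math.min(Math.max(physicalCores - 1, 1), 16)
}

// Machine shapes: [physical cores, logical CPUs with SMT enabled]
const shapes: Array<[number, number]> = [
  [2, 4],   // old: 3, new: 1 (fewer workers)
  [8, 16],  // old: 6, new: 7 (slightly more)
  [32, 64], // old: 6, new: 16 (hits the new cap)
]
for (const [cores, logical] of shapes) {
  console.log(`${cores}c/${logical}t: old=${oldDefault(logical)} new=${newDefault(cores)}`)
}
```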

Testing Plan

Documentation

Does this change require any updates to the Iron Fish Docs (e.g. the RPC API Reference)? If yes, link a related documentation pull request for the website.

[ ] Yes

Breaking Change

Is this a breaking change? If yes, add notes below on why this is breaking and label it with breaking-change-rpc or breaking-change-sdk.

[ ] Yes
NullSoldier commented 3 months ago

This PR looks good, but I want to add some context for discussion, because I know node runners who won't be able to start their nodes after this change may come here with questions. Technically this PR makes perfect sense, but business-wise there are some considerations.

First, users have a wide variety of configurations. We have users with high memory/low CPU, high CPU/low memory, and low memory/low CPU. During our testnet phases we required users to run nodes to accumulate testnet activity points. This means that during our testnet phase, we had a large range of low-end rented VPSes running Iron Fish nodes, which probably had very little memory and often many more cores. On top of that, there were a few users with 100-core machines running nodes who would brag and show off their htop.

Now onto the next consideration: Node is a memory-hungry beast. Unfortunately, because JS is single-threaded, if you want parallelism you need a worker pool of forked Node processes. That means our worker pool essentially spawns a new Node runtime for each worker, and each new worker takes 100+ MB. If you have 10 physical cores, you'll need enough memory for the main node process plus 9 workers.
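To make that concrete, a back-of-the-envelope estimate (the ~100 MB per worker figure comes from the paragraph above; the helper and the new-default formula are illustrative):

```typescript
// Rough memory estimate for the worker pool, using the ~100 MB/worker
// figure mentioned above (illustrative; actual usage varies).
const MB_PER_WORKER = 100

function poolMemoryMB(physicalCores: number): number {
  // New default: physical cores minus one, capped at 16.
  const workers = Math.min(Math.max(physicalCores - 1, 1), 16)
  return workers * MB_PER_WORKER
}

// A 10-core machine spawns 9 workers: ~900 MB for the pool alone,
// before counting the main node process.
console.log(poolMemoryMB(10)) // 900
```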

Now back to the original purpose of nodeWorkersMax. During our testnet we had many users who simply could not start their nodes when we increased the amount of parallelism, because their systems would scale up and unexpectedly run out of memory. We introduced the cap as a trade-off: we decided, somewhat arbitrarily, that 6 workers was enough.

The reason we wanted to cap it was to make most users' nodes run without crashing, and because we wanted developers to develop with an understanding of what it felt like to run a low-end or mid-range node and have sympathy for the user. We found that when we capped the node's performance to an average user's machine, engineers contributed optimizations much more frequently. We also no longer received reports of OOM crashes on startup caused by the worker pool spawning workers.

So this might not be a problem anymore, but I think the PR should still address the original reason nodeWorkersMax existed: make the case either that the cap isn't needed anymore or that our end users won't be impacted by this change. I think some may no longer be able to start their nodes without tweaking the configuration, and we can decide that's OK.