petersulyok / smfc

Super Micro Fan Control
GNU General Public License v3.0

Feature Request: GPU temperature/activity bias #16

Open emansom opened 1 year ago

emansom commented 1 year ago

In workstation configurations inside tower cases, workloads that are GPU-heavy but light on the CPU can lead to scenarios where the top case fans do not run at sufficient CFM for the hot air to be drawn upwards.

This happens when the CPU temperature is low while the GPU temperature is not.

The GPU (with a blower-style fan) then recycles its own pocket of hot air instead of getting help from the case fans.

To combat this, a bias of sorts could be introduced that influences the curve based on GPU temperature and/or activity.

emansom commented 1 year ago

After testing locally, I think the easiest way to tackle this is the following algorithm:

When GPU activity has peaked above 30% in the last 60 seconds, increase the minimum fan speed to at least 55%.

This drastically improves thermals, by 10 °C (from 50 °C to 40 °C under moderate GPU load), in a workstation ATX tower case.
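A minimal sketch of that rule, for illustration only. The class name `GpuActivityFloor` and its methods are hypothetical and not part of smfc; the defaults (30%, 60 seconds, 55%) are the values from the comment above, and the fan level is assumed to be a percentage.

```python
from collections import deque
import time


class GpuActivityFloor:
    """Raise the minimum fan level while recent GPU activity is high."""

    def __init__(self, threshold=30.0, window=60.0, floor=55):
        self.threshold = threshold  # GPU utilization in percent
        self.window = window        # look-back period in seconds
        self.floor = floor          # minimum fan level in percent
        self.samples = deque()      # (timestamp, utilization) pairs

    def add_sample(self, utilization, now=None):
        """Record one utilization reading (timestamp defaults to now)."""
        now = time.monotonic() if now is None else now
        self.samples.append((now, utilization))
        self._expire(now)

    def _expire(self, now):
        # Drop samples that are older than the look-back window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def apply(self, level, now=None):
        """Return the fan level, lifted to the floor if the GPU was busy."""
        now = time.monotonic() if now is None else now
        self._expire(now)
        peak = max((u for _, u in self.samples), default=0.0)
        if peak > self.threshold:
            return max(level, self.floor)
        return level
```

The level computed from the CPU curve would be passed through `apply()` before it is written to the fans, so the floor never lowers a level that is already higher.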

petersulyok commented 1 year ago

Hi @emansom, some questions came to mind on this topic:

Let me know your view on this.

emansom commented 1 year ago
  • How would you read the GPU temperature? I know only vendor-specific tools (nvidia, amd etc.) for reading temperature, but I do not know of a standard interface (like HWMON).

One way to tackle this would be to look at how nvtop implemented this and port it to a separate Python module, e.g. called python-gpustats or similar.

To reduce code duplication, it would be useful to abstract this behind a shared library written in C with Python bindings, so both projects could use the same code paths and it would be somewhat vendor-agnostic.

There may already exist a library for this. I have not searched widely, nor asked around.
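Until such a library exists, a vendor-specific stopgap could look like the sketch below. It is not how nvtop does it: it simply shells out to `nvidia-smi`, so it covers NVIDIA GPUs only and assumes `nvidia-smi` is on the PATH. The function names are hypothetical.

```python
import subprocess


def parse_gpu_stats(csv_text):
    """Parse 'temperature, utilization' CSV lines into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        # Each line describes one GPU, e.g. "45, 12".
        # A field reported as unsupported (non-numeric) raises ValueError.
        temp, util = (field.strip() for field in line.split(","))
        stats.append({"temperature": int(temp), "utilization": int(util)})
    return stats


def read_nvidia_gpu_stats():
    """Query temperature (C) and utilization (%) of all NVIDIA GPUs."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)
```

Keeping the parsing separate from the subprocess call means another vendor's tool could be added later behind the same list-of-dicts shape.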

emansom commented 1 year ago
  • The biggest bottleneck is that IPMI has only two zones defined (CPU and HD). I assume you would like to use CPU for this purpose, right?

Given the multitude of case configurations, I think the zones should be configurable, defaulting to all zones for optimal airflow, as increasing all zones by GPU load percentage and GPU temperature would result in the best temperatures.

Some users may prefer lower noise, however, so this increase should be configurable.

  • Do you think multiple GPUs should be supported?

I think it should loop over all GPUs in the system, with the bias taking effect if any of them shows load or a higher temperature (not if-else based, just addition-based math).
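To make the addition-based idea concrete, here is a rough sketch. All names and numbers (`gpu_bias`, `biased_level`, `load_gain`, `temp_gain`, `temp_base`, `max_bias`) are assumptions for illustration, not existing smfc options; each GPU is represented as a dict with `temperature` (°C) and `utilization` (%).

```python
def gpu_bias(gpus, load_gain=20.0, temp_gain=0.5, temp_base=40.0,
             max_bias=40.0):
    """Sum a per-GPU bias (in fan-level percent) over all GPUs."""
    bias = 0.0
    for gpu in gpus:
        # Load contribution: scales linearly with utilization (0..100 %).
        bias += load_gain * gpu["utilization"] / 100.0
        # Temperature contribution: only degrees above temp_base count.
        bias += temp_gain * max(0.0, gpu["temperature"] - temp_base)
    # Cap the total so several busy GPUs cannot add an unbounded bias.
    return min(bias, max_bias)


def biased_level(cpu_level, gpus, **kwargs):
    """Add the GPU bias to the CPU-curve level, clamped to 100 %."""
    return min(100.0, cpu_level + gpu_bias(gpus, **kwargs))
```

An idle GPU contributes nothing, and every busy or warm GPU adds its share, so no per-GPU branching is needed. The gains and the cap are the knobs a user who prefers lower noise would turn down.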