markusressel / fan2go

A simple daemon providing dynamic fan speed control based on temperature sensors.
GNU Affero General Public License v3.0
234 stars 22 forks source link

Dell Server - Cycles without Settling #201

Open tjbennett opened 1 year ago

tjbennett commented 1 year ago

Describe the bug Dell Server 3930 - has 3 fans all of which can be controlled by a temperature sensor (or many sensors). Attempting to replace fancontrol with an easier, smoother method.

Install fan2go and tried to build a yaml file to represent server conditions

Use a definition of single cpu fan and representative sensor (cpu temp sensor).

In Dell 3930, /sys/class/hwmon/hwmon3/pwm1 represents as 128 even when in slow cycles. Therefore reading back a PWM setting is fraught with issues. A PWM setting should represent an RPM value and instead of reading back the PWM settings a relation to the RPM would be more representative of the PWM setting.

To Reproduce

Install fan2go, using go 1.20, and set yaml as following:

dbPath: "/etc/fan2go/fan2go.db"

runFanInitializationInParallel: false
maxRpmDiffForSettledFan: 250

tempSensorPollingRate: 2000ms
tempRollingWindowSize: 10

rpmPollingRate: 1s
rpmRollingWindowSize: 10

controllerAdjustmentTickRate: 200ms

fans:
  - id: cpu
    hwmon:
      platform: dell_smm-virtual-0
      index: 1
    neverStop: true
    curve: cpu_curve
    minPwm: 128
    startPwm: 128
    maxPwm: 255
    pwmMap:
      0 : 128
      170: 138
      255: 255

  - id: pcie
    hwmon:
      platform: dell_smm-virtual-0
      index: 2
    neverStop: true
    curve: pcie_curve
    minPwm: 128
    startPwm: 128
    maxPwm: 255
    pwmMap:
      0 : 128
      170: 138
      255: 255

  - id: psup
    hwmon:
      platform: dell_smm-virtual-0
      index: 3
    neverStop: true
    curve: psup_curve
    minPwm: 128
    startPwm: 128
    maxPwm: 255
    pwmMap:
      0 : 128
      170: 138
      255: 255

sensors:
  - id: cpu_package
    hwmon:
      platform: dell_smm-virtual-0
      index: 1

curves:
  - id: cpu_curve
    linear:
      sensor: cpu_package
      steps:
        - 40: 128
        - 50: 150
        - 80: 255

  - id: pcie_curve
    linear:
      sensor: cpu_package
      steps:
        - 40: 128
        - 50: 150
        - 80: 255

  - id: psup_curve
    linear:
      sensor: cpu_package
      steps:
        - 40: 128
        - 50: 150
        - 80: 255

  - id: case_avg_curve
    function:
      type: average
      curves:
        - cpu_curve
        - pcie_curve
        - psup_curve
  1. Start fan2go - in this instance used superuser "# go run main.go -c "fan2go.yaml""
  2. Observe system fail as writing "0" to pwm does not result in pwm reading as "0"
  3. Remove the check on matching values and get the system building the curve information
  4. Once that is done, let the system run. It will go high (RPM set to 255) then drop to low (128) in a cyclic manner, not in any way related to the temp sensor
  5. Run "stress --cpu 6" temperature rises and system does not adjust to cool system

Expected behavior Expect the system to ramp up fan speed when temperature increases, slowly ramp fan system down as system cools.

Desktop (please complete the following information):

Additional context There may be a great misunderstanding of the configuration or the intent of this program that has created this report.

markusressel commented 1 year ago

Hey @tjbennett thx for the report :rocket:

FYI: For better readability, I have removed the comments from your configuration.

Observe system fail as writing "0" to pwm does not result in pwm reading as "0"

fan2go should not fail because of this, it should issue warnings about it though.

Remove the check on matching values and get the system building the curve information

If you set minPwm and maxPwm, building the curve is (currently) without use, since the curve is only used to determine these values. If they are already set in the config, the curve is not build because it is not necessary.

Once that is done, let the system run. It will go high (RPM set to 255) then drop to low (128) in a cyclic manner, not in any way related to the temp sensor

You specify a custom pwmMap definition for all of your fans:

  minPwm: 128
  startPwm: 128
  maxPwm: 255
  pwmMap:
    0 : 128
    170: 138
    255: 255

How did you get to these values? Are there no possible PWM values other than the three you have specified?

Setting a minPwm of 128 in combination with the given pwmMap will result in the fan always beeing set to at least 170, which will be mapped to a PWM value of 138 due to your config. There might be a misunderstanding in how the pwmMap config option works. Check out #161 (caution, very long thread) to get a more in-depth understanding of what the pwmMap config option is doing.

Increasing the value of tempSensorPollingRate to a 10 times greater value than default might also play into this. Can you test if reverting this to the default 200ms fixes the cycling?

tjbennett commented 1 year ago

@markusressel Your response is helpful, I will go through the link you sent and using more knowledge attempt this again... hopefully by tomorrow.

tjbennett commented 1 year ago

So @markusressel I have done that reading and found the similar pattern in the Dell server as a PWM input/RPM output. I had completely misread/misunderstood the PWM settings and pwmMap.

The 3930 has a step functionality:

[0-64] -> idling fan, speed not truly constant [65-191] -> speed step to ~4000RPM [192-255] -> speed step to ~1490 RPM

Graph RPM v PWM ![annotated](https://user-images.githubusercontent.com/375072/218115263-31cda5e5-58df-4820-8533-bf1200bc0b4a.jpg)

The PWM value read back from /sys/class/hwmon/hwmon3/pwm1, is always 128 until the value written to pwm1 hits 192 and there is a pause for 6 seconds

I simplified the yaml file and get an "invalid argument" panic.

Graph RPM v PWM dbPath: "/etc/fan2go/fan2go.db" runFanInitializationInParallel: false maxRpmDiffForSettledFan: 250 tempSensorPollingRate: 200ms tempRollingWindowSize: 10 rpmPollingRate: 1s rpmRollingWindowSize: 10 controllerAdjustmentTickRate: 200ms fans: - id: cpu hwmon: platform: dell_smm-isa-0000 index: 1 curve: cpu_curve pwmMap: 0: 128 68: 128 192: 255 sensors: - id: cpu_package hwmon: platform: dell_smm-isa-0000 index: 1 curves: - id: cpu_curve linear: sensor: cpu_package steps: - 40: 0 - 65: 68 - 85: 192

Panic output: using sudo fan2go -c"./fan2go.yaml"`

INFO Using configuration file at: ./fan2go.yaml
INFO Gathering sensor data for cpu...
INFO Loading fan curve data for fan 'cpu'...
WARNING Fan 'cpu' has not yet been analyzed, starting initialization sequence...
INFO Using pwm map override from config...
DEBUG  Distinct PWM value targets of fan cpu: [0 192]
INFO Measuring RPM curve...
DEBUG  Fan cpu: Actual PWM value differs from requested one, skipping. Requested: 0 Actual: 128
DEBUG  Fan cpu: Actual PWM value differs from requested one, skipping. Requested: 1 Actual: 128
ERROR  Cant attach empty fan curve data to fan cpu
ERROR  Failed to attach fan curve data to fan cpu: invalid argument
INFO Fan controller for fan cpu stopped.
WARNING Cannot send notification, missing env variable 'DISPLAY'!

Panic output when running sudo fan2go fan init -c "./fan2go.yaml" -i cpu

INFO Using configuration file at: ./fan2go.yaml
INFO Using persistence at: /etc/fan2go/fan2go.db
INFO Deleting existing data for fan 'cpu'...
INFO Using pwm map override from config...
INFO Measuring RPM curve...
  ERROR  Cant attach empty fan curve data to fan cpu
  ERROR  Failed to attach fan curve data to fan cpu: invalid argument

Any thoughts on what I have missed here?

markusressel commented 1 year ago

Firstly I would suggest to set minPwm, startPwmand maxPwm values, to skip the fan curve analysis. If the fan can only be set to three states, it's not possible to accurately measure it's curve, and the algorithm will fail (apparently in a spectacular manner by panicking 😅 ).

Given your PWM map is accurate, min and start can be set to 0, max can be set to 192. Fan control should then start even with a "fresh install" (or removed fan curve).

The fact that fan2go is panicking is not great though, I will see if there is something we can do about this.