raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

CM4 IO Board EMC2301 Fan controller driver (emc2305) issue: pwm can rise but cannot decrease #5681

Open julienrobin28 opened 10 months ago

julienrobin28 commented 10 months ago

Describe the bug

I successfully enabled the Compute Module 4 IO Board's embedded fan controller and RTC from config.txt using the following lines:

dtparam=i2c_arm=on
dtoverlay=i2c-rtc,pcf85063a,i2c_csi_dsi
dtoverlay=i2c-fan,emc2301,i2c_csi_dsi

The fan controller is then accessible from sysfs: the current fan speed can be read from /sys/class/hwmon/hwmon2/fan1_input, and the PWM value (used by 4-wire fans) can be set and read back through /sys/class/hwmon/hwmon2/pwm1.
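The hwmon index (hwmon2 here) is not guaranteed to be the same on every system or boot. A minimal sketch for resolving the directory through its name attribute instead; the function name find_hwmon and its optional root argument are made up for illustration, and the name "emc2305" is assumed from the prefix that sensors reports for this chip:

```shell
# Sketch: print the hwmon directory whose "name" attribute matches $1.
# find_hwmon is a hypothetical helper; $2 (search root) defaults to the
# real sysfs location and only exists to make the function testable.
find_hwmon() (
    want=$1
    root=${2:-/sys/class/hwmon}
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] || continue
        if [ "$(cat "$dir/name")" = "$want" ]; then
            printf '%s\n' "$dir"
            exit 0
        fi
    done
    # No matching hwmon directory found.
    exit 1
)

# Example (on the target system):
#   HWMON=$(find_hwmon emc2305) && cat "$HWMON/pwm1"
```

The paths in the rest of this report keep using hwmon2, as that is what the system showed.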

However: the bug

It is only possible to increase the value of pwm1. For example, you can go from 0 to 102 (the fan accelerates, and reading the file back confirms the value is 102), but you then can't go back to 0 (if you try, the fan won't slow down, and reading the file back shows the value is still 102).

You can then go to 255 successfully (fan accelerates even more) then, you won't be able to go back to less than 255.

Unless you are fast enough!

When you go from 102 to 255, the fan accelerates. If you go back to 102 less than 1 second later, the fan decelerates and you are back at 102. The longer you wait, the less likely it is to work: sometimes 3 seconds is still OK, sometimes 2 seconds is already too late. However, even after successfully going back to 102, you can't go back to 0 (if 102 has been set for too long, it has become the new minimum value).

Resetting driver unlocks the value back to 0

When the fan's PWM is stuck at 255, for example, you can run the following commands:

echo "Resetting driver..."
modprobe -r emc2305
sleep 0.5
modprobe emc2305
echo "Done."

The value is back to 0. But if you raise it, the same issue will be back.

Steps to reproduce the behaviour

I made a little script to find out how many seconds are needed for the current value to be locked as minimum value. This may be used to reproduce the issue (even with no fan connected, but you'll definitely need an emc2301!)

#!/bin/sh

echo "Resetting driver..."

modprobe -r emc2305
sleep 0.5
modprobe emc2305
sleep 0.5

echo "Done. Current pwm state:"
cat /sys/class/hwmon/hwmon2/pwm1

sleep 2

echo "About to do 1s test..."
sleep 1
echo 255 > /sys/class/hwmon/hwmon2/pwm1
sleep 1
echo 0 > /sys/class/hwmon/hwmon2/pwm1
echo "Done. Current pwm state:"
cat /sys/class/hwmon/hwmon2/pwm1

sleep 2

echo "About to do 2s test..."
sleep 1
echo 255 > /sys/class/hwmon/hwmon2/pwm1
sleep 2
echo 0 > /sys/class/hwmon/hwmon2/pwm1
echo "Done. Current pwm state:"
cat /sys/class/hwmon/hwmon2/pwm1

sleep 2

echo "About to do 3s test..."
sleep 1
echo 255 > /sys/class/hwmon/hwmon2/pwm1
sleep 3
echo 0 > /sys/class/hwmon/hwmon2/pwm1
echo "Done. Current pwm state:"
cat /sys/class/hwmon/hwmon2/pwm1

sleep 2

echo "About to do 4s test..."
sleep 1
echo 255 > /sys/class/hwmon/hwmon2/pwm1
sleep 4
echo 0 > /sys/class/hwmon/hwmon2/pwm1
echo "Done. Current pwm state:"
cat /sys/class/hwmon/hwmon2/pwm1

echo "Over."

Example of output:

Resetting driver...
Done. Current pwm state:
0
About to do 1s test...
Done. Current pwm state:
0
About to do 2s test...
Done. Current pwm state:
255
About to do 3s test...
Done. Current pwm state:
255
About to do 4s test...
Done. Current pwm state:
255
Over.
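The four timed tests in the script above follow one pattern, so they can be condensed into a single helper. This is only a sketch: pwm_lock_test is a made-up name, and the error redirection assumes the refused write might return an error, which the output above does not show either way.

```shell
# Sketch: write 255, wait "delay" seconds, write 0, then print the value
# read back. 0 means the decrease was accepted; 255 means it was refused.
# pwm_lock_test is a hypothetical helper, not part of any package.
pwm_lock_test() (
    pwm=$1
    delay=$2
    echo 255 > "$pwm"
    sleep "$delay"
    # The write may be refused; ignore a possible error and read back.
    echo 0 > "$pwm" 2>/dev/null || true
    cat "$pwm"
)

# Example (on the target system, after reloading the driver):
#   for d in 1 2 3 4; do
#       printf '%ss test: ' "$d"
#       pwm_lock_test /sys/class/hwmon/hwmon2/pwm1 "$d"
#   done
```

As in the original script, the driver is not reloaded between delays, so once a value has locked, every later test reads back 255.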

Device(s)

Raspberry Pi CM4

System

cat /etc/rpi-issue:

Raspberry Pi reference 2023-10-10
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 962bf483c8f326405794827cce8c0313fd5880a8, stage2

vcgencmd version:

Aug 10 2023 15:33:38 
Copyright (c) 2012 Broadcom
version 03dc77429335caee083e22ddc8eec09c07f12a7a (clean) (release) (start)

uname -a: Linux crobe-server-coudray 6.1.0-rpi4-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

vcgencmd bootloader_version:

2023/05/11 07:26:03
version 4fd8f1f3f7a05f7756edb1d3f15ffd7e428981f5 (release)
timestamp 1683786363
update-time 1698067897
capabilities 0x0000007f

Logs

Nothing appears in dmesg about this device.

Output of lsmod:

Module                  Size  Used by
emc2305                16384  0
[...]

Additional context

I noticed from the kernel source code that the driver involved, which seems to be emc2305.c unless I'm wrong, differs between the Raspberry Pi kernel sources and the upstream kernel sources (even when comparing the same revision number, in this case 6.1.54 from linux-stable_20231004 here, or linux-6.1.54.tar.xz from kernel.org).

This is why I prefer reporting the issue here.

Hoping this report may help!

Best regards, Julien ROBIN

pelwell commented 10 months ago

Are there other values under /sys/class/hwmon/hwmon2/?

pelwell commented 10 months ago

And can you explain what you are trying to do by dynamically changing the pwm values? Normally you would configure the thermal zones and cooling fan settings and then leave the system to do its thing.

julienrobin28 commented 10 months ago

Yes, there are other values under /sys/class/hwmon/hwmon2/

Output of ls /sys/class/hwmon/hwmon2/:

device  fan1_fault  fan1_input  name  of_node  power  pwm1  subsystem  uevent

Output of ls -an /sys/class/hwmon/hwmon2/:

total 0
drwxr-xr-x 3 0 0    0 Oct 24 20:19 .
drwxr-xr-x 3 0 0    0 Oct 24 20:19 ..
lrwxrwxrwx 1 0 0    0 Oct 24 21:31 device -> ../../../10-002f
-r--r--r-- 1 0 0 4096 Oct 24 21:31 fan1_fault
-r--r--r-- 1 0 0 4096 Oct 24 20:27 fan1_input
-r--r--r-- 1 0 0 4096 Oct 24 21:31 name
lrwxrwxrwx 1 0 0    0 Oct 24 21:31 of_node -> ../../../../../../../../../firmware/devicetree/base/soc/i2c0mux/i2c@1/emc2301@2f
drwxr-xr-x 2 0 0    0 Oct 24 21:31 power
-rw-r--r-- 1 0 0 4096 Oct 25 02:47 pwm1
lrwxrwxrwx 1 0 0    0 Oct 24 21:31 subsystem -> ../../../../../../../../../class/hwmon
-rw-r--r-- 1 0 0 4096 Oct 24 20:19 uevent

Even when pwm1 is stuck at its maximum, reading the fan1_input file still works and shows the current fan speed in RPM (which keeps updating successfully). The lm-sensors package (sensors command) also works:

emc2305-i2c-10-2f
Adapter: i2c-22-mux (chan_id 1)
fan1:        1160 RPM

cpu_thermal-virtual-0
Adapter: Virtual device
temp1:        +44.3°C  (crit = +110.0°C)

rpi_volt-isa-0000
Adapter: ISA adapter
in0:              N/A  

What I'm trying to do is use the fancontrol Debian package, whose goal is to dynamically increase or decrease the speed of my (very fast and very noisy) fan, by periodically:

The /etc/fancontrol file is created interactively by running pwmconfig, which helps identify which pwm file controls which fan, also by increasing and decreasing pwm values.

(Is there another/preferred way to change thermal zones and cooling fan settings?) For now, as a workaround, I'm using a physical potentiometer/rheostat to manually set a fixed speed for my fan.

pelwell commented 10 months ago

Any suggestions, @6by9?

6by9 commented 10 months ago

This is all from the mainline driver. We only added back in DT configuration because the mainline DT maintainers wouldn't agree a binding.

I'd guess it's the lump at https://github.com/torvalds/linux/blob/master/drivers/hwmon/emc2305.c#L411-L419. Compare to the equivalent in pwm_fan and it just validates the range but otherwise accepts the data. There is a blob of text describing what they're trying to achieve at https://github.com/torvalds/linux/blob/master/drivers/hwmon/emc2305.c#L68-L80 and that sounds reasonable enough.

julienrobin28 commented 10 months ago

Many thanks @6by9 for this information and having a look at this.

Too bad the traditional behavior isn't available as an option! But of course, this now makes more sense, even if this implementation unfortunately isn't compatible with the way fancontrol usually works with PWMs.

Just before letting you go, I have a tiny question:

Anyway, even if not ideal, this may be OK for my particular case: I can just set a reasonable value that won't be updated afterwards 👍 I'll have to set a value anyway, as the initial (minimum and current) pwm1 value at driver initialization is 0 (at this speed, the fan isn't noisy at all, but it's not cooling either 😅)

Thanks again

6by9 commented 10 months ago

I have a suspicion that it is a misbehaviour compared to that documented. I would have expected pwm1 to be adjustable such that it is just a lower limit.

I suspect your time window is down to the poll period of the thermal zone, and if the thermal zone has bumped up the speed then it's updated some other state which stops you changing the low point again.

You can disconnect the thermal zones by dropping fragments 104 & 105 in https://github.com/raspberrypi/linux/blob/rpi-6.1.y/arch/arm/boot/dts/overlays/i2c-fan-overlay.dts#L54-L82, and then I would expect the pwm1 control to set the speed directly.

julienrobin28 commented 10 months ago

By reading fragments 104 & 105 before removing them (I'll be doing this test just after), and searching for more information about what those "thermal zones" are, I've discovered the existence of the /sys/class/thermal/cooling_device0/ folder.

This is probably the thermal zone settings both of you were talking about previously!

Looks like /sys/class/thermal/cooling_device0 works in both directions

This other sysfs interface can successfully set the fan speed up and down: /sys/class/thermal/cooling_device0/cur_state (which goes from 0 to 10, according to /sys/class/thermal/cooling_device0/max_state) changes the pwm1 value in both directions.
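To see how each cur_state step translates into a pwm1 value, the states can be stepped through one by one while reading pwm1 back. A rough sketch; sweep_states is a made-up helper name, and it assumes nothing else (such as the thermal zone polling) changes cur_state during the sweep:

```shell
# Sketch: step cur_state from 0 to max_state and print pwm1 after each step.
# sweep_states is a hypothetical helper. $1 = cooling device directory,
# $2 = pwm file, $3 = settle delay in seconds (defaults to 1).
sweep_states() (
    cdev=$1
    pwm=$2
    delay=${3:-1}
    max=$(cat "$cdev/max_state")
    state=0
    while [ "$state" -le "$max" ]; do
        echo "$state" > "$cdev/cur_state"
        sleep "$delay"
        printf 'cur_state=%s pwm1=%s\n' "$state" "$(cat "$pwm")"
        state=$((state + 1))
    done
)

# Example (on the target system):
#   sweep_states /sys/class/thermal/cooling_device0 /sys/class/hwmon/hwmon2/pwm1
```

While the cooling device is still bound to thermal_zone0, the thermal framework may overwrite cur_state at its next poll, so the readings are only meaningful over a short window.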

By reading /sys/class/hwmon/hwmon2/pwm1 it turns out:

By the way: I am discovering that fancontrol isn't needed:

I guess this is what fragments 104 & 105 in i2c-fan-overlay.dts are doing: the /sys/class/thermal/cooling_device0/ folder is registered in /sys/class/thermal/thermal_zone0/ as a symbolic link (cdev0/ to ../cooling_device0/), and I found out that thermal_zone0, which is the CPU temperature, is checked periodically, so the fan speed is already being adjusted periodically.
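The cdev symlinks described above give a quick way to check whether a cooling device is currently bound to a thermal zone. A small sketch, where list_zone_cdevs is a made-up helper name:

```shell
# Sketch: list the cooling devices linked from a thermal zone directory.
# list_zone_cdevs is a hypothetical helper; the zone path defaults to
# thermal_zone0, the CPU zone mentioned above.
list_zone_cdevs() (
    zone=${1:-/sys/class/thermal/thermal_zone0}
    for link in "$zone"/cdev[0-9]*; do
        # Only the cdevN symlinks are of interest; skip regular files
        # that share the cdevN prefix.
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
)

# Example (on the target system):
#   list_zone_cdevs
```

An empty result would mean no cooling device is linked from that zone.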

By running stress-ng --matrix 0 I indeed verified that the fan speed actually increases when the CPU is getting hotter. I wasn't aware that this was already done! My CPU wasn't working hard enough for me to notice this.

I'll do the little test of disconnecting the thermal zones by dropping fragments 104 & 105 in i2c-fan-overlay.dts, and I'll keep you informed about whether the /sys/class/hwmon/hwmon2/pwm1 minimum value still gets locked after a few seconds.

julienrobin28 commented 10 months ago

So I can confirm that the pwm1 minimum value no longer locks automatically once fragments 104 & 105 have been removed from i2c-fan-overlay.dts:

What I did / what the results are:

This confirms @6by9's statement about my time window being down to the poll period of the thermal zone.

Note: After removing the 2 fragments from i2c-fan-overlay.dts, neither /sys/class/hwmon/hwmon2/ nor /sys/class/thermal/cooling_device0/ showed up anymore (and lsmod no longer listed the emc2305 driver as loaded). I found a way to manually reload the driver by typing echo "emc2301" 0x2f > /sys/bus/i2c/devices/i2c-22/new_device

Doing so, both /sys/class/hwmon/hwmon2/ and /sys/class/thermal/cooling_device0/ show up again (but cooling_device0 isn't linked from thermal_zone0 anymore, as expected for this test).

I remain available if I can do anything else; thanks again for the work.
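The manual new_device step can be wrapped so that it is skipped when the device already exists. This is only a sketch: both helper names are made up, and the check relies on I2C devices appearing in sysfs as the bus number followed by the 4-digit hex address (as with 10-002f in the listing earlier in this thread).

```shell
# Sketch: build the sysfs name of an I2C device, e.g. "22-002f".
# i2c_dev_name and instantiate_i2c are hypothetical helpers.
i2c_dev_name() {
    printf '%d-%04x\n' "$1" "$2"
}

# Instantiate a device on a bus unless it is already present.
# $1 = device name, $2 = bus number, $3 = address,
# $4 = sysfs root (defaults to the real location; only there for testing).
instantiate_i2c() (
    name=$1
    bus=$2
    addr=$3
    root=${4:-/sys/bus/i2c/devices}
    if [ -e "$root/$(i2c_dev_name "$bus" "$addr")" ]; then
        echo "already present"
    else
        echo "$name $addr" > "$root/i2c-$bus/new_device"
    fi
)

# Example (on the target system, as root):
#   instantiate_i2c emc2301 22 0x2f
```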