Open VA1DER opened 2 years ago
mt7623_passive_active_swap.diff.gz
--- mt7623.dtsi.orig 2022-03-04 14:26:20.883787720 -0400
+++ mt7623.dtsi 2022-03-04 15:00:06.385349555 -0400
@@ -160,17 +160,17 @@
trips {
cpu_passive: cpu-passive {
temperature = <47000>;
hysteresis = <2000>;
- type = "passive";
+ type = "active";
};
cpu_active: cpu-active {
temperature = <67000>;
hysteresis = <2000>;
- type = "active";
+ type = "passive";
};
cpu_hot: cpu-hot {
temperature = <87000>;
hysteresis = <2000>;
I checked upstream and they've upped the trip point to 57000, so it may just be better to backport this patch from 5.15 here, citing the same reasons you have.
While I prefer my solution (as passive cooling at any temperature before active just doesn't make sense), I would tend to agree that adopting upstream's solution is more sustainable.
I'll work with upstream to try and get a more optimal solution there.
@frank-w Shouldn't this be fixed in upstream Kernel DTS? The order of the trip point types doesn't make much sense as @VA1DER correctly stated...
Of course... I've sent a patch to increase the passive cooling trip to 57°C, but it was neither accepted nor commented on by MTK with a better solution. By the way, MT7622 has the same problem; there, another user sent a patch to disable CPU throttling in the lower two trips.
https://lore.kernel.org/linux-arm-kernel/20210725163451.217610-1-linux@fw-web.de/
https://lore.kernel.org/linux-arm-kernel/20210619121927.32699-1-ericwouds@gmail.com/
I made a mistake in my original report. There are 4 thermal trip points, and the "active" is indeed before "passive", but what I didn't notice is that the first three trip points (passive, active, and hot) all point to identical cooling maps shown below (trimmed for brevity):
map0 {
trip = <&cpu_passive>;
cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
map1 {
trip = <&cpu_active>;
cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
map2 {
trip = <&cpu_hot>;
cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
As you can see, each cooling map has the same cooling device, namely passive CPU scaling with no limits on how much scaling it can employ. This makes even less sense than I thought before, since the latter two trips cannot possibly have any effect.
So just switching "passive" and "active" as I originally proposed has no effect, since they both point to identical cooling maps. I suspect that early in this device's development Mediatek did contemplate active cooling support, but that it was later deleted.
We can try and get fancy and put scaling limits on the earlier trip points. Alternatively, since the .dts file implies that CPU speed throttling can be done on a per-core basis, we can also experiment with the earlier trips allowing throttling only on one or two cores. But really I think simpler is better. I personally just deleted one trip entirely and its associated cooling map and am running with passive, hot, and critical trips, with passive starting at 67°C. It's working well on my devices right now. But to disable throttling on the lower two trips is a good solution too, since their temperatures (47°C and 67°C) are not really dangerous temperatures.
I checked upstream and they've upped the trip point to 57000
Even 57°C is an inappropriately low temperature to be throttling the CPU at. It is too low to allow proper operation, it is far below any danger point, and thus upstream's solution is insufficient. I have both a BPI R2 and U7623 board, and when they are mounted in cases with normally employed wi-fi cards, the internal temperature cannot be kept under 57°C in almost any real-world circumstance.
As noted above, the CPU will be throttled all the way down to 98MHz if the temperature rises even a degree above the trip point. I have further discovered that if the internal temperature of the device is above the first trip point temperature when it boots, then it will start in a throttled state, and even
$ echo disabled > /sys/class/thermal/thermal_zone0/mode
will have no effect. There is no way I know of to manually unthrottle the CPU once it starts.
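One way to check whether the thermal framework is currently holding the CPU down is to read the cooling-device state files. The following is a minimal sketch, assuming the standard Linux thermal sysfs layout (cooling_device*/type, cur_state, max_state under /sys/class/thermal). The function name cooling_states and its optional directory argument are made up for illustration and do not come from this thread.

```shell
#!/bin/sh
# Sketch: list every thermal cooling device with its current and maximum state.
# The optional first argument is a hypothetical override of the sysfs root,
# so the function can be pointed at a test directory.
cooling_states() {
    root="${1:-/sys/class/thermal}"
    for dev in "$root"/cooling_device*; do
        # Skips the unexpanded glob when no cooling devices exist.
        [ -r "$dev/cur_state" ] || continue
        printf '%s type=%s cur_state=%s max_state=%s\n' \
            "${dev##*/}" \
            "$(cat "$dev/type" 2>/dev/null)" \
            "$(cat "$dev/cur_state")" \
            "$(cat "$dev/max_state" 2>/dev/null)"
    done
}
```

Calling cooling_states with no argument on the device prints one line per cooling device. A cur_state above 0 on a CPU frequency cooling device would suggest that throttling is in effect; whether that state can be reset by hand on these boards is not something this sketch establishes.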
My recommended solution (and what I employ on all affected Mediatek boards) is to simply delete the first two trip points and cooling maps. The throttling temperature will then be at 87°C, which is still a low enough temperature for ARM devices to not be in the real danger zone, and gives some operational headroom. The same change will be required for the MT7622 (BPI R64) board.
mt7623_thermal_zone_fix.diff.gz
--- openwrt/build_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/linux-mediatek_mt7623/linux-5.4.179/arch/arm/boot/dts/mt7623.dtsi 2022-04-01 14:55:58.491324235 -0300
+++ mt7623.dtsi 2022-04-01 14:49:57.872947359 -0300
@@ -157,22 +157,10 @@
polling-delay = <1000>;
thermal-sensors = <&thermal 0>;
trips {
- cpu_passive: cpu-passive {
- temperature = <47000>;
- hysteresis = <2000>;
- type = "passive";
- };
-
- cpu_active: cpu-active {
- temperature = <67000>;
- hysteresis = <2000>;
- type = "active";
- };
-
cpu_hot: cpu-hot {
temperature = <87000>;
hysteresis = <2000>;
type = "hot";
};
@@ -184,26 +172,10 @@
};
};
cooling-maps {
map0 {
- trip = <&cpu_passive>;
- cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
- <&cpu1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
- <&cpu2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
- <&cpu3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
- };
-
- map1 {
- trip = <&cpu_active>;
- cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
- <&cpu1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
- <&cpu2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
- <&cpu3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
- };
-
- map2 {
trip = <&cpu_hot>;
cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
<&cpu1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
<&cpu2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
<&cpu3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
Just send it upstream for both SoCs and see if you get any comments.
My workaround was to create a script at /etc/init.d/mt7623_fix containing:
echo disabled > /sys/class/thermal/thermal_zone0/mode
After that, I had a usable device.
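For reference, a workaround of this kind could be packaged as an OpenWrt init script roughly as follows. This is only a sketch under assumptions: the START=99 priority, the THERMAL_MODE_FILE variable, and the writability check are illustrative choices, not the script actually used above. Only the echo line and the script path come from this thread.

```shell
#!/bin/sh /etc/rc.common
# Sketch of /etc/init.d/mt7623_fix: disable thermal zone 0 at boot.
# START=99 (run late in boot) is an assumed priority, not taken from the thread.
START=99

# Hypothetical override so the script can be exercised against a scratch file.
THERMAL_MODE_FILE="${THERMAL_MODE_FILE:-/sys/class/thermal/thermal_zone0/mode}"

start() {
    # Do nothing if the zone does not expose a writable mode file.
    [ -w "$THERMAL_MODE_FILE" ] || return 0
    echo disabled > "$THERMAL_MODE_FILE"
}
```

On OpenWrt the script would be made executable and activated with /etc/init.d/mt7623_fix enable. Note the report earlier in this thread that writing disabled has no effect when the board boots already above the first trip point, so this does not help in that case.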
Thank you @frank-w and @VA1DER
My workaround
That is the workaround I used to use, but it has two important limitations (which is why I now use a patched kernel).
@VA1DER Can you please post a patch doing both: swapping the active and passive trip points and, if needed, also lifting the trip points of both a bit? Ideally you would send it to the kernel mailing list, but even just sending it to the OpenWrt list, tested and with a patch description and SoB line, would be nice.
So, I don't know exactly WHAT is doing it, but it seems that the lower trip points are not being triggered.
https://github.com/openwrt/openwrt/pull/9778 just got merged and I decided to try the current release 22.03.1, which does not have the patch, as it was built before that was merged.
And the trip points are not being triggered. I'm running stress to try to trigger them, without luck:
root@OpenWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
98000 198000 398000 598000 747500 1040000 1196000 1300000
98000 198000 398000 598000 747500 1040000 1196000 1300000
98000 198000 398000 598000 747500 1040000 1196000 1300000
98000 198000 398000 598000 747500 1040000 1196000 1300000
root@OpenWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
1300000
1300000
1300000
1300000
root@OpenWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
1300000
1300000
1300000
1300000
root@OpenWrt:~# cat /sys/class/thermal/thermal_zone0/trip_point_*_temp
47000
67000
87000
107000
root@OpenWrt:~# cat /sys/class/thermal/thermal_zone0/temp
68190
I've tested this on two different devices, getting the same behavior.
Notice that the trip points read 47°C and 67°C, and even at higher temperatures the CPU is not throttled.
What is weirder is that the upstream kernel says 57°C and 67°C: https://github.com/torvalds/linux/blob/master/arch/arm/boot/dts/mt7623.dtsi#L163
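The manual comparison of the zone temperature against the trip points can be scripted. Below is a minimal sketch, assuming the trip_point_N_temp and trip_point_N_type files shown in the sysfs listing above (the _type files are assumed to sit next to the _temp files). The function name trip_report and its optional zone argument are hypothetical.

```shell
#!/bin/sh
# Sketch: print each trip point of a thermal zone and whether the current
# zone temperature (in millidegrees Celsius) is at or above it.
# The optional first argument is the zone directory; the default is zone 0.
trip_report() {
    zone="${1:-/sys/class/thermal/thermal_zone0}"
    temp=$(cat "$zone/temp") || return 1
    i=0
    while [ -r "$zone/trip_point_${i}_temp" ]; do
        trip=$(cat "$zone/trip_point_${i}_temp")
        kind=$(cat "$zone/trip_point_${i}_type" 2>/dev/null)
        if [ "$temp" -ge "$trip" ]; then
            status=exceeded
        else
            status=ok
        fi
        printf 'trip_point_%s %s %s %s\n' "$i" "$kind" "$trip" "$status"
        i=$((i + 1))
    done
}
```

With the values from the transcript (zone at 68190, trips at 47000, 67000, 87000 and 107000) the first two trips would be reported as exceeded. That only shows the thresholds were crossed; it says nothing about why no throttling followed.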
PROBLEM: MT7623 SoC devices (UniElec U7623 and Banana PI R2) are currently set to scale their CPU on passive cooling when they reach only 47°C. This is a temperature that is: a) very easy to reach under even light loads, and b) not dangerous to the device.
RAMIFICATION: It means that under even relatively light loads, the CPU frequency is scaled back rather drastically. When all four cores are in use, the frequency is scaled down to 98MHz, which essentially renders the devices unusable under any real load. All four cores operating together at 98MHz are sufficient to KEEP the CPU at >47°C, which then means the whole device bogs down.
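To confirm that a board is in this state, one option is to compare each core's current frequency ceiling with its hardware maximum. The following is a rough sketch, assuming the standard cpufreq sysfs files scaling_max_freq and cpuinfo_max_freq. The function name capped_cpus and its optional root argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: report CPUs whose allowed maximum frequency is below the hardware
# maximum. The optional first argument is a hypothetical override of the
# sysfs CPU root, useful for testing against a scratch directory.
capped_cpus() {
    root="${1:-/sys/devices/system/cpu}"
    for policy in "$root"/cpu[0-9]*/cpufreq; do
        # Skips the unexpanded glob and CPUs without cpufreq support.
        [ -r "$policy/scaling_max_freq" ] || continue
        hw_max=$(cat "$policy/cpuinfo_max_freq")
        cur_max=$(cat "$policy/scaling_max_freq")
        if [ "$cur_max" -lt "$hw_max" ]; then
            cpu=${policy%/cpufreq}
            printf '%s capped at %s kHz (hardware max %s kHz)\n' \
                "${cpu##*/}" "$cur_max" "$hw_max"
        fi
    done
}
```

No output means no core is capped. A ceiling of 98000 kHz on all four cores would match the behaviour described here, although a lowered ceiling can also come from a manual cpufreq setting rather than from thermal throttling.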
CAUSE: The problem stems from what appears to be an incorrectly set up dts file for the MT7623. This file defines four thermal trip points (lines 161-185 in the file):
This appears to be in error, or at least suboptimal. It makes little sense for the kernel to enlist passive remediation (CPU scaling) before active. Generally one would want a CPU's fan to engage BEFORE that CPU's speed is scaled back into uselessness.
RECOMMENDATION: Recommend that OpenWrt adopt a patch to reverse the types of trip points 0 and 1, so that 0 becomes active, and 1 becomes passive. This is, in my view, likely the originally intended sequence.
It also makes more sense from a cooling perspective, where any active cooling (if present) should be activated before CPU scaling is employed. And it makes more sense from a temperature perspective, where 47°C is a good temperature to turn a fan on at, since it's a temperature most CPUs won't reach under no load, but one that is easily attained under even minimal load. So the fan will engage early in a loading environment. And 67°C is a good temperature to start CPU scaling at.