openwrt / openwrt

This repository is a mirror of https://git.openwrt.org/openwrt/openwrt.git It is for reference only and is not active for check-ins. We will continue to accept Pull Requests here. They will be merged via staging trees then into openwrt.git.

MT7623 devices are unusable under even very light loads #9396

Open VA1DER opened 2 years ago

VA1DER commented 2 years ago

PROBLEM: MT7623 SoC devices (UniElec U7623 and Banana PI R2) are currently set to scale their CPU on passive cooling when they reach only 47°C. This is a temperature that is: a) very easy to reach under even light loads, and b) not dangerous to the device.

RAMIFICATION: It means that under even relatively light loads, the CPU frequency is scaled back rather drastically. When all four cores are in use, the frequency is scaled down to 98MHz, which essentially renders the devices unusable under any real load. All four cores operating together at 98MHz are sufficient to KEEP the CPU at >47°C, which then means the whole device bogs down.

CAUSE: The problem stems from what appears to be an incorrectly set up dts file for the MT7623. This file defines four thermal trip points (lines 161-185 in the file): passive at 47°C, active at 67°C, hot at 87°C, and critical at 107°C.

This appears to be in error, or at least suboptimal. It makes little sense for the kernel to enlist passive remediation (CPU scaling) before active. Generally one would want a CPU's fan to engage BEFORE that CPU's speed is scaled back into uselessness.

RECOMMENDATION: Recommend that OpenWrt adopt a patch to reverse the types of trip points 0 and 1, so that 0 becomes active, and 1 becomes passive. This is, in my view, likely the originally intended sequence.

It also makes more sense from a cooling perspective, where any active cooling (if present) should be activated before CPU scaling is employed. And it makes more sense from a temperature perspective: 47°C is a good temperature at which to turn a fan on, since most CPUs won't reach it under no load but will easily reach it under even minimal load, so the fan will engage early once the device is loaded. And 67°C is a good temperature at which to start CPU scaling.

VA1DER commented 2 years ago

mt7623_passive_active_swap.diff.gz

--- mt7623.dtsi.orig    2022-03-04 14:26:20.883787720 -0400
+++ mt7623.dtsi 2022-03-04 15:00:06.385349555 -0400
@@ -160,17 +160,17 @@

                trips {
                    cpu_passive: cpu-passive {
                        temperature = <47000>;
                        hysteresis = <2000>;
-                       type = "passive";
+                       type = "active";
                    };

                    cpu_active: cpu-active {
                        temperature = <67000>;
                        hysteresis = <2000>;
-                       type = "active";
+                       type = "passive";
                    };

                    cpu_hot: cpu-hot {
                        temperature = <87000>;
                        hysteresis = <2000>;

namidairo commented 2 years ago

I checked upstream and they've upped the trip point to 57000, so it may just be better to backport this patch from 5.15 here, citing the same reasons you have.

VA1DER commented 2 years ago

While I prefer my solution (as passive cooling at any temperature before active just doesn't make sense), I would tend to agree that adopting upstream's solution is more sustainable.

I'll work with upstream to try and get a more optimal solution there.

dangowrt commented 2 years ago

@frank-w Shouldn't this be fixed in upstream Kernel DTS? The order of the trip point types doesn't make much sense as @VA1DER correctly stated...

frank-w commented 2 years ago

Of course... I've sent a patch to increase the passive cooling trip to 57°C, but it was not accepted, nor did MTK comment on it with a better solution. Btw, mt7622 has the same problem. Here another user sent a patch to disable the CPU throttling in the lower 2 trips.

https://lore.kernel.org/linux-arm-kernel/20210725163451.217610-1-linux@fw-web.de/

https://lore.kernel.org/linux-arm-kernel/20210619121927.32699-1-ericwouds@gmail.com/

VA1DER commented 2 years ago

I made a mistake in my original report. There are 4 thermal trip points, and the "passive" is indeed before "active", but what I didn't notice is that the first three trip points (passive, active, and hot) all point to identical cooling maps shown below (trimmed for brevity):

                map0 {
                    trip = <&cpu_passive>;
                    cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,

                map1 {
                    trip = <&cpu_active>;
                    cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,

                map2 {
                    trip = <&cpu_hot>;
                    cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,

As you can see, each cooling map has the same cooling device, namely passive CPU scaling with no limits on how much scaling it can employ. This makes even less sense than I thought before, since the latter two trips cannot possibly have any effect.

So just switching "passive" and "active" as I originally proposed has no effect, since they both point to identical cooling maps. I suspect early in this device's development at Mediatek that they did contemplate active cooling support, but that it was deleted.

We can try and get fancy and put scaling limits on the earlier trip points. Alternatively, since the .dts file implies that CPU speed throttling can be done on a per-core basis, we can also experiment with the earlier trips allowing throttling only on one or two cores. But really I think simpler is better. I personally just deleted one trip entirely and its associated cooling map and am running with passive, hot, and critical trips, with passive starting at 67°C. It's working well on my devices right now. But to disable throttling on the lower two trips is a good solution too, since their temperatures (47°C and 67°C) are not really dangerous temperatures.

VA1DER commented 2 years ago

I checked upstream and they've upped the trip point to 57000

Even 57°C is an inappropriately low temperature to be throttling the CPU at. It is too low to allow proper operation, it is far below any danger point, and thus upstream's solution is insufficient. I have both a BPI R2 and U7623 board, and when they are mounted in cases with normally employed wi-fi cards, the internal temperature cannot be kept under 57°C in almost any real-world circumstance.

As noted above, the CPU will be throttled all the way down to 98MHz if the temperature rises even a degree above the trip point. I have further discovered that if the internal temperature of the device is above the first trip point temperature when it boots, then it will start in a throttled state, and even running echo disabled > /sys/class/thermal/thermal_zone0/mode will have no effect. I know of no way to manually unthrottle the CPU once throttling starts.
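
To see which trip points a zone has currently crossed, a minimal POSIX shell sketch along these lines may help. It assumes the usual thermal sysfs layout (temp, trip_point_N_temp and trip_point_N_type, with temperatures in millidegrees Celsius). The function name crossed_trips and its optional directory argument are illustrative only and are not part of any OpenWrt tool.

```sh
#!/bin/sh
# Sketch: list the trip points that the current zone temperature has reached.
# crossed_trips is a hypothetical helper; the directory argument exists only
# so the function can be pointed at another zone (or at a test fixture).
crossed_trips() {
    zone="${1:-/sys/class/thermal/thermal_zone0}"
    temp="$(cat "$zone/temp")" || return 1
    n=0
    # Trip files are numbered consecutively starting at 0.
    while [ -r "$zone/trip_point_${n}_temp" ]; do
        trip_temp="$(cat "$zone/trip_point_${n}_temp")"
        trip_type="$(cat "$zone/trip_point_${n}_type" 2>/dev/null || echo unknown)"
        if [ "$temp" -ge "$trip_temp" ]; then
            echo "trip $n ($trip_type) at $trip_temp reached, zone at $temp"
        fi
        n=$((n + 1))
    done
}
```

Called with no argument it inspects thermal_zone0. On a board sitting above 47°C at boot, it would be expected to report at least trip 0, which is the situation described above.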

My recommended solution (and what I employ on all affected Mediatek boards) is to simply delete the first two trip points and cooling maps. The throttling temperature will then be 87°C, which is still low enough for ARM devices to not be in the real danger zone, and gives some operational headroom. The same change will be required for the MT7622 (BPI R64) board.

mt7623_thermal_zone_fix.diff.gz

--- openwrt/build_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/linux-mediatek_mt7623/linux-5.4.179/arch/arm/boot/dts/mt7623.dtsi       2022-04-01 14:55:58.491324235 -0300
+++ mt7623.dtsi 2022-04-01 14:49:57.872947359 -0300
@@ -157,22 +157,10 @@
                                polling-delay = <1000>;

                                thermal-sensors = <&thermal 0>;

                                trips {
-                                       cpu_passive: cpu-passive {
-                                               temperature = <47000>;
-                                               hysteresis = <2000>;
-                                               type = "passive";
-                                       };
-
-                                       cpu_active: cpu-active {
-                                               temperature = <67000>;
-                                               hysteresis = <2000>;
-                                               type = "active";
-                                       };
-
                                        cpu_hot: cpu-hot {
                                                temperature = <87000>;
                                                hysteresis = <2000>;
                                                type = "hot";
                                        };
@@ -184,26 +172,10 @@
                                        };
                                };

                        cooling-maps {
                                map0 {
-                                       trip = <&cpu_passive>;
-                                       cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
-                                                        <&cpu1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
-                                                        <&cpu2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
-                                                        <&cpu3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
-                               };
-
-                               map1 {
-                                       trip = <&cpu_active>;
-                                       cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
-                                                        <&cpu1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
-                                                        <&cpu2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
-                                                        <&cpu3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
-                               };
-
-                               map2 {
                                        trip = <&cpu_hot>;
                                        cooling-device = <&cpu0 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
                                                         <&cpu1 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
                                                         <&cpu2 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>,
                                                         <&cpu3 THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;

frank-w commented 2 years ago

Just send it upstream for both soc and look if you get any comments.

anonimou0 commented 2 years ago

My workaround was to create a script at /etc/init.d/mt7623_fix containing

echo disabled > /sys/class/thermal/thermal_zone0/mode

After that, I could get a usable device.
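
For readers who want to reproduce this workaround, a minimal sketch of such an init script might look like the following. The START priority and the THERMAL_MODE_FILE override are assumptions made for illustration; only the path and the echo command come from the comment above.

```sh
# Sketch of /etc/init.d/mt7623_fix. On OpenWrt the file would begin with the
# interpreter line "#!/bin/sh /etc/rc.common", which supplies the enable/start
# machinery; it is left out here so the function can be tried in any shell.
START=99   # assumed priority: run late in the boot sequence

# THERMAL_MODE_FILE is a hypothetical override, useful only for testing.
THERMAL_MODE_FILE="${THERMAL_MODE_FILE:-/sys/class/thermal/thermal_zone0/mode}"

start() {
    # Switch the thermal zone to "disabled", as described in the comment above.
    echo disabled > "$THERMAL_MODE_FILE"
}
```

On a device, the script would need to be made executable and then enabled with /etc/init.d/mt7623_fix enable so that it runs at boot.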

Thank you @frank-w and @VA1DER

VA1DER commented 2 years ago

My workaround

That is the workaround I used to use, but it has two important limitations (which is why I now use a patched kernel):

  1. This inhibits ALL further trigger processing. So, at the really high temperatures where you still want thermal protection, you have none.
  2. That workaround only inhibits the processing of triggers. It doesn't inhibit thermal scaling. So if, for example, the device is already at 47 degrees when it boots (which is not a difficult temperature to get to just by having the sun shining on the black box of a Banana Pi R2), then the first trigger will get tripped before triggers get disabled, and the scaling will already be in effect and then locked there. Your CPU will be locked at 98MHz until you turn the device off, get it below 47 degrees, and then start it.
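
The second limitation can be checked before relying on the workaround. The sketch below reports CPUs whose frequency ceiling is already below the hardware maximum. It assumes that a thermal cap shows up in scaling_max_freq, which is the file inspected later in this thread; a manually lowered scaling_max_freq would be reported too. The name capped_cpus and its directory argument are hypothetical.

```sh
#!/bin/sh
# Sketch: report CPUs whose frequency ceiling is below the hardware maximum.
# capped_cpus is a hypothetical helper; the directory argument is for testing.
capped_cpus() {
    base="${1:-/sys/devices/system/cpu}"
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        # Skip CPUs without cpufreq support (or an unmatched glob).
        [ -r "$dir/scaling_max_freq" ] || continue
        hw_max="$(cat "$dir/cpuinfo_max_freq")"
        cur_max="$(cat "$dir/scaling_max_freq")"
        if [ "$cur_max" -lt "$hw_max" ]; then
            cpu="${dir%/cpufreq}"
            echo "${cpu##*/}: capped at $cur_max kHz (hardware max $hw_max kHz)"
        fi
    done
}
```

If capped_cpus prints anything at boot, the cap is already in place, and per the explanation above, disabling the zone afterwards would not be expected to lift it.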
dangowrt commented 2 years ago

@VA1DER Can you please post a patch doing both: swapping the active and passive trip points and, if needed, also raising both trip points a bit? Ideally you send it to the kernel mailing list, but even just sending it to the OpenWrt list, tested and with a patch description and SoB line, would be nice.

anonimou0 commented 1 year ago

So, I don't know exactly WHAT is doing it, but it seems that the lower trip points are not being triggered.

https://github.com/openwrt/openwrt/pull/9778 just got merged and I decided to try the current release 22.03.1, which does not have the patch, as it was built before that was merged.

And the trip points are not being triggered. I'm running stress to try to trigger it without luck:

root@OpenWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_frequencies
98000 198000 398000 598000 747500 1040000 1196000 1300000 
98000 198000 398000 598000 747500 1040000 1196000 1300000 
98000 198000 398000 598000 747500 1040000 1196000 1300000 
98000 198000 398000 598000 747500 1040000 1196000 1300000 
root@OpenWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
1300000
1300000
1300000
1300000
root@OpenWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
1300000
1300000
1300000
1300000
root@OpenWrt:~# cat /sys/class/thermal/thermal_zone0/trip_point_*_temp 
47000
67000
87000
107000
root@OpenWrt:~# cat /sys/class/thermal/thermal_zone0/temp 
68190

I've tested this on two different devices, getting the same behavior.

Notice that the trip points report 47°C and 67°C, and even at higher temperatures the CPU is not throttled.

What is even stranger is that the kernel says 57°C and 67°C: https://github.com/torvalds/linux/blob/master/arch/arm/boot/dts/mt7623.dtsi#L163