raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

[Kernel5.4] Lowering arm_freq_min leads to system hang/crash #1431

Open MichaIng opened 4 years ago

MichaIng commented 4 years ago

Describe the bug I was upgrading to the newest firmware + kernel packages, which resulted in system hangs and/or crashes. I narrowed the issue down to arm_freq_min, which I had lowered to 150 or 300 (tested both) to allow the system to clock below 600 MHz. Commenting out the setting leads to a stable system; setting/reducing it leads to a system that quickly hangs or crashes.

To reproduce

  1. Upgrade the kernel on Raspberry Pi 2 Model B Rev 1.1 to current package release 5.4.51-v7+.
  2. Set arm_freq_min to 300 (gpu_mem=16, if relevant)
  3. reboot
  4. Play around in the file system with some executables (like htop) until it either hangs or crashes. The last time, I triggered it with vcgencmd measure_clock gpu.

Expected behaviour Setting arm_freq_min to 300 should not lead to system crashes.

Actual behaviour The system quickly hangs or crashes once arm_freq_min is lowered (see the logs below).

Logs

[  189.433811] 8<--- cut here ---
[  189.433874] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[  189.433930] pgd = 89d1d828
[  189.433974] [0000000c] *pgd=36ce9835, *pte=00000000, *ppte=00000000
[  189.434029] Internal error: Oops: 17 [#1] SMP ARM
[  189.434055] Modules linked in:
[  189.434089] CPU: 3 PID: 487 Comm: bash Not tainted 5.4.51-v7+ #1326
[  189.434116] Hardware name: BCM2835
[  189.434151] PC is at filemap_map_pages+0x118/0x448
[  189.434181] LR is at filemap_map_pages+0x43c/0x448
[  189.434207] pc : [<80272f9c>]    lr : [<802732c0>]    psr: 80000113
[  189.434235] sp : b82afe40  ip : b82afe40  fp : b82afe9c
[  189.434261] r10: b8e9d868  r9 : b82afeb4  r8 : 000000c3
[  189.434289] r7 : 80d04f48  r6 : 00000406  r5 : 000000cf  r4 : ba319480
[  189.434318] r3 : 00000004  r2 : 000000c3  r1 : b9406130  r0 : 00000008
[  189.434350] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[  189.434382] Control: 10c5387d  Table: 3832006a  DAC: 00000055
[  189.434412] Process bash (pid: 487, stack limit = 0x41237b4b)
[  189.434441] Stack: (0xb82afe40 to 0xb82b0000)
[  189.434475] fe40: 000000c3 00000000 b82afeb4 b832ba80 80d04f48 b8e9d86c 000000c3 00030000
[  189.434518] fe60: b9406130 00000000 00000000 b19558f1 00000001 000000cf 80d04f48 000000c0
[  189.434561] fe80: b82afeb4 000d7000 00000007 00000000 b82aff1c b82afea0 802ab58c 80272e90
[  189.434604] fea0: 8010bbbc 80d04f48 00000000 000d7cfc b82afef4 b6d06360 00000054 00100cca
[  189.434646] fec0: 000000c7 000d3000 b8320000 b8320000 00000000 00000000 00000000 00000000
[  189.434690] fee0: b6ce934c ba309d34 00000000 b19558f1 b82aff1c b82affb0 b6c88000 000d7cfc
[  189.434733] ff00: b8833400 00000017 b8833440 00000000 b82aff74 b82aff20 808d37cc 802aadd4
[  189.434776] ff20: 8012fcb0 8020b210 b82aff54 b82aff38 8010cfdc 8010cf4c 8010bbbc 80d04f48
[  189.434819] ff40: 00000000 00000054 b82aff8c 80d0a7d4 00000017 808d34ac 000d7cfc b82affb0
[  189.434862] ff60: 000d7110 00000000 b82affac b82aff78 80115854 808d34b8 80101d1c 00000001
[  189.434905] ff80: b82affac b82aff90 8020b3fc 00071ae0 20000010 ffffffff 10c5387d 10c5387d
[  189.434948] ffa0: 00000000 b82affb0 80101d24 80115818 00000014 00000000 000d7cfc 00000000
[  189.434991] ffc0: 00000014 000fe744 00103f60 000001e7 00d9b0e8 000d7110 00000000 001052ac
[  189.435035] ffe0: 000fe078 7eb3b408 00054684 00071ae0 20000010 ffffffff 00000000 00000000
[  189.435067] Backtrace:
[  189.435104] [<80272e84>] (filemap_map_pages) from [<802ab58c>] (handle_mm_fault+0x7c4/0xa4c)
[  189.435151]  r10:00000000 r9:00000007 r8:000d7000 r7:b82afeb4 r6:000000c0 r5:80d04f48
[  189.435186]  r4:000000cf
[  189.435220] [<802aadc8>] (handle_mm_fault) from [<808d37cc>] (do_page_fault+0x320/0x3a8)
[  189.435266]  r10:00000000 r9:b8833440 r8:00000017 r7:b8833400 r6:000d7cfc r5:b6c88000
[  189.435301]  r4:b82affb0
[  189.435333] [<808d34ac>] (do_page_fault) from [<80115854>] (do_DataAbort+0x48/0xc4)
[  189.435377]  r10:00000000 r9:000d7110 r8:b82affb0 r7:000d7cfc r6:808d34ac r5:00000017
[  189.435412]  r4:80d0a7d4
[  189.435443] [<8011580c>] (do_DataAbort) from [<80101d24>] (__dabt_usr+0x44/0x60)
[  189.435478] Exception stack(0xb82affb0 to 0xb82afff8)
[  189.435508] ffa0:                                     00000014 00000000 000d7cfc 00000000
[  189.435551] ffc0: 00000014 000fe744 00103f60 000001e7 00d9b0e8 000d7110 00000000 001052ac
[  189.435593] ffe0: 000fe078 7eb3b408 00054684 00071ae0 20000010 ffffffff
[  189.435629]  r8:10c5387d r7:10c5387d r6:ffffffff r5:20000010 r4:00071ae0
[  189.435666] Code: 0a00000d e2830005 e2833001 e0810100 (e5904004)
[  189.435719] ---[ end trace a011ff3c127a31f8 ]---
[  210.449124] rcu: INFO: rcu_sched self-detected stall on CPU
[  210.449193] rcu:     3-....: (2099 ticks this GP) idle=e0a/1/0x40000002 softirq=1262/1262 fqs=1049
[  210.449234]  (t=2100 jiffies g=1173 q=148)
[  210.449261] NMI backtrace for cpu 3
[  210.449293] CPU: 3 PID: 487 Comm: bash Tainted: G      D           5.4.51-v7+ #1326
[  210.449326] Hardware name: BCM2835
[  210.449347] Backtrace:
[  210.449391] [<8010d458>] (dump_backtrace) from [<8010d750>] (show_stack+0x20/0x24)
[  210.449434]  r6:b82ae000 r5:00000000 r4:80d93ff4 r3:b19558f1
[  210.449473] [<8010d730>] (show_stack) from [<808b22a4>] (dump_stack+0xe0/0x124)
[  210.449518] [<808b21c4>] (dump_stack) from [<808b9b58>] (nmi_cpu_backtrace+0xc8/0xcc)
[  210.449562]  r8:00000140 r7:8090202c r6:00000003 r5:00000000 r4:00000003 r3:b19558f1
[  210.449606] [<808b9a90>] (nmi_cpu_backtrace) from [<808b9c58>] (nmi_trigger_cpumask_backtrace+0xfc/0x138)
[  210.449647]  r5:80d07c8c r4:8010f340
[  210.449682] [<808b9b5c>] (nmi_trigger_cpumask_backtrace) from [<80110560>] (arch_trigger_cpumask_backtrace+0x20/0x24)
[  210.449728]  r7:80d05004 r6:80000193 r5:8090201c r4:00000003
[  210.449769] [<80110540>] (arch_trigger_cpumask_backtrace) from [<80195084>] (rcu_dump_cpu_stacks+0xb4/0xe4)
[  210.449821] [<80194fd0>] (rcu_dump_cpu_stacks) from [<80194758>] (rcu_sched_clock_irq+0x868/0xa80)
[  210.449868]  r10:3d91c000 r9:80dd1018 r8:80d04ff4 r7:80ca2ec0 r6:be5beec0 r5:80da4b64
[  210.449905]  r4:80d109c0 r3:ffffdd04
[  210.449942] [<80193ef0>] (rcu_sched_clock_irq) from [<8019dbb4>] (update_process_times+0x3c/0x64)
[  210.449989]  r10:801b11f8 r9:be5b85f0 r8:be5b8540 r7:00000030 r6:ff33a1c1 r5:b82afae0
[  210.450023]  r4:00000000
[  210.450057] [<8019db78>] (update_process_times) from [<801b0950>] (tick_sched_handle+0x64/0x70)
[  210.450096]  r4:be5b8870 r3:20000113
[  210.450129] [<801b08ec>] (tick_sched_handle) from [<801b1254>] (tick_sched_timer+0x5c/0xb8)
[  210.450176] [<801b11f8>] (tick_sched_timer) from [<8019eb34>] (__hrtimer_run_queues+0x164/0x324)
[  210.450219]  r7:b82ae000 r6:be5b8540 r5:be5b8580 r4:be5b8870
[  210.450259] [<8019e9d0>] (__hrtimer_run_queues) from [<8019f548>] (hrtimer_interrupt+0x130/0x2a4)
[  210.450306]  r10:be5b85c8 r9:be5b85f0 r8:be5b8540 r7:ffffffff r6:7fffffff r5:00000003
[  210.450340]  r4:20000193
[  210.450377] [<8019f418>] (hrtimer_interrupt) from [<8071dee8>] (arch_timer_handler_phys+0x40/0x48)
[  210.450424]  r10:b82afc68 r9:b82ae000 r8:80d63338 r7:b98aad00 r6:000000a2 r5:b9802fc0
[  210.450459]  r4:80ca22a4
[  210.450495] [<8071dea8>] (arch_timer_handler_phys) from [<80186e40>] (handle_percpu_devid_irq+0x88/0x23c)
[  210.450546] [<80186db8>] (handle_percpu_devid_irq) from [<80180384>] (generic_handle_irq+0x34/0x44)
[  210.450594]  r9:b82ae000 r8:b989d000 r7:00000001 r6:00000000 r5:00000000 r4:80ca22a4
[  210.450638] [<80180350>] (generic_handle_irq) from [<80180ad0>] (__handle_domain_irq+0x6c/0xc4)
[  210.450687] [<80180a64>] (__handle_domain_irq) from [<80102228>] (bcm2836_arm_irqchip_handle_irq+0x60/0xa4)
[  210.450735]  r8:b82afc68 r7:b82afb14 r6:ffffffff r5:20000113 r4:00000003 r3:b82afae0
[  210.450779] [<801021c8>] (bcm2836_arm_irqchip_handle_irq) from [<80101a3c>] (__irq_svc+0x5c/0x7c)
[  210.450818] Exception stack(0xb82afae0 to 0xb82afb28)
[  210.450853] fae0: ba309d34 00000000 0000000e 0000000d 00011000 b6ce9040 000ee000 80b15000
[  210.450897] fb00: b82afc68 00010000 b82afc68 b82afb3c b82afb40 b82afb30 802a8a70 808d326c
[  210.450933] fb20: 20000113 ffffffff
[  210.450958]  r4:808d326c r3:b19558f1
[  210.450993] [<808d322c>] (_raw_spin_lock) from [<802a8a70>] (unmap_page_range+0x190/0x734)
[  210.451039] [<802a88e0>] (unmap_page_range) from [<802a9060>] (unmap_single_vma+0x4c/0x54)
[  210.451085]  r10:b6c88000 r9:b8833440 r8:00000000 r7:00000000 r6:b82afc68 r5:ffffffff
[  210.451119]  r4:b6d06360
[  210.451149] [<802a9014>] (unmap_single_vma) from [<802a91e0>] (unmap_vmas+0x64/0x78)
[  210.451195] [<802a917c>] (unmap_vmas) from [<802af958>] (exit_mmap+0xdc/0x178)
[  210.451237]  r8:0000000b r7:00000001 r6:00000000 r5:80d04f48 r4:b6d06360
[  210.451276] [<802af87c>] (exit_mmap) from [<8011c734>] (mmput+0x58/0x108)
[  210.451309]  r6:00000000 r5:00000000 r4:b8833400
[  210.451345] [<8011c6dc>] (mmput) from [<80123170>] (do_exit+0x364/0xb20)
[  210.451376]  r5:b8833400 r4:b6c88000
[  210.451409] [<80122e0c>] (do_exit) from [<8010d9a8>] (die+0x254/0x358)
[  210.451437]  r7:7f000000
[  210.451468] [<8010d754>] (die) from [<801159ec>] (__do_kernel_fault.part.0+0x88/0x98)
[  210.451513]  r10:00000000 r9:b8833440 r8:00000017 r7:b8833400 r6:b82afdf0 r5:00000017
[  210.451547]  r4:0000000c
[  210.451579] [<80115964>] (__do_kernel_fault.part.0) from [<808d3848>] (do_page_fault+0x39c/0x3a8)
[  210.451618]  r7:b8833400 r3:b82afdf0
[  210.451650] [<808d34ac>] (do_page_fault) from [<80115854>] (do_DataAbort+0x48/0xc4)
[  210.451694]  r10:b8e9d868 r9:b82ae000 r8:b82afdf0 r7:0000000c r6:808d34ac r5:00000017
[  210.451729]  r4:80d0a7d4
[  210.451758] [<8011580c>] (do_DataAbort) from [<801019b4>] (__dabt_svc+0x54/0x80)
[  210.451793] Exception stack(0xb82afdf0 to 0xb82afe38)
[  210.451823] fde0:                                     00000008 b9406130 000000c3 00000004
[  210.451867] fe00: ba319480 000000cf 00000406 80d04f48 000000c3 b82afeb4 b8e9d868 b82afe9c
[  210.451908] fe20: b82afe40 b82afe40 802732c0 80272f9c 80000113 ffffffff
[  210.451944]  r8:000000c3 r7:b82afe24 r6:ffffffff r5:80000113 r4:80272f9c
[  210.451984] [<80272e84>] (filemap_map_pages) from [<802ab58c>] (handle_mm_fault+0x7c4/0xa4c)
[  210.452030]  r10:00000000 r9:00000007 r8:000d7000 r7:b82afeb4 r6:000000c0 r5:80d04f48
[  210.452064]  r4:000000cf
[  210.452095] [<802aadc8>] (handle_mm_fault) from [<808d37cc>] (do_page_fault+0x320/0x3a8)
[  210.452140]  r10:00000000 r9:b8833440 r8:00000017 r7:b8833400 r6:000d7cfc r5:b6c88000
[  210.452174]  r4:b82affb0
[  210.452205] [<808d34ac>] (do_page_fault) from [<80115854>] (do_DataAbort+0x48/0xc4)
[  210.452249]  r10:00000000 r9:000d7110 r8:b82affb0 r7:000d7cfc r6:808d34ac r5:00000017
[  210.452283]  r4:80d0a7d4
[  210.452313] [<8011580c>] (do_DataAbort) from [<80101d24>] (__dabt_usr+0x44/0x60)
[  210.452348] Exception stack(0xb82affb0 to 0xb82afff8)
[  210.452378] ffa0:                                     00000014 00000000 000d7cfc 00000000
[  210.452421] ffc0: 00000014 000fe744 00103f60 000001e7 00d9b0e8 000d7110 00000000 001052ac
[  210.452462] ffe0: 000fe078 7eb3b408 00054684 00071ae0 20000010 ffffffff
[  210.452498]  r8:10c5387d r7:10c5387d r6:ffffffff r5:20000010 r4:00071ae0

Additional context

2020-07-21 19:43:53 root@micha:~# cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies
300000 360000 450000 600000 900000

This is new and probably the reason for the crashes when lowering the minimum frequency. When leaving it at 600, there are only two P-states, 600 and 900, and with kernel 4.19 there are always only two. I was actually hoping for that feature, so great work; sadly, however, at least my RPi model does not work well with it.

EmilyNerdGirl commented 3 years ago

arm_freq_max is ineffective expectedly, arm_freq is the maximum frequency already.

The Raspberry Pi Zero has a default of 1000 MHz, so if that causes issues, e.g. when you simply remove or comment the two lines and it fails to reboot, then there seems to be an issue with the hardware. But that is not related to the arm_freq_min topic, this issue is about .

Just realized I crossed streams on github issues, deleted it :facepalm:

ForceConstant commented 3 years ago

I have a Raspi 3B+ and have noticed that changes to arm_freq_min below 600 have no effect. I see that there was a previous workaround forcing this, but I did a recent rpi-update and it still persists. Should I be able to set the minimum frequency below 600 now?

popcornmix commented 3 years ago

It's not something currently supported. In general the clock gating of the arm core is pretty good and the benefits are minor when lowering arm clock below 600MHz

MichaIng commented 3 years ago

Then this should be documented to avoid confusion. The setting can still be used to raise the minimum frequency, although that can be done via CPUfreq as well.

JamesH65 commented 3 years ago

Documentation has been updated https://github.com/raspberrypi/documentation/commit/66e749c4ec540381062b36d06052a89cc4920ac2

TheyKilledKenny commented 3 years ago

It's not something currently supported. In general the clock gating of the arm core is pretty good and the benefits are minor when lowering arm clock below 600MHz

The benefit on the working temperature is enough to make the difference between using or not using a heat sink with fan, in our case. So, as we know the hardware is able to manage it (Ras Pi2 V1.2), please try to allow us to use lower frequencies. I hope this issue can be resolved asap.

MichaIng commented 3 years ago

On Raspberry Pi 4, arm_freq_min still works to reduce the minimal frequency down to 100 MHz. Interestingly it seems to cause issues there as well: https://github.com/MichaIng/DietPi/issues/4455#issuecomment-853245807

Just a single case for now, but probably the same/similar underlying issue.

C0D3-M4513R commented 3 years ago

I am doing some more tests, this time with ahk timing the execution of the date command, to 5 seconds. If longer tests are required, I will gladly do a minute timing each! oc is none/normal

root@DietPi:~# echo 300000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
root@DietPi:~# vcgencmd measure_clock arm
frequency(48)=300111328
root@DietPi:~# date +%H:%M:%S:%N
13:50:02:917133289
root@DietPi:~# date +%H:%M:%S:%N
13:50:07:943119323
root@DietPi:~# echo 200000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
root@DietPi:~# vcgencmd measure_clock arm
frequency(48)=200074224
root@DietPi:~# date +%H:%M:%S:%N
13:50:15:251591411
root@DietPi:~# date +%H:%M:%S:%N
13:50:19:534279556
root@DietPi:~# echo 100000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
root@DietPi:~# vcgencmd measure_clock arm
frequency(48)=100037112
root@DietPi:~# date +%H:%M:%S:%N
13:50:26:282622460
root@DietPi:~# date +%H:%M:%S:%N
13:50:26:654301830
root@DietPi:~#
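
To make the transcript above easier to read, here is a minimal Python sketch that computes how much time passed on the Pi between each pair of `date +%H:%M:%S:%N` outputs. The helper names `to_ns` and `elapsed_ns` are made up for illustration, and the sketch assumes both timestamps of a pair fall on the same day.

```python
def to_ns(stamp: str) -> int:
    """Convert an 'HH:MM:SS:NNNNNNNNN' string (date +%H:%M:%S:%N) to nanoseconds since midnight."""
    hours, minutes, seconds, nanos = (int(part) for part in stamp.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1_000_000_000 + nanos


def elapsed_ns(start: str, end: str) -> int:
    """Nanoseconds between two timestamps taken on the same day."""
    return to_ns(end) - to_ns(start)


# Timestamp pairs copied from the transcript above, keyed by scaling_min_freq in kHz.
samples = {
    300000: ("13:50:02:917133289", "13:50:07:943119323"),
    200000: ("13:50:15:251591411", "13:50:19:534279556"),
    100000: ("13:50:26:282622460", "13:50:26:654301830"),
}

for freq_khz, (start, end) in samples.items():
    print(f"{freq_khz} kHz: {elapsed_ns(start, end) / 1e9:.3f} s elapsed on the Pi")
```

For these values the elapsed times come out at roughly 5.03 s, 4.28 s and 0.37 s. If the 5-second sleep on the sending side was accurate, this would suggest that the Pi's clock advances more slowly at the lower frequencies, though a longer test would be needed to confirm that.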

I used the following ahk script:

#NoEnv  ; Recommended for performance and compatibility with future AutoHotkey releases.
; #Warn  ; Enable warnings to assist with detecting common errors.
SendMode Input  ; Recommended for new scripts due to its superior speed and reliability.
SetWorkingDir %A_ScriptDir%  ; Ensures a consistent starting directory.

#MaxThreadsPerHotkey 2
SetBatchLines -1

F17::
    SendInput {Raw}vcgencmd measure_clock arm`n
    ;Wait for command output
    Sleep, 500
    SendInput {Raw}date +`%H:`%M:`%S:`%N`n
    Sleep, 5000
    SendInput {Raw}date +`%H:`%M:`%S:`%N`n
return

TheyKilledKenny commented 3 years ago

It's not something currently supported. In general the clock gating of the arm core is pretty good and the benefits are minor when lowering arm clock below 600MHz

Unfortunately, by now we have more than 4K rpi2 v1.2 and v1.1 installed, which keep freezing and need several reboots to restart. We are forced to manually bring the kernel on each one back to version 4.9.35, which allows us to manage the frequencies without any freezes, but this exposes us to risks. Unfortunately, it is impossible to equip the rpi2 with a cooling fan due to the environment in which they work, so in addition to the freezes, we are also seeing overtemperatures that we never had before. What is the problem that prevents re-enabling this very useful feature?

Please, let me know if you need more information than what @MichaIng has already given you, which seems to me already very complete and exhaustive.

This is a working rpi2. In config.txt we only changed the following (no undervolt or other settings):

arm_freq_min=350
core_freq_min=150
temp_limit=70

cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state
350000 17858319
900000 31872

uname -a
Linux HG-BOX-LIGHT 4.9.35-v7+ #1014 SMP Fri Jun 30 14:47:43 BST 2017 armv7l GNU/Linux
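
As a rough illustration of what the time_in_state figures above mean, the following Python sketch computes the share of time spent at each frequency and the time-weighted average frequency. The function names `parse_time_in_state` and `summarize` are hypothetical, and the input is assumed to be whitespace-separated frequency/counter pairs as shown above.

```python
def parse_time_in_state(text: str) -> dict[int, int]:
    """Map each frequency in kHz to the time counter reported by the cpufreq stats."""
    tokens = text.split()
    return {int(tokens[i]): int(tokens[i + 1]) for i in range(0, len(tokens) - 1, 2)}


def summarize(stats: dict[int, int]) -> tuple[dict[int, float], float]:
    """Return (share of time per frequency, time-weighted average frequency in kHz)."""
    total = sum(stats.values())
    shares = {freq: ticks / total for freq, ticks in stats.items()}
    average_khz = sum(freq * ticks for freq, ticks in stats.items()) / total
    return shares, average_khz


# Values copied from the time_in_state output above.
stats = parse_time_in_state("350000 17858319 900000 31872")
shares, average_khz = summarize(stats)
for freq, share in shares.items():
    print(f"{freq} kHz: {share:.2%}")
print(f"average: {average_khz:.0f} kHz")
```

With these numbers the device spent about 99.8% of the recorded time at 350000 kHz, so the average stays close to 351 MHz.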

You can see that with this config.txt and the older kernel, the RPi2 is able to work without any freezes at 350000, which is our required minimum frequency. The current version freezes every 2 or 3 days, or during some reboots.

I apologize to the readers, but I must express my disappointment: the apparent superficiality with which this very big issue is treated and closed gives me a lot to think about... It doesn't seem serious to me to close such a big problem with a simple "It is no longer supported", or to modify the documentation as @JamesH65 wrote, leaving everyone in trouble. Please show some commitment to those who buy in batches and build not only gaming and overclocking machines, but things like environment monitoring, energy management, etc. I hope to see some sort of progress very soon on this issue, which at the moment seems to have been abandoned by you.

MichaIng commented 3 years ago

You can btw use the latest 4.19.x kernel, no need to stay with 4.9.x. The problem started with the intermediate frequency steps implemented with 5.4.x, while previous kernel versions only jump between min and max without any intermediate. Here the latest commit which you should be able to use: https://github.com/Hexxeh/rpi-firmware/tree/866751bfd023e72bd96a8225cf567e03c334ecc4

rpi-update 866751bfd023e72bd96a8225cf567e03c334ecc4

or the latest packages for Raspbian Stretch:

https://archive.raspberrypi.org/debian/pool/main/r/raspberrypi-firmware/raspberrypi-bootloader_1.20190819~stretch-1_armhf.deb
https://archive.raspberrypi.org/debian/pool/main/r/raspberrypi-firmware/raspberrypi-kernel_1.20190819~stretch-1_armhf.deb
https://archive.raspberrypi.org/debian/pool/main/r/raspberrypi-firmware/libraspberrypi0_1.20190819~stretch-1_armhf.deb
https://archive.raspberrypi.org/debian/pool/main/r/raspberrypi-firmware/libraspberrypi-bin_1.20190819~stretch-1_armhf.deb

With the current kernel, you can alternatively try to reduce voltage. -2 is what works very stable in my experience:

over_voltage=-2
over_voltage_min=-2

Not sure if it works as well as a 350 MHz idle frequency regarding power consumption/temperature, but it is worth a try. Another thing is the scaling governor. While ondemand is the default and hardcoded in raspi-config's init script, in my experience schedutil works much better in both directions: it raises the frequency quicker under load and also lowers it much quicker when idle. As a result, in my use cases time_in_state shows a much lower average frequency (lower power consumption and temperatures) with schedutil, without losing responsiveness.
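
For anyone who wants to try the governor switch from a script, here is a minimal Python sketch using the standard cpufreq sysfs files `scaling_available_governors` and `scaling_governor`. The function name `set_governor` and the `policy_dir` parameter are my own choices for illustration; the default path assumes the same policy0 directory used elsewhere in this thread.

```python
from pathlib import Path

# Assumed default cpufreq policy directory (same policy0 path as used in this thread).
DEFAULT_POLICY = Path("/sys/devices/system/cpu/cpufreq/policy0")


def set_governor(name: str, policy_dir: Path = DEFAULT_POLICY) -> str:
    """Switch the scaling governor if the kernel offers it; return the previous one."""
    available = (policy_dir / "scaling_available_governors").read_text().split()
    if name not in available:
        raise ValueError(f"governor {name!r} not offered; available: {available}")
    governor_file = policy_dir / "scaling_governor"
    previous = governor_file.read_text().strip()
    governor_file.write_text(name)
    return previous


# Example call (needs root on a real system, and the change is lost on reboot):
# print("previous governor:", set_governor("schedutil"))
```

Passing the policy directory as a parameter keeps the function easy to try against a scratch directory before touching a production device.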

TheyKilledKenny commented 3 years ago

You can btw use the latest 4.19.x kernel, no need to stay with 4.9.x. The problem started with the intermediate frequency steps implemented with 5.4.x, while previous kernel versions only jump between min and max without any intermediate. Here the latest commit which you should be able to use: https://github.com/Hexxeh/rpi-firmware/tree/866751bfd023e72bd96a8225cf567e03c334ecc4 [...]

Thanks Micha,

unfortunately this is a problem that we can also encounter in some versions of the 4.19.X kernel, and it is only thanks to you opening this issue that we understood that the problem could be related to the scaling governor and kernel version.

Following your advice, (before I saw your previous message), I tried to change from 4.9.35 to 4.19.118 (rpi-update e1050e94821a70b2e4c72b318d6c6c968552e9a2) , but as soon as I did I immediately had total system freezes with no lines written in any log. As soon as I rpi-update to version 4.19.80 (c5736330216628b5ff8e3d17dde7cc03ce2126e6) all problems are gone.

I must point out that:

I tried using -2 as over_voltage, but it resulted in no measurable reduction of the working temperature. Then I tried -4 and saw a small temperature reduction, but maybe it is more related to the external environment temperature reduction. The only thing that helps to reduce the temperature is lowering the frequency. If we can't lower the working frequency, we can't lower the temperature; that's why a permanent solution to this problem would be really appreciated. I'm going to test whether it works with the last hash you wrote in the previous message, and I'll also try changing the governor; we have never changed it.

I see that in a previous post you wrote that you managed to lower the frequency in a 5.4 commit on 24 August 2020 (https://github.com/raspberrypi/firmware/issues/1431#issuecomment-680270601) You think it's worth a try ?

Thank you!

MichaIng commented 3 years ago

Following your advice, (before I saw your previous message), I tried to change from 4.9.35 to 4.19.118 (rpi-update e1050e94821a70b2e4c72b318d6c6c968552e9a2) , but as soon as I did I immediately had total system freezes with no lines written in any log.

To be honest, I'm not sure whether this commit has even been released and tested in a production environment. The APT packages are probably a better choice then, as those have been and still are used on most production Raspbian Stretch systems. But if that 4.19.80 commit works stably now, it is probably not worth changing anything about that 🙂.

we're still based on raspbian Jessie light

Okay that is oldoldstable and soon oldoldoldstable already (does this suite codename even exist? 😄). I think 4.19 kernels have never been tested on Jessie systems, so the freeze you mentioned might even be related to that, e.g. the old systemd (init system) version being related or so, not sure.

but maybe it is more related to the external environment temperature reduction

Especially for temperature, given the precision and error range, it is of course difficult to measure significant changes without laboratory conditions; measuring the power consumption over a longer period might work better. But yes, I agree that lowering the frequency definitely had an effect and also allowed lowering over_voltage_min further.

I see that in a previous post you wrote that you managed to lower the frequency in a 5.4 commit on 24 August 2020 (https://github.com/raspberrypi/firmware/issues/1431#issuecomment-680270601) You think it's worth a try ?

Yes it worked there, when not reducing voltage, but test it thoroughly before applying to production as those commits are in the middle between stable releases. The latest commit which still allows to lower the frequency on RPi 0-3 is:

rpi-update cc9ff6c7d1b9be5465c24c75941b049f94a6bd32

The next commit disabled it: https://github.com/Hexxeh/rpi-firmware/commits/ab9d6874ff67f7ef015d04358ad1e7711abe3f20

TheyKilledKenny commented 3 years ago

we're still based on raspbian Jessie light

Okay that is oldoldstable and soon oldoldoldstable already (does this suite codename even exist? 😄). I think 4.19 kernels have never been tested on Jessie systems, so the freeze you mentioned might even be related to that, e.g. the old systemd (init system) version being related or so, not sure.

You are absolutely right. The latest kernel officially supported by Raspbian Jessie is 4.9.35; they only maintain firmware and kernel for the current distribution (https://www.raspberrypi.org/forums/viewtopic.php?t=240508#p1467971). But even if we put in place the necessary resources to update all devices currently running (4K), the current kernel would not allow us to underclock, so the problem would still be present. This is why I'm following this issue, but they closed it in the worst way possible (at least for us), simply by removing the underclock option... (if there is no option, there are no issues with that option either). But by closing the issue in this way, a real change of product specifications takes place without any notice, for something I've already bought and own, and this is incorrect (@popcornmix).

I don't need multistep scaling, a switch between min and max would be enough if it worked. It would still be better than nothing. I'm going to test a bit more, but I suppose we have to keep using the old (unsafe) 4.9.x kernel.

Thanks @MichaIng for your support.

TheyKilledKenny commented 3 years ago

I did further tests starting from 4.9.35 and going up to 5.4.77. I still don't understand whether it is a kernel, firmware or hardware problem, but at the moment none of the rpi-update versions after 4.9.80 (5c80565c5c0c7f820258c792a98b56f22db2dd03) can last more than 2 days.

I found some strange data; I hope it will be useful to those of you with more expertise.

For the following tests I raised and lowered the ambient temperature in order to reach the throttled and capped states.

in config.txt:

arm_freq_min=350
core_freq_min=150
temp_limit=70

I have a script that writes the following information to a file every 5 seconds:

CPU = cat /sys/class/thermal/thermal_zone0/temp
GPU = /opt/vc/bin/vcgencmd measure_temp
Freq = cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
ARM = /opt/vc/bin/vcgencmd measure_clock arm
CORE = /opt/vc/bin/vcgencmd measure_clock core
FLAGS = /opt/vc/bin/vcgencmd get_throttled
OTHER = /opt/vc/bin/vcgencmd read_ring_osc
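
The original logging script is not shown, so the following is only a sketch of how the raw outputs of those commands could be turned into numbers for later analysis. The function names are hypothetical, and the expected formats (`temp=69.3'C`, `frequency(48)=300111328`, `throttled=0x60006`) are assumptions based on the values quoted in this thread; thermal_zone0/temp is assumed to report millidegrees Celsius.

```python
import re


def parse_temp(line: str) -> float:
    """Parse "temp=69.3'C" (vcgencmd measure_temp) into degrees Celsius."""
    match = re.fullmatch(r"temp=([0-9.]+)'C", line.strip())
    if match is None:
        raise ValueError(f"unexpected measure_temp output: {line!r}")
    return float(match.group(1))


def parse_clock_hz(line: str) -> int:
    """Parse "frequency(48)=300111328" (vcgencmd measure_clock) into Hz."""
    match = re.fullmatch(r"frequency\(\d+\)=(\d+)", line.strip())
    if match is None:
        raise ValueError(f"unexpected measure_clock output: {line!r}")
    return int(match.group(1))


def parse_throttled(line: str) -> int:
    """Parse "throttled=0x60006" (vcgencmd get_throttled) into an integer bit mask."""
    match = re.fullmatch(r"throttled=(0x[0-9a-fA-F]+)", line.strip())
    if match is None:
        raise ValueError(f"unexpected get_throttled output: {line!r}")
    return int(match.group(1), 16)


def parse_sysfs_temp(text: str) -> float:
    """Convert a thermal_zone0/temp reading in millidegrees (e.g. "68700") to Celsius."""
    return int(text.strip()) / 1000
```

Raising on unexpected input, rather than returning a default, makes it obvious when a firmware update changes an output format.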

These are the results with the official 4.9.35 kernel and no CPU load. They are as expected.

|   CPU  |  GPU   |   Freq  |   ARM     |   CORE    |     FLAGS          |      other info
+--------+--------+---------+-----------+-----------+--------------------+-------------------------------------
| 68.7'C | 69.3'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.051MHz (@1.2000V)
| 68.7'C | 68.2'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.056MHz (@1.2000V)
| 68.7'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.055MHz (@1.2000V)
| 67.6'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.056MHz (@1.2000V)
| 67.1'C | 67.1'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.060MHz (@1.2000V)
| 67.1'C | 66.6'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.062MHz (@1.2000V)
| 64.4'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60000 | read_ring_osc(2)=3.065MHz (@1.2000V)
| 64.4'C | 64.5'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x60000 | read_ring_osc(2)=3.064MHz (@1.2000V)

Starting with kernel 4.14.x, we can see that the CPU temperature is on average 5 degrees lower than the Core temperature. Is that possible within the same chip? Is it true? The frequency appears to behave as expected, following the Core temperature (the higher one). Even if more power is needed, when throttled=0x60006 the ARM and GPU frequencies are kept low.

|   CPU  |  GPU   |   Freq  |   ARM     |   CORE    |     FLAGS          |      other info
+--------+--------+---------+-----------+-----------+--------------------+-------------------------------------
| 59.4'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=3.067MHz (@1.2000V)
| 59.4'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=3.065MHz (@1.2000V)
| 60.5'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=3.062MHz (@1.2000V)
| 59.9'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=3.063MHz (@1.2000V)
| 59.9'C | 65.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=3.065MHz (@1.2000V)
| 60.5'C | 65.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=3.060MHz (@1.2000V)
| 61.0'C | 65.5'C |  350MHz | 900000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=3.082MHz (@1.2063V)
[...]
| 64.8'C | 69.8'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=3.053MHz (@1.2000V)
| 64.8'C | 69.8'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.053MHz (@1.2000V)
| 65.3'C | 70.4'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.050MHz (@1.2000V)
| 64.8'C | 70.9'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.051MHz (@1.2000V)
| 65.9'C | 70.9'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.046MHz (@1.2000V)
[...]
| 64.2'C | 69.3'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.054MHz (@1.2000V)
| 64.2'C | 69.8'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.054MHz (@1.2000V)
| 64.2'C | 69.3'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.050MHz (@1.2000V)
| 63.7'C | 68.8'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.051MHz (@1.2000V)
| 64.2'C | 69.3'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.053MHz (@1.2000V)
| 63.7'C | 68.8'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.055MHz (@1.2000V)
| 63.2'C | 68.8'C |  900MHz | 350002KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.056MHz (@1.2000V)
| 63.7'C | 68.2'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.055MHz (@1.2000V)
| 63.7'C | 68.8'C |  900MHz | 884000KHz | 250000KHz |  throttled=0x60002 | read_ring_osc(2)=3.076MHz (@1.2063V)
| 62.6'C | 67.7'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x60002 | read_ring_osc(2)=3.077MHz (@1.2063V)
| 62.6'C | 68.2'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x60002 | read_ring_osc(2)=3.079MHz (@1.2063V)
| 62.6'C | 67.1'C |  350MHz | 900000KHz | 250000KHz |  throttled=0x60002 | read_ring_osc(2)=3.079MHz (@1.2063V)
| 61.6'C | 66.6'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.057MHz (@1.2000V)
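
Since the FLAGS column is a bit mask, a small decoder may help when reading these tables. The sketch below uses the bit meanings listed in the Raspberry Pi documentation for `vcgencmd get_throttled` as far as I know them; treat the table as an assumption and verify it against the documentation for your firmware. The name `decode_throttled` is made up for illustration.

```python
# Bit meanings for `vcgencmd get_throttled`, as I understand the official
# documentation (an assumption; verify against your firmware version).
THROTTLED_BITS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(value: int) -> list[str]:
    """Return the labels of all bits set in a get_throttled value."""
    return [label for bit, label in THROTTLED_BITS.items() if value & (1 << bit)]


# Values that appear in the FLAGS column of the tables in this thread.
for raw in ("0x20000", "0x20002", "0x60002", "0x60006"):
    print(raw, "->", ", ".join(decode_throttled(int(raw, 16))))
```

Under this mapping, bit 1 (0x2) is the frequency cap and bit 2 (0x4) is active throttling, so it is worth double-checking which state each logged value actually refers to.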

At this point I noticed a commit from @popcornmix (bdb826a8db75ba36d754bd71fb64d3905d3bd026) that has the following description (1st row):

firmware: Rework the frequency/voltage scaling logic …
firmware: Clamp SDRAM frequencies only when sdm audio is active
See: #172
firmware: arm_dt: Improve DTB location, upstream kernel support
See: raspberrypi/firmware#943
@popcornmix
popcornmix committed on 7 Mar 2018

Starting from this commit, as soon as the rpi2 goes into the throttled state (0x20002 or 0x60002), the ARM frequency goes crazy even if there is no CPU load:

| 60.5'C | 65.0'C |  350MHz | 1148000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.985MHz (@1.2063V) (65.5'C)
| 60.5'C | 65.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60000 | read_ring_osc(2)=2.960MHz (@1.2000V) (65.5'C)
| 59.4'C | 65.5'C |  350MHz | 1147998KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.983MHz (@1.2063V) (65.0'C)
| 60.5'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60000 | read_ring_osc(2)=2.961MHz (@1.2000V) (65.5'C)
| 59.4'C | 65.0'C |  350MHz | 1148000KHz | 149999KHz |  throttled=0x60002 | read_ring_osc(2)=2.982MHz (@1.2063V) (65.0'C)
| 59.4'C | 65.0'C |  350MHz | 349998KHz | 149999KHz |  throttled=0x60000 | read_ring_osc(2)=2.962MHz (@1.2000V) (65.0'C)

There is no overclock, but the ARM frequency is read as 1148000KHz. Shouldn't the maximum be 900 MHz on the rpi2? These are only a few lines to show you, but there are hours of logs with the same behaviour, with timestamps, if someone needs them. This behaviour is the same for every other kernel up to 5.4.77.

Here is the same with kernel 5.4.77 (here with arm_freq_min=300): it works as expected during the ARM Capped state (0x60006) or the non-throttled states (0x0, 0x60000 or 0x20000), but the ARM frequency goes crazy when it is in the throttled state (0x20002 or 0x60002).

This throttled state seems to be where all my RPis keep crashing (sometimes after 2 hours, sometimes after 2 days).

| 69.8'C | 69.8'C |  900MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.051MHz (@1.2000V) (69.8'C)
| 69.2'C | 69.3'C |  400MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.052MHz (@1.2000V) (69.8'C)
| 69.2'C | 69.3'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.058MHz (@1.2000V) (69.3'C)
| 68.7'C | 69.3'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.056MHz (@1.2000V) (68.8'C)
| 68.7'C | 69.3'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.059MHz (@1.2000V) (67.7'C)
| 68.7'C | 68.2'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.055MHz (@1.2000V) (68.8'C)
| 67.6'C | 68.2'C |  300MHz | 633000KHz | 166667KHz |  throttled=0x60002 | read_ring_osc(2)=3.396MHz (@1.2063V) (67.7'C)
| 67.6'C | 68.2'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=3.061MHz (@1.2000V) (67.7'C)
| 67.6'C | 68.2'C |  300MHz | 633000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.085MHz (@1.2063V) (68.2'C)
| 67.6'C | 68.2'C |  300MHz | 633000KHz | 166666KHz |  throttled=0x60006 | read_ring_osc(2)=3.058MHz (@1.2000V) (68.2'C)
| 67.6'C | 67.7'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.060MHz (@1.2000V) (68.8'C)
| 67.1'C | 67.7'C |  300MHz | 300000KHz | 166667KHz |  throttled=0x60002 | read_ring_osc(2)=3.086MHz (@1.2063V) (67.7'C)
| 67.6'C | 67.7'C |  300MHz | 633000KHz | 166667KHz |  throttled=0x60002 | read_ring_osc(2)=3.085MHz (@1.2063V) (67.7'C)
| 67.6'C | 67.7'C |  300MHz | 300000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.061MHz (@1.2000V) (67.7'C)
| 67.1'C | 67.7'C |  300MHz | 633000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.084MHz (@1.2063V) (68.2'C)
| 67.1'C | 67.7'C |  300MHz | 686000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.088MHz (@1.2063V) (67.1'C)
| 67.1'C | 67.1'C |  300MHz | 686000KHz | 166666KHz |  throttled=0x60002 | read_ring_osc(2)=3.085MHz (@1.2063V) (67.7'C)
| 67.1'C | 67.7'C |  300MHz | 686000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.085MHz (@1.2063V) (67.1'C)
| 67.1'C | 67.1'C |  300MHz | 740000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.090MHz (@1.2063V) (66.6'C)
| 67.1'C | 66.6'C |  300MHz | 686000KHz | 166666KHz |  throttled=0x60002 | read_ring_osc(2)=3.088MHz (@1.2063V) (67.7'C)
| 66.6'C | 67.1'C |  300MHz | 740000KHz | 166666KHz |  throttled=0x60002 | read_ring_osc(2)=3.087MHz (@1.2063V) (67.1'C)
| 66.6'C | 66.6'C |  300MHz | 740000KHz | 149999KHz |  throttled=0x60002 | read_ring_osc(2)=3.086MHz (@1.2063V) (66.6'C)
| 66.6'C | 66.6'C |  300MHz | 740000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.089MHz (@1.2063V) (67.1'C)
| 67.1'C | 66.6'C |  500MHz | 686000KHz | 183333KHz |  throttled=0x60002 | read_ring_osc(2)=3.084MHz (@1.2063V) (67.7'C)
| 67.1'C | 67.7'C |  500MHz | 633000KHz | 200000KHz |  throttled=0x60002 | read_ring_osc(2)=3.085MHz (@1.2063V) (67.1'C)
| 67.1'C | 67.1'C |  500MHz | 686000KHz | 183333KHz |  throttled=0x60002 | read_ring_osc(2)=3.084MHz (@1.2063V) (66.6'C)
| 67.6'C | 67.1'C |  500MHz | 686000KHz | 183333KHz |  throttled=0x60002 | read_ring_osc(2)=3.077MHz (@1.2063V) (67.7'C)
| 67.6'C | 67.7'C |  600MHz | 633000KHz | 200000KHz |  throttled=0x60002 | read_ring_osc(2)=3.080MHz (@1.2063V) (67.7'C)
| 67.1'C | 67.1'C |  500MHz | 633000KHz | 183333KHz |  throttled=0x60002 | read_ring_osc(2)=3.086MHz (@1.2063V) (67.1'C)
| 66.6'C | 66.6'C |  300MHz | 740000KHz | 166666KHz |  throttled=0x60002 | read_ring_osc(2)=3.085MHz (@1.2063V) (66.6'C)
| 66.6'C | 66.6'C |  300MHz | 793998KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.091MHz (@1.2063V) (66.1'C)
| 66.0'C | 65.5'C |  300MHz | 794000KHz | 166666KHz |  throttled=0x60002 | read_ring_osc(2)=3.088MHz (@1.2063V) (66.6'C)
| 65.5'C | 66.1'C |  300MHz | 848000KHz | 166667KHz |  throttled=0x60002 | read_ring_osc(2)=3.089MHz (@1.2063V) (65.5'C)
| 66.0'C | 65.5'C |  300MHz | 740000KHz | 166667KHz |  throttled=0x60002 | read_ring_osc(2)=3.081MHz (@1.2063V) (66.1'C)
| 65.5'C | 65.5'C |  300MHz | 740000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.092MHz (@1.2063V) (65.5'C)
| 66.0'C | 66.1'C |  300MHz | 848000KHz | 166667KHz |  throttled=0x60002 | read_ring_osc(2)=3.090MHz (@1.2063V) (65.5'C)
| 65.5'C | 65.5'C |  300MHz | 848000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=3.092MHz (@1.2063V) (65.5'C)

--- RPI BLOCKED ---

At the moment I'm doing the last tests on 4.14.21 to 4.14.24, because I need the patch "sc16is7xx: Fix for multi-channel stall", which has been available since 4.14.21.

Hope this data can be useful to someone.

Thank you.

MichaIng commented 3 years ago

Interesting find. Which CPU scheduling governor did you use for these tests? Note that merely reading the temps/stats from those files, and especially executing vcgencmd, raises CPU load, leading to a raised frequency while looping through the steps of a single iteration. I'm not 100% sure what the "frequency capped" state (second last bit, the two at the end) compared to the "throttled" state (3rd last bit, the 2+4=6 at the end) means, but with "frequency capped" the frequency is likely not forced to the minimal one, but capped in stages depending on the temperature. The CPU load can then explain the ARM frequencies of up to 848000KHz (the maximum was never hit, so I don't see those going "crazy") at kernel 5.4.77.

The 1148000KHz at lower kernel versions indeed looks like a bug, but since kernels 5.4.77 and up do not show this behaviour, I guess there is no motivation to fix or even investigate this. Especially since I think vast parts of the scaling driver have been moved to the upstream implementation, hence the old code the commits refer to is likely gone completely.

Finally, I'd love to have a testing branch with lowering arm_freq_min enabled on Linux 5.10 for all RPi models, so interested users can keep testing and investigating it to hopefully narrow down the underlying issue, especially since also RPi 4 suffers from problems with this, where it is still enabled.

TheyKilledKenny commented 3 years ago

Interesting find. Which CPU scheduling governor did you use for these tests? Note that merely reading the temps/stats from those files, and especially executing vcgencmd, raises CPU load, leading to a raised frequency while looping through the steps of a single iteration. I'm not 100% sure what the "frequency capped" state (second last bit, the two at the end) compared to the "throttled" state (3rd last bit, the 2+4=6 at the end) means, but with "frequency capped" the frequency is likely not forced to the minimal one, but capped in stages depending on the temperature. The CPU load can then explain the ARM frequencies of up to 848000KHz (the maximum was never hit, so I don't see those going "crazy") at kernel 5.4.77.

The 1148000KHz at lower kernel versions indeed looks like a bug, but since kernels 5.4.77 and up do not show this behaviour, I guess there is no motivation to fix or even investigate this. Especially since I think vast parts of the scaling governor have been moved to the upstream implementation, hence the old code the commits refer to is likely gone completely.

Finally, I'd love to have a testing branch with lowering arm_freq_min enabled on Linux 5.10 for all RPi models, so interested users can keep testing and investigating it to hopefully narrow down the underlying issue, especially since also RPi 4 suffers from problems with this, where it is still enabled.

The governor is still the default; I tried to keep things as standard as possible.

From what I understood:

at least this is how 4.9.35 behaves.

I wrote 'gets "crazy"' because it rises for no apparent reason and the calls to vcgencmd are always the same (same load). I have an entire day logged in the throttled state, where the frequency is always high, and the entire following day in the non-throttled state, where the frequency is always at the minimum. Without any load on the CPU, the frequency in the throttled state should be kept as low as in the non-throttled state, all the more so, I suppose. I'm asking whether vcgencmd measure_clock arm returns the correct current value or whether it is bugged in the throttled state. If it is the correct value, then it makes no sense to me.

The test of 5.4.77 was a one shot. It froze very soon, so I don't know if the frequency rises more than that, but it has the same "strange" behaviour when in throttled state.

For vcgencmd get_throttled the bit map is the following:

111100000000000001010
||||             ||||_ under-voltage
||||             |||_ currently throttled
||||             ||_ arm frequency capped
||||             |_ soft temperature reached
||||_ under-voltage has occurred since last reboot
|||_ throttling has occurred since last reboot
||_ arm frequency capped has occurred since last reboot
|_ soft temperature reached since last reboot
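For reading the flag values in the logs, a small decoder can be sketched as below. Note one assumption: the bit names follow the order that, as far as I can tell, the official `vcgencmd get_throttled` documentation uses (bit 1 = Arm frequency capped, bit 2 = currently throttled), which is the reverse of the diagram above for those two bits and matches MichaIng's reading earlier in the thread. Please check it against the firmware you run. The function name is made up for illustration.

```python
# Bit position -> meaning, assuming the ordering of the official
# `vcgencmd get_throttled` documentation (bits 1 and 2 differ from the diagram above).
THROTTLED_BITS = {
    0: "under-voltage detected",
    1: "arm frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "arm frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(value):
    """Decode an int, or a string such as 'throttled=0x60002', into a list of flag names."""
    if isinstance(value, str):
        # Accept both "0x60002" and the raw "throttled=0x60002" output line.
        value = int(value.split("=")[-1], 16)
    return [name for bit, name in sorted(THROTTLED_BITS.items()) if value & (1 << bit)]
```

Under that assumed ordering, `decode_throttled("throttled=0x60002")` reports the frequency as currently capped (plus the two "has occurred" flags), while 0x60006 additionally reports "currently throttled".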

Let me know if you need more data.

TheyKilledKenny commented 3 years ago

Done some more testing.

I can confirm that

This is the log using 4.14.24 (2659c9e87b574b3b05eacef80961c404ed0f0ce3), the last working:

Here you can see how the ARM frequency behaves before, during and after a throttled state. Freq (cpufreq / scaling_cur_freq) and ARM (vcgencmd arm_freq) show similar values, in line with each other; moreover, the read values are almost always round (350000, 900000 and not 1147998 or 793998) in both states.

|   CPU  |  GPU   |   Freq  |   ARM     |   CORE    |     FLAGS          |      other info
+--------+--------+---------+-----------+-----------+--------------------+-------------------------------------
| 58.3'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.963MHz (@1.2000V)
| 59.4'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.964MHz (@1.2000V)
| 59.4'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.979MHz (@1.2063V)
| 58.9'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.961MHz (@1.2000V)
| 59.4'C | 64.5'C |  350MHz | 900000KHz | 250000KHz |  throttled=0x20000 | read_ring_osc(2)=2.978MHz (@1.2063V)
| 59.4'C | 65.0'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x20000 | read_ring_osc(2)=2.974MHz (@1.2063V)
| 59.9'C | 65.5'C |  350MHz | 900000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.974MHz (@1.2063V)
| 59.9'C | 64.5'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.976MHz (@1.2063V)
| 60.5'C | 65.5'C |  900MHz | 899998KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.973MHz (@1.2063V)
| 60.5'C | 65.0'C |  350MHz | 350002KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.961MHz (@1.2000V)
| 59.4'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.962MHz (@1.2000V)
| 59.9'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.978MHz (@1.2063V)
| 59.9'C | 65.5'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.979MHz (@1.2063V)
| 60.5'C | 65.5'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.973MHz (@1.2063V)
| 60.5'C | 65.5'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.972MHz (@1.2063V)
| 60.5'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.956MHz (@1.2000V)
| 60.5'C | 65.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.960MHz (@1.2000V)
| 59.9'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.960MHz (@1.2000V)
| 59.9'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.957MHz (@1.2000V)
| 59.4'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.959MHz (@1.2000V)
| 59.4'C | 65.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.958MHz (@1.2000V)
| 59.4'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.959MHz (@1.2000V)
| 59.9'C | 64.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20000 | read_ring_osc(2)=2.960MHz (@1.2000V)

In the following lines you can see what happens when you enter the Capped state. In the 3rd line I start putting load on the CPU (see Freq = 900) using our software (a Java 1.8 application that reads and writes via an sc16is752 SPI-to-UART chip). When the rpi reached the temperature limit, it started capping ARM, and in this state you can see that even when I load the CPU (Freq = 900), the ARM frequency always stays at the minimum arm_freq_min set in config.txt. As soon as the temperature drops and the Capped state is removed, the ARM frequency again follows the scaling requests we see in Freq; no "crazy" frequency here. In all my tests I noted that Freq (cpufreq/scaling_cur_freq) should be what you "need" and ARM (vcgencmd arm_freq) should be what you get.

|   CPU  |  GPU   |   Freq  |   ARM     |   CORE    |     FLAGS          |      other info
+--------+--------+---------+-----------+-----------+--------------------+-------------------------------------
| 64.8'C | 69.8'C |  350MHz | 350000KHz | 149999KHz |  throttled=0x20002 | read_ring_osc(2)=2.951MHz (@1.2000V)
| 64.8'C | 69.3'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x20002 | read_ring_osc(2)=2.952MHz (@1.2000V)
| 64.8'C | 69.8'C |  900MHz | 726000KHz | 250000KHz |  throttled=0x20002 | read_ring_osc(2)=2.967MHz (@1.2063V)
| 65.9'C | 70.5'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.950MHz (@1.2000V)
| 66.4'C | 71.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.949MHz (@1.2000V)
| 66.4'C | 71.4'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.949MHz (@1.2000V)
| 66.8'C | 71.9'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.949MHz (@1.2000V)
| 66.9'C | 72.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.951MHz (@1.2000V)
| 67.5'C | 72.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.948MHz (@1.2000V)
| 66.9'C | 72.0'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.948MHz (@1.2000V)
| 66.9'C | 72.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.949MHz (@1.2000V)
| 66.9'C | 72.0'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.947MHz (@1.2000V)
| 66.0'C | 71.1'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.946MHz (@1.2000V)
| 63.0'C | 70.5'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.945MHz (@1.2000V)
| 62.9'C | 69.5'C |  900MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.945MHz (@1.2000V)
| 63.6'C | 68.8'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60006 | read_ring_osc(2)=2.959MHz (@1.2000V)
| 62.6'C | 68.2'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.960MHz (@1.2000V)
| 63.2'C | 68.2'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.959MHz (@1.2000V)
| 62.6'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.959MHz (@1.2000V)
| 63.2'C | 68.8'C |  900MHz | 890000KHz | 250000KHz |  throttled=0x60002 | read_ring_osc(2)=2.977MHz (@1.2063V)
| 62.6'C | 68.2'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.958MHz (@1.2000V)
| 63.2'C | 68.8'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.958MHz (@1.2000V)
| 62.6'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.957MHz (@1.2000V)
| 62.6'C | 68.8'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.960MHz (@1.2000V)
| 63.2'C | 68.2'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.956MHz (@1.2000V)
| 62.6'C | 68.2'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.957MHz (@1.2000V)
| 63.2'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.957MHz (@1.2000V)
| 62.6'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.978MHz (@1.2063V)
| 62.6'C | 67.1'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.958MHz (@1.2000V)
| 63.2'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.959MHz (@1.2000V)
| 62.6'C | 68.2'C |  900MHz | 900000KHz | 250000KHz |  throttled=0x60002 | read_ring_osc(2)=2.975MHz (@1.2063V)
| 62.6'C | 67.7'C |  350MHz | 350000KHz | 150000KHz |  throttled=0x60002 | read_ring_osc(2)=2.959MHz (@1.2000V)
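For anyone who wants to collect comparable samples, the sketch below reads the same kind of values as the Freq, ARM, CORE and FLAGS columns above. It is not the script used for these logs: it assumes the ARM and CORE values come from `vcgencmd measure_clock arm` / `core` (as mentioned earlier in the thread), that Freq is cpu0's `scaling_cur_freq` in sysfs, and all function names are hypothetical.

```python
import subprocess

# Frequency requested by the kernel's cpufreq governor, in kHz.
CPUFREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"


def parse_value(output):
    """Return the text after '=' in a vcgencmd output line such as throttled=0x60002."""
    return output.strip().split("=", 1)[1]


def parse_clock_khz(output):
    """Convert measure_clock output, e.g. frequency(45)=350000000 (Hz), to kHz."""
    return int(parse_value(output)) // 1000


def parse_temp_c(output):
    """Convert measure_temp output, e.g. temp=64.8'C, to a float in degrees Celsius."""
    return float(parse_value(output).rstrip("'C"))


def vcgencmd(*args):
    """Run vcgencmd with the given arguments and return its stdout."""
    result = subprocess.run(
        ["vcgencmd", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def sample():
    """Take one sample of requested vs. measured clocks, temperature and flags."""
    with open(CPUFREQ_PATH) as f:
        requested_khz = int(f.read())
    return {
        "requested_khz": requested_khz,  # what the governor asks for ("need")
        "arm_khz": parse_clock_khz(vcgencmd("measure_clock", "arm")),  # what you get
        "core_khz": parse_clock_khz(vcgencmd("measure_clock", "core")),
        "temp_c": parse_temp_c(vcgencmd("measure_temp")),
        "throttled": int(parse_value(vcgencmd("get_throttled")), 16),
    }
```

Calling `sample()` in a loop with a sleep between iterations gives one row per interval. As MichaIng noted, each vcgencmd call itself adds some CPU load, so a longer interval keeps the measurement from disturbing the governor too much.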

@popcornmix What changed in this commit of yours regarding the frequency/voltage scaling logic? https://github.com/Hexxeh/rpi-firmware/commit/bdb826a8db75ba36d754bd71fb64d3905d3bd026 What was broken in that commit was never fixed, and I've tried almost all of them.

Attached you can find the complete day of this log with timestamps. If really needed, I can post the same log using 5.4.x or 5.10.x, but the strange behaviour is there.

TempAndFreq_4.14.24.zip

popcornmix commented 3 years ago

The commit message for the "Rework the frequency/voltage scaling logic" is:

The Pi3+ requirement for an extra voltage step doesn't fit the existing turbo/non-turbo logic. The existing code has been added to so many times with new requirements and it is getting very hard to maintain. This is a rewrite that tries to put all of the platform specific freq/voltage/temperature scaling code into a single function in platform. No more turbo mode, but extend the idea of boost frequencies whenever frequencies above idle ones are requested. All frequency changes come though set_turbo_mode which adjusts everything to latest current boost values. Unchanged clocks should be skipped by comparing current with cached values. Changes in behaviour: a gpu boost (e.g. for playing 60fps H.264 video) won't trigger an arm boost. I think this makes more sense and if previous behaviour is desired we can always request the arm boost directly from video_decode.

and it is a rewrite that consisted of 35 commits. It's not something that can be trivially described.

But I'm not sure what the value is in debugging a 3 year old version of the firmware. Can you explain simply what the problem is with the latest firmware/kernel?

I know arm_freq cannot be reduced below 600MHz. That is an issue we don't currently have a solution for (without introducing instability), and so it is currently unsupported.

TheyKilledKenny commented 3 years ago

But I'm not sure what the value is in debugging a 3 year old version of the firmware. Can you explain simply what the problem is with the latest firmware/kernel?

I know arm_freq cannot be reduced below 600MHz. That is an issue we don't currently have a solution for (without introducing instability), and so it is currently unsupported.

Reading costs a lot less time than writing; you could have made a little effort, given all the time I'm wasting on it. (bool respectForPeople = false;)

But if a problem in the frequency management of the cpu is not important to you... Okay.

Unfortunately, with the 5.10.x kernel you have introduced another bug related to SPI and the handling of the SC16IS752 chip (CPU always at 100%), so at the moment it is not an option for us and it makes no sense for me to do more tests. This would be your job, not mine; given your careless answer, I think I have already wasted too much time on it.

(SPI problem: here are the first 3 process rows of the top command using kernel 5.10.52)

  191 root     -51   0       0      0      0 R  94,6  0,0   4:47.31 irq/199-spi0.0
  186 root      20   0       0      0      0 R  46,2  0,0   2:23.20 spi0
 2493 root      20   0    5760   2608   2112 R   0,7  0,3   0:00.51 top

Due to this other SPI bug THERE IS NO KERNEL we can use if above 4.19.24 !!!!!!!!!!!!!!!!!!!!

BUT

I see from your too hasty (and I think useless) answer that you don't worry about it, you don't care about it and instead of collaborating, you prefer to hide behind the formality of what is officially supported.

Playing your role game, we need to remind you that we have purchased more than 4000 pieces almost exclusively for:

Your unilateral decision to remove this vital feature for us on March 31, 2021 (https://github.com/raspberrypi/firmware/issues/1431#issuecomment-811151218), without giving any prior notice and also clearly showing unwillingness to restore it, is clearly a violation of our rights as customers.

At this time it is not possible for us to use the products that we have regularly purchased and paid for, without accepting important security risks, which is not acceptable.

This reinforces my opinion that, as mentioned in a previous post that raspberrypi deleted, the Raspberry PI is not ready yet and is not intended for a market other than retro gamers or hobby projects, despite the time that has elapsed since RPI1

Have a nice day.

pelwell commented 3 years ago

Reading costs a lot less time than writing; you could have made a little effort, given all the time I'm wasting on it. (bool respectForPeople = false;)

It's about scale/gearing - there are many of you and very few of us. You are already diluting the utility of our time by requesting support for an obsolete version of our kernel, and you clearly know your own situation, so I don't consider asking for a concise restatement of a problem or requirement to be disrespectful. On the whole I think we do pretty well on the respect front, even with some of the more challenging members of the community (and I'm thinking of nobody in particular, and definitely not you).

JamesH65 commented 3 years ago

So, I have read this thread, and this is the precis I get (ignoring the SPI stuff, which seems to be a different issue and should have its own thread):

Customer is using a 2B, and it appears to be right on the edge of acceptable thermals. Customer has reduced the temperature limit to 70degC, reason is unclear. To further reduce temperature of device installation they wish to reduce the min ARM frequency below the current 600 minimum, as this apparently reduced temperatures by up to 5degC.

Unfortunately, reducing the minimum frequency is an unusual use case, the vast majority of users want more power, not less. AIUI, the current frequency management is targeted more at the majority, which seems the obvious choice given the limited HW support for all the frequencies required to be generated.

I don't know what the options are here; to break the frequency management for most users to satisfy this use case seems counterproductive. I think it would be worth the customer retesting temperatures (if they haven't already) with the latest kernel and firmware, as it's certainly possible that the better management of the device with this combination may well give better results than dropping the ARM frequency. I'm not sure why the high temp limit has been reduced to 70; I would also try things with that removed.

MichaIng commented 3 years ago

I urge everyone to stop blaming anyone else for anything else and keeping this thread productive. Development time is limited and given the outstanding large user space of RPi vs any other SBC manufacturer, the Raspberry Pi foundation does an awesome job, obviously 🌞! Many other manufacturers do (nearly) no official kernel and/or userspace development, just to make clear what to compare with. Furthermore please stop posting/discussing unrelated topics, the SPI issue as well as the "crazy" frequencies on throttled states need to get into their own issues.

Back on the actual issue:

@JamesH65

to break the frequency management for most users to satisfy this use case

Re-enabling an optional feature does not affect anyone else and does not "break the frequency management for most users".

I think it would be worth the customer retesting temperatures (if they haven't already) with the latest kernel and firmware,

I would love to, i.e. comparing power consumption and temperatures with and without ARM frequency lowered on non-RPi4 with latest kernel, but since the feature is disabled, I cannot 🤔.

JamesH65 commented 3 years ago

AIUI, the frequency management in the firmware has changed, a lot. As popcornmix states above, reintroducing the ability to go below 600 causes instability, for which we do not have a fix, without affecting the other frequency management. That's the point. Other stuff will suffer if this feature is reintroduced. The whole area of frequency management, thermal throttling etc is very complex, and made even more complicated because of the small number of PLL's available, and the large number of set frequencies we need to generate for the various peripherals. If it were as easy as "turn it back on", we would have done that. If it was as easy as "turn it back on and tweak a few things" we would also have done that!

My point about testing was: does the latest firmware without the <600 feature match the power consumption of the previous firmware with the 600 dropped lower? That CAN be tested. In which case you don't need the <600 feature to match the consumption you had before. I have no idea if that will be the case, but the whole point of these management changes is to reduce power requirements overall.
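One way to make that comparison concrete is to log idle temperatures under both firmware versions and compare the averages. The sketch below only assumes logs in the row format posted earlier in this thread, where the first column is the CPU temperature; the function name and the file names in the usage note are hypothetical, and temperature is used here merely as a rough proxy for power consumption.

```python
import re
from statistics import mean

# First column of a log row: an optional leading "|" followed by e.g. 62.6'C
TEMP_RE = re.compile(r"^\s*\|?\s*(\d+(?:\.\d+)?)'C")


def mean_cpu_temp(lines):
    """Return the mean of the first temperature column, or None if no row matches."""
    temps = [float(m.group(1)) for m in map(TEMP_RE.match, lines) if m]
    return mean(temps) if temps else None
```

For example, `mean_cpu_temp(open("old_firmware.log")) - mean_cpu_temp(open("new_firmware.log"))` would give the average difference in degrees Celsius, provided both logs were taken at the same ambient temperature and load.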

MichaIng commented 3 years ago

Thanks for clarification!

reintroducing the ability to go below 600 causes instability

More than with the last 5.4 kernel which allowed it? Since with that one it was stable for me. But I guess the assessment of instability is based on more than my test results, shared further above that time.

And is it so much different on RPi4, where it is still possible, despite the instability that can be in fact caused with it as well, as linked?

If there is an obvious hardware limitation on older RPi models, regarding the number of PLL's, then we need to accept it. I mean of course with kernel 4.19 => 5.4 the major change was the implementation of the intermediate frequency states, having many new states especially with reduced min frequency.

My point about testing was does the latest firmware without the <600 feature match the power consumption of the previous firmware with the 600 wghen dropped lower.

Ah okay, above was stated:

The latest kernel simply ignores the arm_freq_min parameter and this results in increasing the cpu temperature by about 3 to 5°C during normal operation with no load on the cpu (vcgencmd measure_temp arm). Unfortunately this also leads to a much faster rise in CPU temperature as the load rises.

Probably I find time to test this as well. But actually, even if the newer kernel would be more efficient elsewhere, this would not break the argument, as lowering power consumption and heat dissipation further would be still better. I wouldn't see this as user-specific question, whether it is "sufficient" or not as it is now, but it would be an enhancement in every case. But of course I cannot evaluate whether this can be achieved with acceptable effort and satisfying outcome overall.

JamesH65 commented 3 years ago

I believe there are more PLL's on the 2711, BUT there are also more peripherals to supply frequencies for. So there is still some juggling required to sort out all the clocks. The rule is generally, when you have n PLL's the number you actually need is n+1. PLL's take up a lot of silicon though which is why we always get n where n is too small.

I'm not an expert on the link between the firmware and the Linux kernel frequency management (that's popcornmix), but the ultimate arbiter is the firmware, which has control as it needs to ensure temperatures don't get too high.

TheyKilledKenny commented 3 years ago

I urge everyone to stop blaming anyone else for anything else and keeping this thread productive. Development time is limited and given the outstanding large user space of RPi vs any other SBC manufacturer, the Raspberry Pi foundation does an awesome job, obviously 🌞! Many other manufacturers do (nearly) no official kernel and/or userspace development, just to make clear what to compare with. Furthermore please stop posting/discussing unrelated topics, the SPI issue as well as the "crazy" frequencies on throttled states need to get into their own issues.

Back on the actual issue:

All the data I produced here are related to tests conducted for this specific issue on the 5.4.x kernel ([Kernel5.4] Lowering arm_freq_min leads to system hang/crash); I started testing 1 year ago. All the tests I have done with previous kernel versions are for the sole purpose of figuring out where the problem was introduced (since it used to work before) and providing you with more information.

The throttling problem is related to this because during all my tests the 5.4.x firmware hang/crash (as well as for all the other kernels tested) seemed more related to the change of state of the throttling, for this reason I conducted more specific tests on those states and I noticed that there was a problem there, this is why I decided to report the collected data on this issue. But you are the experts and if you say it has nothing to do with it, I believe it.

I think that there is no need to have many frequencies, if this is the problem of instability. For most purposes it would be sufficient even to have a minimum and a maximum frequency, with the possibility of adjusting the desired values as allowed by the hardware that exists (no need for more pll or anything else).

SPI has nothing to do with it here, it's just to say I can't test 5.10.x. for other reasons, but this issue is still for 5.4.x

Nothing against raspberry pi and I understand and appreciate all the reasons and merits, but I have to draw my conclusions and act accordingly. I found it not very comforting to be completely ignored and to receive too hasty answers (even considering all the more specific data you provided too) for a problem at this level that affects a function (I would say) important, and could also involve something else (as I seemed to understand from other messages in this issue).

Thanks to all.

JamesH65 commented 3 years ago

Nothing to do with the problem, but in order to improve our support, it would be interesting to know why you feel you have been ignored, or you think replies have been hasty. Reading this thread, your posts have been replied to fairly swiftly by the right engineers, albeit not with the results you were hoping for. As for the hasty answers, I've also read those and whilst brief, they all appear to be accurate. Our engineers are spread very thin, that means replies can be brief as we have a lot of work to do. So would be interested in the reasoning if you have time to comment.

TheyKilledKenny commented 3 years ago

Nothing to do with the problem, but in order to improve our support, it would be interesting to know why you feel you have been ignored, or you think replies have been hasty. Reading this thread, your posts have been replied to fairly swiftly by the right engineers, albeit not with the results you were hoping for. As for the hasty answers, I've also read those and whilst brief, they all appear to be accurate. Our engineers are spread very thin, that means replies can be brief as we have a lot of work to do. So would be interested in the reasoning if you have time to comment.

Thank you for the question even though I think we bore everyone else here, it is difficult to answer this question without entering into controversy.

I do not understand which of the answers you are referring to. They probably answered quickly, but an answer should first be reasoned and then also filled with content. So far I have only read answers from people who wrote that they did not have time to investigate or that it was not worth doing, using reasons that had already been refuted or explained in the same message they were replying to. Answers like this one https://github.com/raspberrypi/firmware/issues/1431#issuecomment-883618381 or like: "You are already diluting the utility of our time by requesting support for an obsolete version of our kernel", the latter from @pelwell, besides not being very nice, show that they did not even read the message they were replying to, where I clearly wrote that I have encountered the problem on ALL versions 4.19 and 5.4.x (5.4.x is the target of this issue), and that the problem found is the title of this issue. And so I say polemically: maybe they are the ones who took too long to get interested in this issue (they still are not), and now we have a new kernel?

After all the tests I have carried out (1 year) I stated in my messages, with absolute certainty, that:

But the answer @popcornmix gave me clearly demonstrates that he considers the problem too unimportant to take the time to investigate, and (as if to tease me) he also asks me: "Can you explain simply what the problem is with the latest firmware / kernel?", forgetting that this issue is about 5.4.x and that the problem is, as always, the title of this issue.

All the answers received (except the one from @MichaIng) were limited to highlighting my inaccuracies (ignorant as I am), but no real answer, concrete solution or at least a willingness to act was given, apart from something that sounds like: "we have blocked it, now the problem is not there, so don't bother us anymore". I understand the time constraints, I understand the problems, but this is not an answer I expect from a technician. In none of the answers have I read any intention to solve it.

Some of the reasons I read are even more worrying. How can it be said nowadays that those who buy an rpi2b just want to increase performance? Those who want performance today buy an rpi3 at a similar price or the more powerful rpi4, certainly not an rpi2 or rpi1. Your answers therefore indicate very clearly to me that the rpi2b is no longer actively supported and that you are focused mainly on the rpi4. Declaring it openly would be more honest. The rpi2b is a low-power embedded board, not a gaming machine.

So in the end, from the answers I received from you, I realized that:

I apologize for my bad English; I hope I have been able to explain the reason for my frustration. Do not worry: since, according to you, this CPU frequency management problem is irrelevant, concerns only me and does not interest anyone else, I will not bother you anymore in the future.

Have a nice day.

MichaIng commented 3 years ago

I still think it makes sense to split the issue up, as the collected information refers to different kernel versions and different RPi models, some of it related to reduced idle frequency, some not. Sure, they could be related, but it might be easier to investigate each issue in isolation, so that engineers can better reproduce them or get targeted debug results. My suggestion would be:

Does this make sense?

And to help your particular case @TheyKilledKenny, did you go through all the other possibilities to disable hardware features and thereby lower power consumption (== heat dissipation)? As a related question came up, I collected the things I applied on my headless RPi2 here: https://github.com/raspberrypi/firmware/issues/1577#issuecomment-891811909 I'm not sure which of these really have an effect on power consumption; only for tvservice -o do I know that it definitely does. I can add some more details about how to compile and install the device tree overlays, if required. This, combined with lowered voltage and the schedutil governor, is at least worth testing to get back to good temperature results.

C0D3-M4513R commented 3 years ago

@MichaIng I could retest in a week. Personal constraints currently make it impossible for me to test anything or to access a PC.

TheyKilledKenny commented 3 years ago

* Then we have the SPI issue: [[Kernel5.4] Lowering arm_freq_min leads to system hang/crash #1431 (comment)](https://github.com/raspberrypi/firmware/issues/1431#issuecomment-890969610)
  This without question deserves its own issue, probably on the kernel repository? I checked existing ones; probably [this](https://github.com/raspberrypi/linux/issues/4351) is similar, with at least the same driver involved? There are some related kernel errors shown and some debug steps. @TheyKilledKenny perhaps you are able to verify whether this matches your case. While that report does not include CPU usage, it is probably simply not that significant on Raspberry Pi 4 compared to Pi 2.

Thank you very much. Too early to talk about SPI. At the moment I have not investigated too much about the SPI problems on new kernels because, lacking the possibility to reduce the frequency, it was of no use to me. I also thank you for your suggestions, but my particular case has no hope if the possibility of lowering the cpu/core frequency is not restored.

Thank you.

FYI, about my particular use case (not important for this issue):

I have already tried your valuable suggestions, including undervoltage, but even though I achieved some small results, the only real way to keep the temperature low with concrete results is to lower the working frequency. At 300 MHz we get working temperatures 5 to 10 degrees lower (depending on the position) than at 600 MHz. With throttling at 65 degrees and a cap at 70 degrees, working at 300 MHz we can be fairly sure not only that we will never reach the critical temperature of 90 degrees, but also that the board will spend most of its time at an optimal temperature between 43 and 63 degrees, ensuring a longer life. The default minimum of 600 MHz cannot ensure a cool-down in every condition, mainly when the ambient temperature goes over 43 degrees. Under these conditions the temperature continues to rise, and this rise is more evident during throttling in the kernels where reading the ARM frequency returns values greater than the minimum expected in a limited state (the behavior I called "crazy"). If someone needs this particular behavior, this commit (kernel 4.19.24) is the last one able to manage it without a system hang/crash: https://github.com/Hexxeh/rpi-firmware/commit/2659c9e87b574b3b05eacef80961c404ed0f0ce3
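For anyone who wants to compare kernels on this point, here is a minimal sketch of a logger that prints the SoC temperature next to the ARM clock reported by cpufreq. It is only an illustration under assumptions: the two sysfs paths are the standard Linux locations and are assumed to exist on the board, the 300 MHz floor and 65 degree threshold are just the values mentioned above, and the helper names are made up for this example. The cpufreq value is what the kernel reports, which may not match what the firmware actually applies.

```python
#!/usr/bin/env python3
"""Log SoC temperature next to the ARM clock reported by cpufreq."""
import time
from pathlib import Path

# Standard Linux sysfs locations; assumed to be present on the board under test.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def parse_temp_c(raw: str) -> float:
    """Convert a sysfs thermal reading (millidegrees Celsius) to degrees."""
    return int(raw.strip()) / 1000.0


def parse_freq_mhz(raw: str) -> float:
    """Convert a cpufreq reading (kHz) to MHz."""
    return int(raw.strip()) / 1000.0


def above_floor(freq_mhz: float, floor_mhz: float, tolerance_mhz: float = 1.0) -> bool:
    """Return True if the reported clock is higher than the expected minimum."""
    return freq_mhz > floor_mhz + tolerance_mhz


def sample():
    """Read one (temperature in C, frequency in MHz) pair from sysfs."""
    return parse_temp_c(TEMP_PATH.read_text()), parse_freq_mhz(FREQ_PATH.read_text())


if __name__ == "__main__":
    FLOOR_MHZ = 300.0   # assumption: the arm_freq_min value being tested
    THROTTLE_C = 65.0   # assumption: temperature at which throttling is expected
    while True:
        temp_c, freq_mhz = sample()
        flag = ""
        if temp_c >= THROTTLE_C and above_floor(freq_mhz, FLOOR_MHZ):
            flag = "  <- above expected minimum while hot"
        print(f"{temp_c:5.1f} C  {freq_mhz:7.1f} MHz{flag}", flush=True)
        time.sleep(5)
```

Running it under load on two kernels and comparing the flagged lines would show whether the reported clock really falls to the configured minimum once the board is hot.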

I'm not a fan of under/overvoltage. Working with many pieces, I believe that components should work at the specified voltage to avoid possible fluctuations in manufacturing error tolerances, but I tried that too.

Really thank you.

MichaIng commented 3 years ago

> Working with many pieces, I believe that components should work at the specified voltage to avoid possible fluctuations in manufacturing error tolerances, but I tried that too.

That is indeed a valid concern that I hadn't thought about. I'm pretty sure the stable lower voltage limit depends on the individual device, and actual issues caused by too low a voltage can include randomly occurring data loss (especially on USB drives), so it needs to be tested and monitored from time to time on each single system, which is not really applicable to a farm.