vladkinoman / just-my-notes

Sudden PC shutdown on Linux Mint 20 while gaming on the discrete AMD GPU #4

Open vladkinoman opened 4 years ago

vladkinoman commented 4 years ago

Sudden PC shutdown on Linux Mint 20 while gaming on the discrete AMD Radeon 8750M GPU

Description

The computer suddenly shuts down on Linux Mint 20 Xfce while gaming on the discrete AMD Radeon 8750M GPU. The shutdown occurs about 5 minutes after the game starts. I think it happens when the CPU temperature reaches 105 C (because that is the critical temperature according to sensors). I know that a hardware temperature of 70 °C and above is too hot, and could cause your system to crash.

People use the following command to check the overheating hypothesis:

$ grep -i -e temp -e therm /var/log/syslog*

Here is what I found out:

/var/log/syslog.1:Aug 12 23:39:56 HP-ProBook-450-G0 kernel: [24659.739233] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
/var/log/syslog.1:Aug 12 23:39:56 HP-ProBook-450-G0 kernel: [24659.739235] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
/var/log/syslog.1:Aug 12 23:39:56 HP-ProBook-450-G0 kernel: [24659.739236] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
/var/log/syslog.1:Aug 12 23:39:56 HP-ProBook-450-G0 kernel: [24659.739238] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
/var/log/syslog.1:Aug 12 23:39:56 HP-ProBook-450-G0 kernel: [24659.739239] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
/var/log/syslog.1:Aug 12 23:39:56 HP-ProBook-450-G0 kernel: [24659.739241] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)

So, this is clearly an overheating problem.
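
To see how many throttling messages each CPU logged, the grep output can be summarized. This is only a minimal sketch: it assumes the kernel message format shown above, and the function name count_throttle_events is made up for illustration.

```shell
#!/bin/sh
# count_throttle_events: read syslog/dmesg lines on stdin and print
# "<count> <CPU> <Core|Package>" for each kind of throttling message.
# Assumes the "mce: CPUn: Core|Package temperature above threshold" format.
count_throttle_events() {
    grep -o 'CPU[0-9]*: [A-Za-z]* temperature above threshold' |
        awk '{ sub(":", "", $1); print $1, $2 }' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Usage (hypothetical):
#   grep -h mce /var/log/syslog* | count_throttle_events
```

A high count for the Package lines on every CPU would point at the whole chip getting hot rather than a single core.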

People also suggest running the dmesg command after boot and before starting applications.

Everything is fine on Windows 10. When the temperature reaches 105 degrees, the computer does not shut down. After a while, one of the fans (possibly the GPU fan) spins up fully and starts to make a lot of noise. After that, the temperature drops to 85-90. Linux evidently has a much stricter policy, since it powers off when it reaches 105 degrees. I'm glad Linux takes care of the hardware, but why do these operating systems behave so differently? Maybe Linux just doesn't know that it can spin the fan up as well?

Speaking of policies: someone found an answer to a similar question that is based on this bug. All you need to do is upgrade the kernel to the next version, although somebody thinks it's better to roll back to a previous version. I discuss this below, in the "Possible answers" section.
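
The temperatures at which the kernel reacts can be read from the thermal zone trip points under /sys/class/thermal. The sketch below lists them. The function names are made up, and I am assuming the usual sysfs layout (trip_point_N_type and trip_point_N_temp files, values in millidegrees Celsius); not every machine exposes every zone.

```shell
#!/bin/sh
# millideg_to_c: convert a sysfs value in millidegrees Celsius
# (e.g. 105000) to whole degrees Celsius.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# list_trip_points: print "<zone> <trip type> <temp> C" for every trip
# point of every thermal zone under the given root directory
# (default: /sys/class/thermal).
list_trip_points() {
    root=${1:-/sys/class/thermal}
    for type_file in "$root"/thermal_zone*/trip_point_*_type; do
        [ -r "$type_file" ] || continue
        temp_file=${type_file%_type}_temp
        [ -r "$temp_file" ] || continue
        zone=$(basename "$(dirname "$type_file")")
        echo "$zone $(cat "$type_file") $(millideg_to_c "$(cat "$temp_file")") C"
    done
}

# Usage:
#   list_trip_points
```

If some zone reports a critical trip point near 105 C, that would be consistent with the kernel powering off by itself at that temperature. The firmware can also cut power on its own, so an empty list does not rule overheating out.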

To reproduce

  1. Launch Steam in the terminal with the DRI_PRIME environment variable:
    $ DRI_PRIME=1 steam

    That's how you use a discrete video card.

  2. Start a heavy game. For instance, I launched The Witcher 2, although This War of Mine was enough.
  3. Do something and wait for the shutdown. During this time, I was checking the CPU temperature in the i7z application.
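
Before starting the game, it may be worth confirming that DRI_PRIME=1 really selects the discrete card. A small sketch, assuming glxinfo (from the mesa-utils package) is installed; renderer_of is a hypothetical helper name.

```shell
#!/bin/sh
# renderer_of: read glxinfo output on stdin and print only the name
# from the "OpenGL renderer string:" line.
renderer_of() {
    sed -n 's/^OpenGL renderer string: //p'
}

# Usage:
#   glxinfo | renderer_of               # default (integrated) GPU
#   DRI_PRIME=1 glxinfo | renderer_of   # discrete GPU, if PRIME offload works
```

If both commands print the same renderer, the game is probably not running on the AMD GPU at all.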

My observations (CPU temperature in C, as shown by i7z):

Idle:
    min - 55
    max - 60
AMD GPU in use:
    min - 88
    max - 103 (perhaps 105, but you can't see it because the computer shuts down)
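
Since the last reading is lost when the machine powers off, one option is to log the temperature to a file and flush it to disk after every sample. This is a sketch under assumptions: the function names are made up, and the sysfs path in the usage comment is only an example (the zone that belongs to the CPU differs between machines).

```shell
#!/bin/sh
# log_temp_once: append "<epoch seconds> <temp in C>" to a log file and
# flush it to disk, so the last sample survives a sudden power-off.
#   $1: sysfs file holding millidegrees Celsius
#   $2: log file to append to
log_temp_once() {
    raw=$(cat "$1") || return 1
    echo "$(date +%s) $(( raw / 1000 ))" >> "$2"
    sync
}

# log_temp_loop: sample once per second until interrupted or until
# the temperature file can no longer be read.
log_temp_loop() {
    while log_temp_once "$1" "$2"; do
        sleep 1
    done
}

# Usage (paths are assumptions):
#   log_temp_loop /sys/class/thermal/thermal_zone0/temp "$HOME/temp.log"
```

After the next boot, the last line of the log shows the temperature about a second before the shutdown.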

Possible answers

So, the first problem is overheating. However, what causes this overheating?

  1. Drivers. That's the most likely answer. Drivers are always a problem, especially on Linux, since many companies do not allow others to support drivers for their devices.
  2. Dust. I don't think so, but it needs checking.
  3. Thermal paste. Likely.
  4. Fans don't work properly. I think there are two of them. Maybe, the second one (GPU) just doesn't work on this system. Perhaps they work but not at full capacity. The problem is also that I cannot see them using sensors or other applications. It seems that the motherboard does not have any special pins to identify them.
  5. The kernel version is too new. People suggest rolling back to a previous version (e.g. 4.15).
  6. People also recommended thermald, but it didn't help me. However, it drew my attention to the ACPI thermal relationship table. There are no valid tables according to the dptfxtract program. Perhaps iasl should be used instead of dptfxtract, as is done in this section of the Ubuntu Wiki. Or maybe I ran into a bug in dptfxtract.

    Maybe I should turn ACPI off. I stumbled upon this idea completely by accident here. While looking into this problem, I found this blog post. There is a question about acpi=off here. I also found the ACPI thermal documentation and a tutorial on how to debug ACPI tables. They might be useful.

  7. Also, there is an interesting post on a kernel suspend bug and ACPI drivers.
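
For item 4, whether any driver exposes a fan sensor at all can be checked directly under /sys/class/hwmon, which is where sensors gets its data. A minimal sketch; list_fans is a made-up name, and I am assuming the standard hwmon layout with fanN_input files that hold RPM values.

```shell
#!/bin/sh
# list_fans: print "<chip name> <fan file> <rpm>" for every fan input
# found under the given hwmon root (default: /sys/class/hwmon).
# Prints nothing when no driver exposes a fan.
list_fans() {
    root=${1:-/sys/class/hwmon}
    for fan in "$root"/hwmon*/fan*_input; do
        [ -r "$fan" ] || continue
        dir=$(dirname "$fan")
        name=unknown
        [ -r "$dir/name" ] && name=$(cat "$dir/name")
        echo "$name $(basename "$fan") $(cat "$fan")"
    done
}

# Usage:
#   list_fans
```

Empty output would mean that Linux has no fan readings at all on this machine, which is a different situation from a fan that is visible but stuck at a low speed.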

The second problem is that Linux does not spin the fan up fully when overheating occurs (105 C). It seems to be easier for it to shut down than to do extra cooling. I don't even have any suggestions here. :disappointed:

Maybe I should try a different distribution. Or try a different desktop environment, like Mate.

vladkinoman commented 4 years ago

Right before the shutdown, I captured the output of dmesg:

[  863.035805] [UFW BLOCK] IN=wlo1 OUT= MAC=bc:85:56:14:de:01:f4:f2:6d:f1:66:4a:08:00 SRC=198.252.206.25 DST=192.168.1.103 LEN=113 TOS=0x08 PREC=0x60 TTL=49 ID=21004 DF PROTO=TCP SPT=443 DPT=42452 WINDOW=61 RES=0x00 ACK PSH URGP=0 
[  901.917889] [UFW BLOCK] IN=wlo1 OUT= MAC=bc:85:56:14:de:01:f4:f2:6d:f1:66:4a:08:00 SRC=198.252.206.25 DST=192.168.1.103 LEN=113 TOS=0x08 PREC=0x60 TTL=49 ID=5236 DF PROTO=TCP SPT=443 DPT=42614 WINDOW=61 RES=0x00 ACK PSH URGP=0 
[  922.682699] [drm] PCIE gen 3 link speeds already enabled
[  922.691438] [drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
[  922.691528] radeon 0000:01:00.0: WB enabled
[  922.691531] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0x000000006e0a1d81
[  922.691532] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0x000000004a90d85b
[  922.691533] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0x00000000c77f2878
[  922.691534] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0x00000000ebee22d8
[  922.691535] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0x00000000e264e6f8
[  922.691754] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0x00000000a0346b3c
[  922.792277] radeon 0000:01:00.0: failed VCE resume (-110).
[  922.958682] [drm] ring test on 0 succeeded in 1 usecs
[  922.958687] [drm] ring test on 1 succeeded in 1 usecs
[  922.958690] [drm] ring test on 2 succeeded in 1 usecs
[  922.958697] [drm] ring test on 3 succeeded in 3 usecs
[  922.958702] [drm] ring test on 4 succeeded in 3 usecs
[  923.134685] [drm] ring test on 5 succeeded in 2 usecs
[  923.134691] [drm] UVD initialized successfully.
[  923.134720] [drm] ib test on ring 0 succeeded in 0 usecs
[  923.134762] [drm] ib test on ring 1 succeeded in 0 usecs
[  923.134794] [drm] ib test on ring 2 succeeded in 0 usecs
[  923.134847] [drm] ib test on ring 3 succeeded in 0 usecs
[  923.134876] [drm] ib test on ring 4 succeeded in 0 usecs
[  923.800664] [drm] ib test on ring 5 succeeded
[  947.518162] [drm] PCIE gen 3 link speeds already enabled
[  947.526891] [drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
[  947.526981] radeon 0000:01:00.0: WB enabled
[  947.526984] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0x000000006e0a1d81
[  947.526985] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0x000000004a90d85b
[  947.526986] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0x00000000c77f2878
[  947.526987] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0x00000000ebee22d8
[  947.526988] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0x00000000e264e6f8
[  947.527207] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0x00000000a0346b3c
[  947.627740] radeon 0000:01:00.0: failed VCE resume (-110).
[  947.794252] [drm] ring test on 0 succeeded in 1 usecs
[  947.794257] [drm] ring test on 1 succeeded in 1 usecs
[  947.794260] [drm] ring test on 2 succeeded in 1 usecs
[  947.794268] [drm] ring test on 3 succeeded in 3 usecs
[  947.794273] [drm] ring test on 4 succeeded in 3 usecs
[  947.970267] [drm] ring test on 5 succeeded in 2 usecs
[  947.970279] [drm] UVD initialized successfully.
[  947.970317] [drm] ib test on ring 0 succeeded in 0 usecs
[  947.970348] [drm] ib test on ring 1 succeeded in 0 usecs
[  947.970376] [drm] ib test on ring 2 succeeded in 0 usecs
[  947.970404] [drm] ib test on ring 3 succeeded in 0 usecs
[  947.970430] [drm] ib test on ring 4 succeeded in 0 usecs
[  948.632152] [drm] ib test on ring 5 succeeded
[  983.932387] [UFW BLOCK] IN=wlo1 OUT= MAC=bc:85:56:14:de:01:f4:f2:6d:f1:66:4a:08:00 SRC=198.252.206.25 DST=192.168.1.103 LEN=113 TOS=0x08 PREC=0x60 TTL=49 ID=21005 DF PROTO=TCP SPT=443 DPT=42452 WINDOW=61 RES=0x00 ACK PSH URGP=0 
[ 1019.024983] [drm] PCIE gen 3 link speeds already enabled
[ 1019.034332] [drm] PCIE GART of 2048M enabled (table at 0x00000000001D6000).
[ 1019.034426] radeon 0000:01:00.0: WB enabled
[ 1019.034428] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0x000000006e0a1d81
[ 1019.034430] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0x000000004a90d85b
[ 1019.034431] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0x00000000c77f2878
[ 1019.034432] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0x00000000ebee22d8
[ 1019.034432] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0x00000000e264e6f8
[ 1019.034652] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0x00000000a0346b3c
[ 1019.135218] radeon 0000:01:00.0: failed VCE resume (-110).
[ 1019.302566] [drm] ring test on 0 succeeded in 1 usecs
[ 1019.302571] [drm] ring test on 1 succeeded in 1 usecs
[ 1019.302574] [drm] ring test on 2 succeeded in 1 usecs
[ 1019.302580] [drm] ring test on 3 succeeded in 3 usecs
[ 1019.302585] [drm] ring test on 4 succeeded in 3 usecs
[ 1019.478561] [drm] ring test on 5 succeeded in 2 usecs
[ 1019.478567] [drm] UVD initialized successfully.
[ 1019.478608] [drm] ib test on ring 0 succeeded in 0 usecs
[ 1019.478635] [drm] ib test on ring 1 succeeded in 0 usecs
[ 1019.478695] [drm] ib test on ring 2 succeeded in 0 usecs
[ 1019.478762] [drm] ib test on ring 3 succeeded in 0 usecs
[ 1019.478791] [drm] ib test on ring 4 succeeded in 0 usecs
[ 1020.150800] [drm] ib test on ring 5 succeeded
[ 1177.170700] mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 1177.170701] mce: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
[ 1177.170703] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 1177.170704] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 1177.170705] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 1177.170707] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
[ 1177.171719] mce: CPU0: Package temperature/speed normal
[ 1177.171720] mce: CPU2: Core temperature/speed normal
[ 1177.171721] mce: CPU3: Core temperature/speed normal
[ 1177.171722] mce: CPU1: Package temperature/speed normal
[ 1177.171723] mce: CPU2: Package temperature/speed normal
[ 1177.171723] mce: CPU3: Package temperature/speed normal
[ 1194.393713] intel_powerclamp: Start idle injection to reduce power
[ 1201.947663] NOHZ: local_softirq_pending 202
[ 1201.951673] NOHZ: local_softirq_pending 202
[ 1201.952490] NOHZ: local_softirq_pending 202
[ 1201.952572] NOHZ: local_softirq_pending 202
[ 1201.955657] NOHZ: local_softirq_pending 202
[ 1201.959657] NOHZ: local_softirq_pending 202
[ 1201.963657] NOHZ: local_softirq_pending 202
[ 1201.967656] NOHZ: local_softirq_pending 202
[ 1203.923624] NOHZ: local_softirq_pending 202
[ 1203.927634] NOHZ: local_softirq_pending 202

Notice the following line:

[ 1194.393713] intel_powerclamp: Start idle injection to reduce power

I actually managed to see that the IDLE was stopped just before it shut down. Not that it shows anything. It is just interesting.

I still think it happened because of overheating at 105 C. However, at one point, the computer shut down at ~97. The temperature may have jumped to 105 C within a second. Or maybe it wasn't the CPU but the GPU, although the maximum temperature for the GPU is 120 C. So maybe the cause is something else entirely...
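
To tell the CPU and the GPU apart, all temperature sensors that the kernel exposes can be printed side by side. This is a sketch under assumptions: list_temps is a made-up name, the chip names depend on the drivers (coretemp for Intel CPUs and radeon for this GPU are what I would expect), and the GPU sensor may be unreadable while the discrete card is powered down.

```shell
#!/bin/sh
# list_temps: print "<chip name> <sensor file> <temp in C>" for every
# temperature input under the given hwmon root (default: /sys/class/hwmon).
# Sensors that cannot be read are skipped.
list_temps() {
    root=${1:-/sys/class/hwmon}
    for temp in "$root"/hwmon*/temp*_input; do
        raw=$(cat "$temp" 2>/dev/null) || continue
        [ -n "$raw" ] || continue
        dir=$(dirname "$temp")
        name=$(cat "$dir/name" 2>/dev/null) || name=unknown
        echo "$name $(basename "$temp") $(( raw / 1000 ))"
    done
}

# Usage: print all sensors once per second while the game is running
#   while :; do list_temps; echo; sleep 1; done
```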

vladkinoman commented 4 years ago

Speaking of "no valid tables", people say this:

It is fine not to have a pre configured tables. Thermald will still work for CPU temperature control.

And it really works. Using this tutorial, I checked whether thermald worked on my PC:

- Start service
    sudo systemctl start thermald.service
- Get status
    sudo systemctl status thermald.service
- Stop service
    sudo systemctl stop thermald.service

So, I get this:

$ sudo systemctl status thermald.service
● thermald.service - Thermal Daemon Service
     Loaded: loaded (/lib/systemd/system/thermald.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2020-08-14 14:03:35 EEST; 1h 35min left
   Main PID: 596 (thermald)
      Tasks: 2 (limit: 9268)
     Memory: 5.9M
     CGroup: /system.slice/thermald.service
             └─596 /usr/sbin/thermald --no-daemon --dbus-enable