void-linux / void-packages

The Void source packages collection
https://voidlinux.org

Cannot shutdown AMD laptop with linux6.6-6.6.31_1 #50417

Closed Sqvid closed 4 months ago

Sqvid commented 6 months ago

Is this a new report?

Yes

System Info

Void 6.6.31_1 x86_64 AuthenticAMD uptodate rrFFFF

Package(s) Affected

linux6.6-6.6.31_1

Does a report exist for this bug with the project's home (upstream) and/or another distro?

No response

Expected behaviour

After runit winds down services the laptop should switch off.

Actual behaviour

Upgrading to kernel 6.6.31_1 prevents the computer from shutting down. Runit seems to stop all services correctly and the screen goes black as expected; however, the keyboard backlight and power light stay on and the fans still run.

System info:

CPU: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
GPU: Nvidia GeForce RTX 4060 (disabled and powered off)
WM: Sway

Downgrading to 6.6.30_1 or below fixes this issue and the laptop shuts down immediately as expected.

Not sure how to extract useful debug info but would appreciate any tips.

Steps to reproduce

  1. Use an affected kernel (though it may be hardware dependent).
  2. sudo poweroff now.
  3. Read runit messages of services being stopped.
  4. Screen goes black.
  5. Laptop stays on.
  6. Cry.
blacklightpy commented 6 months ago

I have kernel 6.6.29_1 and my system did not power off when I tried shutting it down with my AGS widgets. I didn't debug whether it was the widgets or the system though, and I'm not in the mood to shut down now.

1is7ac3 commented 6 months ago

I had this problem on my two Nvidia Optimus laptops (Intel + Nvidia and AMD + Nvidia) while I was setting up Void Linux again, so I only saw the error while I was using Nouveau. Once I finished configuring the system with the Nvidia drivers, it no longer appeared.

funk443 commented 6 months ago

I encountered this problem after I updated the kernel this afternoon. Exactly the same problem as @Sqvid.

My system spec:

I usually clear caches after an update, so I cannot downgrade the kernel. But I tried installing linux6.9 and it works just fine.

tomalexander commented 6 months ago

I'm on Arch Linux but I see the same behavior. I went into my package cache and reverted to test:

linux-lts-6.6.30-2-x86_64.pkg.tar.zst : shuts down fine
linux-lts-6.6.31-1-x86_64.pkg.tar.zst : does not shut down

artluix commented 6 months ago

Facing the same issue on a laptop with an 8845HS during installation (no UI is installed).

thenbe commented 6 months ago

I found this thread through a search engine (I'm on NixOS, not Void Linux), but I have the same issue on NixOS on a Framework AMD laptop (AMD Ryzen 7 7840U). Shutdown appears to work at first, with the systemd logs zooming past like always. Then the screen goes black, but the power "stays on" (the power LED stays lit) indefinitely. I need to hold the power button for several seconds to force the shutdown to go through. Rolling back the kernel from 6.6.31 to 6.6.30 fixes the issue.

superm1 commented 6 months ago

Anyone affected by this can you check two things:

  1. Is 6.9.1 affected?
  2. If 6.9.1 isn't affected it's probably an incomplete backport. It's best to bisect 6.6.30 to 6.6.31 to find which commit caused it.
superm1 commented 6 months ago

CC @knurd

knurd commented 6 months ago

Warning, the following is a totally wild guess. But there was one report about a shutdown problem caused by a commit that went into 6.6.31 as well; wonder if that might be related. https://lore.kernel.org/all/CAE4VaREzY%2Ba2PvQJYJbfh8DwB4OP7kucZG-e28H22xyWob1w_A@mail.gmail.com/

lectrode commented 6 months ago

Bisected between 6.6.30 and 6.6.31 to find the problem commit:

$ git bisect start --no-checkout
status: waiting for both good and bad commits
$ git bisect good 5697d159afef8c475f13a0b7b85f09bd4578106c
status: waiting for bad commit, 1 good commit known
$ git bisect bad e3d332aaf898ed755b29c8cdf59be2cfba1cac4b
Bisecting: 154 revisions left to test after this (roughly 7 steps)
[6466a0f6d235c8a18c602cb587160d7e49876db9] uio_hv_generic: Don't free decrypted memory
$ git bisect good
Bisecting: 77 revisions left to test after this (roughly 6 steps)
[7a54e5052bde582fd0e7677334fe7a5be92e242c] usb: gadget: uvc: use correct buffer size when parsing configfs lists
$ git bisect bad
Bisecting: 38 revisions left to test after this (roughly 5 steps)
[2ee2fc6786bc5ff3c24798624ea3806c9662c26f] selftests/net: convert test_bridge_neigh_suppress.sh to run it in unique namespace
$ git bisect good
Bisecting: 19 revisions left to test after this (roughly 4 steps)
[7019a64165186fd3fb5ba928a6347558dc560093] net: dsa: mv88e6xxx: add phylink_get_caps for the mv88e6320/21 family
$ git bisect good
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[0c7ed3ed35eec9138b88d42217b5a6b9a62bda4d] drm/amdgpu: once more fix the call oder in amdgpu_ttm_move() v2
$ git bisect bad
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[95ca7c90eaf5ea8a8460536535101e3e81160e2a] gpiolib: cdev: Fix use after free in lineinfo_changed_notify
$ git bisect bad
Bisecting: 2 revisions left to test after this (roughly 1 step)
[fa2d2e2d8eae03acf49229793f2a6fddede92c4d] drm/meson: dw-hdmi: add bandgap setting for g12
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[6c83a8f236ece78d5c2e60ae3dcfd1a64509410e] dm/amd/pm: Fix problems with reboot/shutdown for some SMU 13.0.4/13.0.11 users
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[1b9e8de483bcc230f6e922bdfa9d1c186c27dd3b] drm/connector: Add \n to message about demoting connector force-probes
$ git bisect good
6c83a8f236ece78d5c2e60ae3dcfd1a64509410e is the first bad commit
commit 6c83a8f236ece78d5c2e60ae3dcfd1a64509410e
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:   Thu May 2 13:32:17 2024 -0500

    dm/amd/pm: Fix problems with reboot/shutdown for some SMU 13.0.4/13.0.11 users

    [ Upstream commit cd94d1b182d2986378550c9087571991bfee01d4 ]

    Limit the workaround introduced by commit 31729e8c21ec ("drm/amd/pm: fixes
    a random hang in S4 for SMU v13.0.4/11") to only run in the s4 path.

    Cc: Tim Huang <Tim.Huang@amd.com>
    Fixes: 31729e8c21ec ("drm/amd/pm: fixes a random hang in S4 for SMU v13.0.4/11")
    Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3351
    Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0_4_ppt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
superm1 commented 6 months ago

Ok... Does it also fail on 6.9?

tomalexander commented 6 months ago

> It's best to bisect 6.6.30 to 6.6.31 to find which commit caused it.

I have bisected it to this commit in https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux.git :

commit 6c83a8f236ece78d5c2e60ae3dcfd1a64509410e (HEAD)
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:   Thu May 2 14:32:17 2024

    dm/amd/pm: Fix problems with reboot/shutdown for some SMU 13.0.4/13.0.11 users

    [ Upstream commit cd94d1b182d2986378550c9087571991bfee01d4 ]

    Limit the workaround introduced by commit 31729e8c21ec ("drm/amd/pm: fixes
    a random hang in S4 for SMU v13.0.4/11") to only run in the s4 path.

    Cc: Tim Huang <Tim.Huang@amd.com>
    Fixes: 31729e8c21ec ("drm/amd/pm: fixes a random hang in S4 for SMU v13.0.4/11")
    Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3351
    Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>
lectrode commented 6 months ago

The issue is not present on 6.9.1. I would agree it looks like an incomplete/failed backport. @tomalexander good to have confirmation

superm1 commented 6 months ago

Interesting. It's supposed to be quite self-contained.

Can you guys please confirm your firmware versions from debugfs?

/sys/kernel/debug/dri/0/amdgpu_firmware_info
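
The DRI index is not always 0 (a later reply in this thread reads `dri/1` and mentions a `128` node as well), so a small helper that lists every candidate file can save some guessing. This is only a sketch: the function name `list_amdgpu_firmware_info` is made up for illustration, and it assumes debugfs is already mounted and that the caller has root access.

```shell
# List every amdgpu_firmware_info file below a debugfs root.
# The root defaults to /sys/kernel/debug; pass another path to override it.
# Hypothetical helper, not part of any package.
list_amdgpu_firmware_info() {
    root=${1:-/sys/kernel/debug}
    for f in "$root"/dri/*/amdgpu_firmware_info; do
        # An unmatched glob stays literal, so check the file really exists.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run as root, `list_amdgpu_firmware_info` prints one path per line, and each path can then be passed to `cat`.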

tomalexander commented 6 months ago

It's going to take me a while to confirm 6.9.1 because the release version of zfs only supports up to 6.8. I'll have to try the upstream zfs git and if that doesn't work, swap out my SSD and make a new non-zfs install.

My firmware on Arch Linux's build of 6.6.31-1-lts:

$ doas cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x00000027
PFP feature version: 35, firmware version: 0x00000030
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x00000080
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x0000000f
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x00000027
IMU feature version: 0, firmware version: 0x0b012d00
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648339, firmware version: 0x210000d3
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x1700003e
TA DTM feature version: 0x00000000, firmware version: 0x12000016
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x004c5200 (76.82.0)
SDMA0 feature version: 60, firmware version: 0x00000011
VCN feature version: 0, firmware version: 0x0711300d
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x08003a00
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x00000073
MES feature version: 1, firmware version: 0x00000059
VBIOS version: 113-PHXGENERIC-001
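
For anyone comparing dumps, the SMC line is the one tied to the SMU firmware that the bisected commit concerns. The sketch below pulls the hex word out of a saved dump and decodes it, assuming the low three bytes map to the dotted fields, which is what the sample line above suggests (0x4c = 76, 0x52 = 82, 0x00 = 0). The function names `smc_word` and `smc_version` are illustrative only.

```shell
# Extract the SMC firmware word (e.g. 0x004c5200) from a saved dump file.
smc_word() {
    sed -n 's/^SMC feature version:.*firmware version: \(0x[0-9a-fA-F]*\).*/\1/p' "$1"
}

# Decode a firmware word into "major.minor.debug".
# Assumption: bytes 2, 1 and 0 of the word are the three dotted fields.
smc_version() {
    printf '%d.%d.%d\n' \
        "$(( ($1 >> 16) & 0xff ))" \
        "$(( ($1 >> 8) & 0xff ))" \
        "$(( $1 & 0xff ))"
}
```

With the dump saved to a file, `smc_version "$(smc_word dump.txt)"` should reproduce the value in parentheses on the SMC line.
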
tomalexander commented 6 months ago

Never mind, lucked out: upstream zfs git works for 6.9.1. I have confirmed that 6.9.1 shuts down properly.

lectrode commented 6 months ago

Firmware as reported on Manjaro running 6.6.30 (128 also exists under dri, but reports identical info):

# mount -t debugfs none /sys/kernel/debug
# cat /sys/kernel/debug/dri/1/amdgpu_firmware_info 
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x00000027
PFP feature version: 35, firmware version: 0x00000030
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x00000080
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x0000000f
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x00000027
IMU feature version: 0, firmware version: 0x0b012d00
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648339, firmware version: 0x210000d3
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x1700003e
TA DTM feature version: 0x00000000, firmware version: 0x12000016
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x004c5200 (76.82.0)
SDMA0 feature version: 60, firmware version: 0x00000011
VCN feature version: 0, firmware version: 0x0711300d
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x08003a00
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x00000073
MES feature version: 1, firmware version: 0x00000059
VPE feature version: 0, firmware version: 0x00000000
VBIOS version: 113-PHXGENERIC-001
superm1 commented 6 months ago

Can you guys try this? Ideally please try it both on 6.9.y and 6.6.y:

0001-Add-hopefully-a-solution-for-shutdown-regression.PATCH

lectrode commented 6 months ago

That patch works on 6.6.31

Will update once I've tested on 6.9.1 and 6.1.91 (it is also an issue on 6.1.91 - have not tested older LTS kernels).

lectrode commented 6 months ago

Update: 6.9.1 continues to work with that patch, and 6.1.91 is fixed with it.

Also tested 5.10.217 and 5.15.159, but neither of those has the issue in question.
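
To check a running kernel against what testers reported here, a lookup like the one below may help. It is a rough sketch: the name `thread_status` is made up, it covers only the unpatched versions named in this thread, and it says nothing about later point releases or patched builds.

```shell
# Map a kernel version string to what this thread reported for it (unpatched).
# Packaging suffixes such as "_1" or "-1-lts" are stripped before matching.
thread_status() {
    case ${1%%[-_]*} in
        6.6.31|6.1.91)
            echo "affected" ;;
        6.6.30|6.9.1|5.10.217|5.15.159)
            echo "not affected" ;;
        *)
            echo "untested in this thread" ;;
    esac
}
```

For example, `thread_status "$(uname -r)"` classifies the currently booted kernel.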

superm1 commented 6 months ago

Thanks! I've posted it to the mailing list for review:

https://lore.kernel.org/amd-gfx/20240526125908.2742-1-mario.limonciello@amd.com/T/#u

If anyone else wants to add a Reported-by or Tested-by tag, a link, or anything else, please respond to that thread.

MIvanchev commented 6 months ago

I just wanted to report that this kernel also broke standby for me. I rarely shut down the PC but put it in standby regularly, and now that works only twice before it reports "Another instance of zzz is already running".

superm1 commented 6 months ago

> I just wanted to report that this kernel also broke standby for me. I rarely shut down the PC but put it in standby regularly, and now that works only twice before it reports "Another instance of zzz is already running".

Does it bisect to the same result? If so, please try the fix linked above. If it bisects to a different commit, it is probably a different issue.

Piraty commented 4 months ago

should be fixed in

thanks @superm1 and all testers!