void-linux / void-packages

The Void source packages collection
https://voidlinux.org

screen frozen linux6.6 and 6.7 #48473

Open levdopa opened 7 months ago

levdopa commented 7 months ago

Is this a new report?

Yes

System Info

Void 6.6.11_1 x86_64 musl

Package(s) Affected

linux6.6 linux6.7

Does a report exist for this bug with the project's home (upstream) and/or another distro?

No response

Expected behaviour

I turn on my computer (Alienware m15 R4, i7-10870H, RTX 3060) and Void boots like normal.

Actual behaviour

(Attached image: IMG_8873)

I am permanently stuck on this screen after grub.

Booting from 6.5.13 works normally.

Booting from 6.7.2 is broken.

Steps to reproduce

Install Void using either chroot or void-installer (I did both)
sign in as root
xbps-install -Suy xbps
xbps-install -Suy
reboot

and I am now stuck. (Literally no other commands, just those)

abenson commented 7 months ago

What troubleshooting steps have you taken? From the description, it sounds like you're using nouveau?

What happens if you boot with nomodeset=1?

levdopa commented 7 months ago

The problem happens whether or not nouveau is installed.

I tried blacklisting nouveau in /etc/modprobe.d/blacklist.conf, but it is still frozen.

I tried booting with nomodeset=1 with nouveau both installed and uninstalled; both are still frozen.

classabbyamp commented 7 months ago

nouveau can't be not installed, it's a kernel module

levdopa commented 7 months ago

I mean xf86-video-nouveau, I just shortened it to nouveau

nau5ea commented 7 months ago

I experienced this issue on my Core 2 Duo machine with no GPU on linux6.6.11 as well

loukamb commented 7 months ago

I had this problem a month-ish ago and it was due to nouveau. Blacklisting the module correctly prior to a system upgrade completely fixes the problem, but you will have to use proprietary drivers as a replacement until this is fixed (unless you don't care about not having proper drivers). You can see the relevant discussion here: https://old.reddit.com/r/voidlinux/comments/18w0mq9/upgrade_to_kernel_668_hangs_the_os_at_boot/

I was able to solve this by booting from the latest ISO, installing the system from local packages, blacklisting nouveau through dracut and the other initramfs mechanism (you can find instructions for both from the handbook), then performing a full system upgrade. The upgrade automatically rebuilds the ramfs configuration, so there's no need to do anything manually. That worked across multiple installations as well. As for why blacklisting didn't work for you earlier, make sure to blacklist before installing the new kernel and doing a system upgrade, or otherwise you will have to reconfigure ramfs manually.
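
For reference, the two blacklist mechanisms mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact steps used in this thread: the file names and the `root` prefix argument are assumptions for illustration, while `blacklist nouveau` and `omit_drivers+=" nouveau "` are the usual modprobe and dracut directives.

```shell
#!/bin/sh
# Sketch: write nouveau blacklist entries for modprobe and dracut.
# The first argument is a hypothetical root prefix: empty for the running
# system, or e.g. /mnt when preparing a mounted install from a live image.
# The file name nouveau-blacklist.conf is an arbitrary choice.
write_nouveau_blacklist() {
    root="${1:-}"
    mkdir -p "$root/etc/modprobe.d" "$root/etc/dracut.conf.d"
    # Keeps modprobe from auto-loading the module for the device
    printf 'blacklist nouveau\n' > "$root/etc/modprobe.d/nouveau-blacklist.conf"
    # Keeps dracut from packing the module into the initramfs
    printf 'omit_drivers+=" nouveau "\n' > "$root/etc/dracut.conf.d/nouveau-blacklist.conf"
}
```

Run as root with `write_nouveau_blacklist ""` before the upgrade. If the new kernel is already installed, the initramfs has to be regenerated afterwards, for example with `xbps-reconfigure -fa`.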

Sapein commented 7 months ago

I'm having a similar -- if not the same -- issue on my install as well.

This issue does not occur on kernels 6.5.5_2, 6.5.12_1, and 6.1.29_1. It does occur on 6.6.8, 6.6.11, and 6.6.16, at least in my testing.

Booting with nomodeset=1 does allow the system to boot, but it breaks anything graphical: my two monitors are treated as one, I can't actually set a resolution with X, and sway still refuses to start (whereas on the 6.5 kernels it does start. It also does not start on the 6.1.29 kernel, but with a different error).

My system information is as follows:

I included the CPU as it does not include integrated graphics, IIRC.

Using the Proprietary Nvidia drivers does work, but I ran into the issue because sway won't start with them it seems, so I wanted to switch to the Nouveau ones to try out sway.

Brixy commented 6 months ago

Hi guys,

Experienced the same issue.

I ran the live image, mounted my partitions and installed an older kernel series (6.5)

Then ran

xbps-reconfigure -fa
update-grub

This solved the problem by chance. Maybe the kernel installation did some clever trick?! Anyway, void now also boots with new kernels; tested with 6.6.21.

thomasxg commented 5 months ago

I have the same issue when installing/running Void, so I'm stuck on the 6.5.13 kernel for now (NVIDIA GTX 1060 3G).

I did manage to log in to the "frozen" machine via ssh and take a look at the log. I'll post the trace from booting the 6.8.1 kernel below; I get the same error from the 6.6.22 kernel:

[    2.231417] ------------[ cut here ]------------
[    2.231418] kernel BUG at include/linux/scatterlist.h:187!
[    2.231422] fbcon: Taking over console
[    2.231429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    2.231432] CPU: 3 PID: 333 Comm: systemd-udevd Not tainted 6.8.1_1 #1
[    2.231436] Hardware name: System manufacturer System Product Name/H170I-PRO, BIOS 3805 05/16/2018
[    2.231440] RIP: 0010:sg_init_one+0x77/0x80
[    2.231446] Code: 00 01 83 e1 03 a8 03 75 23 83 e2 01 75 20 48 09 c8 41 89 6c 24 08 49 89 04 24 41 89 5c 24 0c 5b 5d 41 5c 41 5d c3 cc cc cc cc <0f> 0b 0f 0b 0f 0b 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90
[    2.231452] RSP: 0018:ffffb221804c78d0 EFLAGS: 00010246
[    2.231456] RAX: 0000000000000000 RBX: 0000000000005000 RCX: 0000000000000027
[    2.231459] RDX: 0000000000000036 RSI: 0000000000000000 RDI: ffffb22200599000
[    2.231462] RBP: 0000000000005000 R08: 0000000000000000 R09: 0000000000000000
[    2.231466] R10: ffff96bf02610058 R11: 0000000000000000 R12: ffff96bf02610058
[    2.231469] R13: ffffb22180599000 R14: ffffb22180265000 R15: ffffb22180265100
[    2.231472] FS:  00007fa35e14e740(0000) GS:ffff96c226d80000(0000) knlGS:0000000000000000
[    2.231476] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.231479] CR2: 00007ffe8930eff8 CR3: 0000000100d5e005 CR4: 00000000003706f0
[    2.231482] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.231485] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.231488] Call Trace:
[    2.231491]  <TASK>
[    2.231493]  ? die+0x36/0x90
[    2.231498]  ? do_trap+0xda/0x100
[    2.231501]  ? sg_init_one+0x77/0x80
[    2.231505]  ? do_error_trap+0x6a/0x90
[    2.231508]  ? sg_init_one+0x77/0x80
[    2.231511]  ? exc_invalid_op+0x50/0x70
[    2.231515]  ? sg_init_one+0x77/0x80
[    2.231518]  ? asm_exc_invalid_op+0x1a/0x20
[    2.231524]  ? sg_init_one+0x77/0x80
[    2.231529]  nvkm_firmware_ctor+0x1fd/0x260 [nouveau]
[    2.231657]  nvkm_falcon_fw_ctor_hs+0x113/0x360 [nouveau]
[    2.231768]  gm200_acr_hsfw_ctor+0xce/0xf0 [nouveau]
[    2.231878]  gp102_acr_load+0x206/0x370 [nouveau]
[    2.231989]  nvkm_acr_new_+0x208/0x2f0 [nouveau]
[    2.232098]  nvkm_device_ctor+0xd74/0x4610 [nouveau]
[    2.232239]  nvkm_device_pci_new+0x101/0x2c0 [nouveau]
[    2.232379]  nouveau_drm_probe+0xd5/0x280 [nouveau]
[    2.232513]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[    2.232518]  local_pci_probe+0x42/0xa0
[    2.232522]  pci_device_probe+0xc1/0x220
[    2.232527]  really_probe+0x19b/0x3e0
[    2.232532]  ? __pfx___driver_attach+0x10/0x10
[    2.232536]  __driver_probe_device+0x78/0x160
[    2.232540]  driver_probe_device+0x1f/0x90
[    2.232545]  __driver_attach+0xd2/0x1c0
[    2.232549]  bus_for_each_dev+0x85/0xd0
[    2.232553]  bus_add_driver+0x116/0x220
[    2.232557]  driver_register+0x59/0x100
[    2.232561]  ? __pfx_nouveau_drm_init+0x10/0x10 [nouveau]
[    2.232662]  do_one_initcall+0x58/0x320
[    2.232668]  do_init_module+0x60/0x240
[    2.232672]  __do_sys_init_module+0x17f/0x1b0
[    2.232677]  do_syscall_64+0x88/0x180
[    2.232681]  ? fpregs_assert_state_consistent+0x26/0x50
[    2.232687]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[    2.232691] RIP: 0033:0x7fa35e365c9a
[    2.232695] Code: 48 8b 0d 91 21 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5e 21 0d 00 f7 d8 64 89 01 48
[    2.232701] RSP: 002b:00007ffe89323e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    2.232706] RAX: ffffffffffffffda RBX: 00007fa35ce00010 RCX: 00007fa35e365c9a
[    2.232709] RDX: 00007fa35e45aafd RSI: 0000000000730769 RDI: 00007fa35ce00010
[    2.232712] RBP: 0000555f37eb4570 R08: 0000000000007b80 R09: 0000000000000000
[    2.232715] R10: 00007fa35e438b20 R11: 0000000000000246 R12: 00007fa35e45aafd
[    2.232718] R13: 0000000000020000 R14: 0000555f37eaab00 R15: 0000000000000001
[    2.232723]  </TASK>
[    2.232725] Modules linked in: sd_mod nouveau(+) drm_gpuvm drm_exec gpu_sched i2c_algo_bit drm_display_helper cec ahci crct10dif_pclmul libahci rc_core crc32_pclmul polyval_clmulni xhci_pci polyval_generic libata drm_kms_helper gf128mul ghash_clmulni_intel xhci_pci_renesas sha512_ssse3 drm_ttm_helper sha256_ssse3 ttm sha1_ssse3 mxm_wmi aesni_intel agpgart xhci_hcd crypto_simd drm scsi_mod cryptd usbcore usb_common scsi_common video wmi button dm_mirror dm_region_hash dm_log dm_mod btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_intel
[    2.232773] ---[ end trace 0000000000000000 ]---

Perhaps someone else can confirm the same error on their machine. It seems to be something related to firmware loading?

6.8.1.log
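
For anyone wanting to compare against the trace above, one way to pull just the BUG block out of a kernel log is sketched below. The function name is made up for illustration; it only relies on the "cut here" and "end trace" marker lines that the kernel prints around the report.

```shell
#!/bin/sh
# Sketch: print only the lines between the kernel's "cut here" and
# "end trace" markers, reading a kernel log on stdin.
# extract_kernel_trace is a hypothetical helper name.
extract_kernel_trace() {
    sed -n '/cut here/,/end trace/p'
}
```

Typical use would be `dmesg | extract_kernel_trace` on the affected machine, which also works from an ssh session if the screen is frozen.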

loukamb commented 5 months ago

@thomasxg I no longer have it but when I faced this issue months ago it had a very similar trace to what you just posted. The issue is definitely GPU-related, and doesn't happen on proprietary drivers, only nouveau.

With the release of the new live image on March 14, however, the problem has become worse and could impact adoption. With the previous live image, you could boot into live Void without any issues, install the OS to disk using local packages, reboot, then blacklist nouveau right before performing an update. That's what I did to get Void working on my Nvidia machine (2080 Ti). However, the new live image includes the regression, meaning you cannot even reach a terminal after GRUB without explicitly blacklisting the nouveau driver and disabling modesetting through kernel flags when launching the live image. I had to do this recently when I reinstalled Void. It is simple in practice, but difficult to figure out when you don't even have logs or any output telling you what is failing.

nezos commented 4 months ago

Had the same problem hanging at "loading initial ramdisk".

My solution was to install the mainline kernel and the nvidia non-free driver. When I installed only the mainline kernel, the problem came down to nouveau, so the nvidia driver fixed it completely.

loukamb commented 3 weeks ago

Just pointing out that this issue STILL exists on latest kernel and nouveau. What the heck.

classabbyamp commented 3 weeks ago

does nouveau.noaccel=1 fix the issue? (or does anything else in this thread look relevant to this? https://gitlab.freedesktop.org/drm/nouveau/-/issues/319#note_2521669)
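
A rough sketch of how a parameter like this could be added persistently, for anyone who can still boot an older kernel. The helper name is hypothetical, and it assumes GNU or busybox `sed -i`, a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, and a parameter without `|` or `&` characters. For a one-off test, editing the kernel line from the GRUB menu does the same thing without touching any file.

```shell
#!/bin/sh
# Sketch: append one kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# add_kernel_param is a hypothetical helper, not a Void or GRUB tool.
# Usage: add_kernel_param nouveau.noaccel=1 [/etc/default/grub]
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    # Do nothing if the parameter is already on the line
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}
```

After running it as root, `update-grub` regenerates the boot menu so the parameter takes effect on the next boot.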

bisecting the kernel with void's kconfig would be very useful

classabbyamp commented 3 weeks ago

one user reported arch's livecd worked fine. the only difference between arch's and void's kconfig w/r/t nouveau is CONFIG_DRM_NOUVEAU_SVM=y on arch (=n on void). might be related
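
To check how that option is set on a given install, something like the sketch below could be used. The helper name is hypothetical, and the default path assumes the distro installs the kernel config as /boot/config-<version>. Where the config is only exposed as /proc/config.gz, decompress it to a file first and pass that path as the second argument.

```shell
#!/bin/sh
# Sketch: print the line that sets (or unsets) a kernel config option.
# kconfig_value is a hypothetical helper name.
# Usage: kconfig_value CONFIG_DRM_NOUVEAU_SVM [config-file]
kconfig_value() {
    option="$1"
    config="${2:-/boot/config-$(uname -r)}"
    # Matches "OPTION=..." as well as "# OPTION is not set"
    grep -E "^(# )?${option}(=| is not set)" "$config"
}
```

Running `kconfig_value CONFIG_DRM_NOUVEAU_SVM` on both systems would confirm whether the difference described above is actually present in the installed kernels.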