pop-os / pop

A project for managing all Pop!_OS sources
https://system76.com/pop

Kernel bug bringing down whole system: UBSAN: array-index-out-of-bounds #3238

Open obilodeau opened 8 months ago

obilodeau commented 8 months ago

Distribution (run cat /etc/os-release):

DISTRIB_ID=Pop
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Pop!_OS 22.04 LTS"

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

linux-image-generic: 6.6.10-76060610.202401051437~1704728131~22.04~24d69e2

Issue/Bug Description:

The network stops. The GUI still partly works, but you can't launch new applications or do anything in existing ones. It's as if only basic GUI activity that doesn't involve the kernel (or syscalls) still works. The system is unusable: the mouse moves and you can alt-tab, but nothing else responds.

This then leads to a "soft lockup", where a given CPU is stuck.

The only way out is a reboot.

iwlwifi problem log:

Mar 01 10:24:20 kernel: ================================================================================
Mar 01 10:24:20 kernel: UBSAN: array-index-out-of-bounds in /build/linux-CeJFpv/linux-6.6.10/drivers/net/wireless/intel/iwlwifi/queue/tx.c:1580:39
Mar 01 10:24:20 kernel: index 512 is out of range for type 'iwl_txq *[512]'
Mar 01 10:24:20 kernel: CPU: 3 PID: 1000 Comm: irq/189-iwlwifi Tainted: P        W  OE      6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2
Mar 01 10:24:20 kernel: Hardware name: Dell Inc. XPS 15 9530/0N92RM, BIOS 1.9.0 11/13/2023
Mar 01 10:24:20 kernel: Call Trace:
Mar 01 10:24:20 kernel:  <IRQ>
Mar 01 10:24:20 kernel:  dump_stack_lvl+0x48/0x70
Mar 01 10:24:20 kernel:  dump_stack+0x10/0x20
Mar 01 10:24:20 kernel:  __ubsan_handle_out_of_bounds+0xc6/0x110
Mar 01 10:24:20 kernel:  iwl_txq_reclaim+0x5bd/0x5d0 [iwlwifi]
Mar 01 10:24:20 kernel:  ? ttwu_queue_wakelist+0x135/0x1c0
Mar 01 10:24:20 kernel:  iwl_mvm_rx_tx_cmd_single+0xf0/0xa30 [iwlmvm]
Mar 01 10:24:20 kernel:  ? iwl_mvm_rx_tx_cmd_single+0xf0/0xa30 [iwlmvm]
Mar 01 10:24:20 kernel:  iwl_mvm_rx_tx_cmd+0x17e/0x1f0 [iwlmvm]
Mar 01 10:24:20 kernel:  ? iwl_mvm_rx_tx_cmd+0x17e/0x1f0 [iwlmvm]
Mar 01 10:24:20 kernel:  iwl_mvm_rx_common+0x13d/0x4d0 [iwlmvm]
Mar 01 10:24:20 kernel:  iwl_mvm_rx_mq+0x7e/0x120 [iwlmvm]
Mar 01 10:24:20 kernel:  iwl_pcie_rx_handle_rb.constprop.0+0xb4/0x530 [iwlwifi]
Mar 01 10:24:20 kernel:  iwl_pcie_rx_handle+0x20b/0x640 [iwlwifi]
Mar 01 10:24:20 kernel:  iwl_pcie_napi_poll_msix+0x32/0x100 [iwlwifi]
Mar 01 10:24:20 kernel:  __napi_poll+0x30/0x1f0
Mar 01 10:24:20 kernel:  net_rx_action+0x181/0x2e0
Mar 01 10:24:20 kernel:  __do_softirq+0xd9/0x349
Mar 01 10:24:20 kernel:  ? __pfx_irq_thread_fn+0x10/0x10
Mar 01 10:24:20 kernel:  do_softirq.part.0+0x41/0x80
Mar 01 10:24:20 kernel:  </IRQ>
Mar 01 10:24:20 kernel:  <TASK>
Mar 01 10:24:20 kernel:  __local_bh_enable_ip+0x72/0x80
Mar 01 10:24:20 kernel:  iwl_pcie_irq_rx_msix_handler+0xd7/0x1a0 [iwlwifi]
Mar 01 10:24:20 kernel:  irq_thread_fn+0x21/0x70
Mar 01 10:24:20 kernel:  irq_thread+0xf8/0x1c0
Mar 01 10:24:20 kernel:  ? __pfx_irq_thread_dtor+0x10/0x10
Mar 01 10:24:20 kernel:  ? __pfx_irq_thread+0x10/0x10
Mar 01 10:24:20 kernel:  kthread+0xef/0x120
Mar 01 10:24:20 kernel:  ? __pfx_kthread+0x10/0x10
Mar 01 10:24:20 kernel:  ret_from_fork+0x44/0x70
Mar 01 10:24:20 kernel:  ? __pfx_kthread+0x10/0x10
Mar 01 10:24:20 kernel:  ret_from_fork_asm+0x1b/0x30
Mar 01 10:24:20 kernel:  </TASK>
Mar 01 10:24:20 kernel: ================================================================================
Mar 01 10:24:21 kernel: sched: RT throttling activated
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: Error sending SCAN_CFG_CMD: time out after 2000ms.
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: Current CMD queue read_ptr 6748 write_ptr 6749
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: Start IWL Error Log Dump:
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: Transport status: 0x0000004A, valid: 6
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: Loaded firmware version: 83.e8f84e98.0 so-a0-gf-a0-83.ucode
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000084 | NMI_INTERRUPT_UNKNOWN       
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00008210 | trm_hw_status0
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000000 | trm_hw_status1
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x004DB338 | branchlink2
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x004D119A | interruptlink1
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x004D119A | interruptlink2
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x0000B790 | data1
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x01000000 | data2
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000000 | data3
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x368152C3 | beacon time
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x2B567D12 | tsf low
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x0000010B | tsf hi
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000000 | time gp1
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0xC99B2B18 | time gp2
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000001 | uCode revision type
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000053 | uCode version major
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0xE8F84E98 | uCode version minor
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00000370 | hw version
Mar 01 10:24:23 kernel: iwlwifi 0000:00:14.3: 0x00480002 | board version
[...]
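
For readers skimming the trace, the key line is the UBSAN message itself. The sketch below is only an illustration, not part of the driver or of any existing tool: it parses that one message to show that the reported index equals the array length, i.e. it is exactly one past the last valid slot. The regular expression and the function name `parse_ubsan_bounds` are hypothetical and only cover this message shape.

```python
import re

# Message copied verbatim from the log above.
UBSAN_LINE = "index 512 is out of range for type 'iwl_txq *[512]'"

# Hypothetical pattern; it only handles this one UBSAN message shape.
PATTERN = re.compile(r"index (-?\d+) is out of range for type '(.+?)\[(\d+)\]'")


def parse_ubsan_bounds(line):
    """Return (index, element_type, array_length), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    index, elem_type, length = match.groups()
    return int(index), elem_type.strip(), int(length)


index, elem_type, length = parse_ubsan_bounds(UBSAN_LINE)
# How far past the last valid index (length - 1) the access landed.
overshoot = index - (length - 1)
print(f"{elem_type}: valid indices 0..{length - 1}, got {index} (overshoot {overshoot})")
```

For the line above this reports valid indices 0..511 and an overshoot of 1. It says nothing about why the driver produced queue index 512.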

Soft lockup log (with two lines of context at the top to show the time gap since the previous problem):

Mar 01 10:24:23 kernel: ieee80211 phy0: Hardware restart was requested
Mar 01 10:24:30 kernel: iwlwifi 0000:00:14.3: Queue 4 is stuck 55356 55656
Mar 01 10:24:46 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [irq/189-iwlwifi:1000]
Mar 01 10:24:46 kernel: Modules linked in: hid_plantronics hid_logitech_hidpp hid_logitech_dj tls ccm rfcomm cmac algif_hash algif_skcipher af_alg snd_seq_dummy snd_hrtimer zstd nvidia_uvm(POE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) snd_ctl_led snd_soc_skl_hda_dsp snd_soc_intel_hda_dsp_common snd_soc_hdac_hdmi snd_sof_probes snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_soc_dmic snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_generic_allocation soundwire_bus bnep zram snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine intel_uncore_frequency snd_hda_intel intel_uncore_frequency_common snd_intel_dspcfg x86_pkg_temp_thermal snd_intel_sdw_acpi intel_powerclamp snd_hda_scodec_cs35l41_spi dell_laptop coretemp snd_hda_codec snd_hda_scodec_cs35l41_i2c snd_hda_scodec_cs35l41 uvcvideo snd_hda_cs_dsp_ctls
Mar 01 10:24:46 kernel:  btusb pmt_telemetry iwlmvm snd_usb_audio kvm_intel snd_hda_core videobuf2_vmalloc btrtl cs_dsp snd_usbmidi_lib uvc mei_hdcp mei_pxp intel_rapl_msr pmt_class nvidia_drm(POE) videobuf2_memops binfmt_misc snd_soc_cs35l41_lib snd_ump snd_hwdep btintel mac80211 kvm nvidia_modeset(POE) videobuf2_v4l2 snd_seq_midi btbcm hid_sensor_als snd_seq_midi_event btmtk dell_wmi libarc4 bluetooth snd_rawmidi iwlwifi videodev cmdlinepart snd_seq hid_sensor_trigger dell_smbios snd_pcm spi_nor industrialio_triggered_buffer videobuf2_common irqbypass iTCO_wdt mei_me kfifo_buf snd_seq_device ecdh_generic dcdbas hid_sensor_iio_common intel_pmc_bxt nvidia(POE) nls_iso8859_1 bfq joydev snd_timer mc cfg80211 rapl ecc input_leds processor_thermal_device_pci industrialio iTCO_vendor_support mei mtd hid_multitouch dell_wmi_sysman snd processor_thermal_device ledtrig_audio serio_raw dell_wmi_ddv dell_wmi_descriptor intel_cstate processor_thermal_rfim wmi_bmof firmware_attributes_class processor_thermal_mbox soundcore
Mar 01 10:24:46 kernel:  processor_thermal_rapl intel_rapl_common intel_vsec serial_multi_instantiate dptf_power int3403_thermal int340x_thermal_zone mac_hid intel_hid int3400_thermal acpi_thermal_rel sparse_keymap acpi_tad acpi_pad sch_fq_codel kyber_iosched msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables autofs4 usbhid dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear system76_io(OE) system76_acpi(OE) hid_sensor_custom hid_sensor_hub intel_ishtp_hid nvme nvme_core nvme_common spi_pxa2xx_platform ahci dw_dmac dw_dmac_core hid_generic libahci i915 drm_buddy i2c_algo_bit crct10dif_pclmul ttm crc32_pclmul polyval_clmulni drm_display_helper polyval_generic ghash_clmulni_intel sha256_ssse3 cec sha1_ssse3 i2c_hid_acpi aesni_intel rc_core rtsx_pci_sdmmc intel_lpss_pci i2c_hid ucsi_acpi drm_kms_helper intel_ish_ipc crypto_simd hid typec_ucsi xhci_pci spi_intel_pci i2c_i801 intel_lpss cryptd video psmouse i2c_smbus rtsx_pci spi_intel
Mar 01 10:24:46 kernel:  thunderbolt intel_ishtp idma64 vmd xhci_pci_renesas typec drm wmi pinctrl_tigerlake
Mar 01 10:24:46 kernel: CPU: 3 PID: 1000 Comm: irq/189-iwlwifi Tainted: P        W  OE      6.6.10-76060610-generic #202401051437~1704728131~22.04~24d69e2
Mar 01 10:24:46 kernel: Hardware name: Dell Inc. XPS 15 9530/0N92RM, BIOS 1.9.0 11/13/2023
Mar 01 10:24:46 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x83/0x300
Mar 01 10:24:46 kernel: Code: 00 00 f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 61 85 c0 74 10 0f b6 03 84 c0 74 09 f3 90 <0f> b6 03 84 c0 75 f7 b8 01 00 00 00 66 89 03 5b 41 5c 41 5d 41 5e
Mar 01 10:24:46 kernel: RSP: 0018:ffffc90000564af8 EFLAGS: 00000206
Mar 01 10:24:46 kernel: RAX: 0000000000000006 RBX: ffff8881064dc4a0 RCX: 0000000000000000
Mar 01 10:24:46 kernel: RDX: 0000000000000000 RSI: 0000000000000006 RDI: ffff8881064dc4a0
Mar 01 10:24:46 kernel: RBP: ffffc90000564b20 R08: 0000000000000000 R09: 0000000000000000
Mar 01 10:24:46 kernel: R10: 0000000000000000 R11: 0000000000000801 R12: ffffc90000564c50
Mar 01 10:24:46 kernel: R13: ffff88810e160028 R14: ffff8881064dc480 R15: 0000000000000200
Mar 01 10:24:46 kernel: FS:  0000000000000000(0000) GS:ffff88904ef80000(0000) knlGS:0000000000000000
Mar 01 10:24:46 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 01 10:24:46 kernel: CR2: 00002b480aedd400 CR3: 0000000131e02000 CR4: 0000000000f52ee0
Mar 01 10:24:46 kernel: PKRU: 55555554
Mar 01 10:24:46 kernel: Call Trace:
Mar 01 10:24:46 kernel:  <IRQ>
Mar 01 10:24:46 kernel:  ? show_regs+0x6d/0x80
Mar 01 10:24:46 kernel:  ? watchdog_timer_fn+0x1d8/0x240
Mar 01 10:24:46 kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Mar 01 10:24:46 kernel:  ? __hrtimer_run_queues+0x10f/0x2a0
Mar 01 10:24:46 kernel:  ? clockevents_program_event+0xb3/0x140
Mar 01 10:24:46 kernel:  ? hrtimer_interrupt+0xf6/0x250
Mar 01 10:24:46 kernel:  ? __sysvec_apic_timer_interrupt+0x4e/0x150
Mar 01 10:24:46 kernel:  ? sysvec_apic_timer_interrupt+0x3b/0xd0
Mar 01 10:24:46 kernel:  ? asm_sysvec_apic_timer_interrupt+0x1b/0x20
Mar 01 10:24:46 kernel:  ? native_queued_spin_lock_slowpath+0x83/0x300
Mar 01 10:24:46 kernel:  ? __ubsan_handle_out_of_bounds+0xee/0x110
Mar 01 10:24:46 kernel:  _raw_spin_lock_bh+0x43/0x60
Mar 01 10:24:46 kernel:  iwl_txq_reclaim+0xad/0x5d0 [iwlwifi]
Mar 01 10:24:46 kernel:  ? ttwu_queue_wakelist+0x135/0x1c0
Mar 01 10:24:46 kernel:  iwl_mvm_rx_tx_cmd_single+0xf0/0xa30 [iwlmvm]
Mar 01 10:24:46 kernel:  ? iwl_mvm_rx_tx_cmd_single+0xf0/0xa30 [iwlmvm]
Mar 01 10:24:46 kernel:  iwl_mvm_rx_tx_cmd+0x17e/0x1f0 [iwlmvm]
Mar 01 10:24:46 kernel:  ? iwl_mvm_rx_tx_cmd+0x17e/0x1f0 [iwlmvm]
Mar 01 10:24:46 kernel:  iwl_mvm_rx_common+0x13d/0x4d0 [iwlmvm]
Mar 01 10:24:46 kernel:  iwl_mvm_rx_mq+0x7e/0x120 [iwlmvm]
Mar 01 10:24:46 kernel:  iwl_pcie_rx_handle_rb.constprop.0+0xb4/0x530 [iwlwifi]
Mar 01 10:24:46 kernel:  iwl_pcie_rx_handle+0x20b/0x640 [iwlwifi]
Mar 01 10:24:46 kernel:  iwl_pcie_napi_poll_msix+0x32/0x100 [iwlwifi]
Mar 01 10:24:46 kernel:  __napi_poll+0x30/0x1f0
Mar 01 10:24:46 kernel:  net_rx_action+0x181/0x2e0
Mar 01 10:24:46 kernel:  __do_softirq+0xd9/0x349
Mar 01 10:24:46 kernel:  ? __pfx_irq_thread_fn+0x10/0x10
Mar 01 10:24:46 kernel:  do_softirq.part.0+0x41/0x80
Mar 01 10:24:46 kernel:  </IRQ>
Mar 01 10:24:46 kernel:  <TASK>
Mar 01 10:24:46 kernel:  __local_bh_enable_ip+0x72/0x80
Mar 01 10:24:46 kernel:  iwl_pcie_irq_rx_msix_handler+0xd7/0x1a0 [iwlwifi]
Mar 01 10:24:46 kernel:  irq_thread_fn+0x21/0x70
Mar 01 10:24:46 kernel:  irq_thread+0xf8/0x1c0
Mar 01 10:24:46 kernel:  ? __pfx_irq_thread_dtor+0x10/0x10
Mar 01 10:24:46 kernel:  ? __pfx_irq_thread+0x10/0x10
Mar 01 10:24:46 kernel:  kthread+0xef/0x120
Mar 01 10:24:46 kernel:  ? __pfx_kthread+0x10/0x10
Mar 01 10:24:46 kernel:  ret_from_fork+0x44/0x70
Mar 01 10:24:46 kernel:  ? __pfx_kthread+0x10/0x10
Mar 01 10:24:46 kernel:  ret_from_fork_asm+0x1b/0x30
Mar 01 10:24:46 kernel:  </TASK>

Steps to reproduce (if you know):

So far it has only happened to me during video calls. It happens every other day, so I'll know when it's fixed. It's a new install.

Expected behavior:

No crash

Other Notes:

leviport commented 8 months ago

I think that message is the same as https://github.com/pop-os/linux/issues/285

But that bug wasn't causing other issues, so I think it's unrelated to the issues you're running into.

obilodeau commented 8 months ago

I did find that similar bug. Unfortunately, I do think it's related: 2 times out of 2, when I saw that stack trace, my network went down and a CPU went into soft lockup.
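
One way to check this against the journal is to measure how long after the most recent UBSAN report each soft lockup is logged, and compare that with the "stuck for Ns" value the watchdog prints. The following is a minimal sketch under stated assumptions: the log lines are copied from this issue, the year 2024 is assumed because the short journal format omits it, and the helper names `parse` and `lockup_gaps` are hypothetical.

```python
import re
from datetime import datetime

# Lines copied from the logs in this issue (traces and module list omitted).
LOG_LINES = [
    "Mar 01 10:24:20 kernel: UBSAN: array-index-out-of-bounds in /build/linux-CeJFpv/linux-6.6.10/drivers/net/wireless/intel/iwlwifi/queue/tx.c:1580:39",
    "Mar 01 10:24:23 kernel: ieee80211 phy0: Hardware restart was requested",
    "Mar 01 10:24:30 kernel: iwlwifi 0000:00:14.3: Queue 4 is stuck 55356 55656",
    "Mar 01 10:24:46 kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [irq/189-iwlwifi:1000]",
]

STAMP = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) kernel: (.*)$")
STUCK = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s")


def parse(line, year=2024):
    """Split a journal line into (timestamp, message); the year is an assumption."""
    m = STAMP.match(line)
    if m is None:
        return None
    when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return when, m.group(2)


def lockup_gaps(lines):
    """Return (cpu, reported_stuck_s, seconds_since_last_ubsan) for each soft lockup."""
    last_ubsan = None
    results = []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        when, msg = parsed
        if msg.startswith("UBSAN:"):
            last_ubsan = when
            continue
        m = STUCK.search(msg)
        if m and last_ubsan is not None:
            gap = int((when - last_ubsan).total_seconds())
            results.append((int(m.group(1)), int(m.group(2)), gap))
    return results


gaps = lockup_gaps(LOG_LINES)
print(gaps)
```

On the excerpt above this prints `[(3, 26, 26)]`: the lockup on CPU 3 is logged 26 seconds after the UBSAN report, and the watchdog also reports 26 seconds stuck. That is consistent with the CPU having been stuck since roughly the time of the UBSAN report, though it does not by itself prove one caused the other.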

I also get the unrelated VirtualBox array-index-out-of-bounds report, which isn't a problem. That one happens at module load time, not at run time.

obilodeau commented 8 months ago

I have been running version 6.6.10.76060610.202401051437~1709085277~22.04~31d73d8 for a week, with no crashes since.

leviport commented 8 months ago

That's good to hear, but that's basically the same kernel. The version only changed because we added audio fixups for a couple new products we're about to launch. We do have a 6.8 version in testing currently though: https://github.com/pop-os/linux/pull/301