raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

3B+ Rev 1.3 (5.15.84-v8+): AP mode through NetworkManager triggers kernel panic. #5380

Closed fkelava closed 1 year ago

fkelava commented 1 year ago

Describe the bug

With the latest release of Raspbian 11 Lite, when I set up the 3B+'s onboard WiFi to act as an AP and try to connect to it, I instantly experience a kernel panic. I am not sure what to ascribe it to, although at this point I have tried:

1) Replacing the power source. I have tried multiple 5V/3A adapters and even a bench power supply rated up to 12V/5A. I do not see any undervoltage errors in my logs that would indicate insufficient power. vcgencmd get_throttled returns a firm 0x0.
2) Removing the add-ons I have bolted onto the Pi. I have a Sixfab 3G/4G Base HAT with a Quectel EC25 modem and a Seeed dual-CAN FD HAT. The presence or absence of either of these has no impact on the recurrence rate. Additionally, each has been tested to work fine standalone.
3) Scouring /var/log/kern.log for some indication as to what has failed, applying brcmfmac.debug=0x100000 in /boot/cmdline.txt. It reports errors, but I don't know what they mean or how I can fix them. They are attached below.
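For anyone checking the power-supply angle on their own board, the value returned by vcgencmd get_throttled is a bitmask. The following is a minimal sketch of a decoder, assuming the bit meanings from the Raspberry Pi documentation; the function name decode_throttled is made up for this example and is not part of any tool.

```shell
# Decode the bitmask printed by `vcgencmd get_throttled` (e.g. "throttled=0x0").
# decode_throttled is a hypothetical helper name used only for this sketch.
decode_throttled() {
    throttled_value=$(( $1 ))   # accepts hex input such as 0x50005
    throttled_found=0
    # bit:meaning pairs, assumed from the Raspberry Pi documentation
    for throttled_entry in \
        "0:under-voltage detected" \
        "1:ARM frequency capped" \
        "2:currently throttled" \
        "3:soft temperature limit active" \
        "16:under-voltage has occurred" \
        "17:ARM frequency capping has occurred" \
        "18:throttling has occurred" \
        "19:soft temperature limit has occurred"
    do
        throttled_bit=${throttled_entry%%:*}
        if [ $(( $throttled_value & (1 << $throttled_bit) )) -ne 0 ]; then
            echo "${throttled_entry#*:}"
            throttled_found=1
        fi
    done
    if [ "$throttled_found" -eq 0 ]; then
        echo "no flags set"
    fi
}
```

On the Pi itself this could be called as decode_throttled "$(vcgencmd get_throttled | cut -d= -f2)". A result of 0x0, as reported here, means none of the flags have been raised since boot.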

Steps to reproduce the behaviour

  1. Perform a completely fresh install of 2023-02-21-raspios-bullseye-arm64-lite.
  2. Upon first boot, sudo raspi-config, and swap the network stack to NetworkManager. Reboot when prompted.
  3. (Optional) Perform a sudo apt update && sudo apt upgrade to ensure the latest packages. I can trigger the panic all the same without doing so.
  4. (Optional) As I have a USB LTE board connected, I initialized it with sudo mmcli -m 0 -e, nmcli c add type gsm ifname 'cdc-wdm0' con-name <LTE_NAME> apn <LTE_APN>, and nmcli r wwan on. I can trigger the panic all the same without doing so.
  5. nmcli d wifi hotspot ifname wlan0 ssid <SSID> password <password>
  6. nmcli c modify <HOTSPOT_CONN_NAME> connection.autoconnect yes
  7. Reboot if desired. Then connect to said AP.
  8. Upon connection, the kernel will panic. This happens in approximately 95 of 100 connection attempts, and is instant: it does not fail at some unspecified later point, but immediately upon connection.
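The hotspot part of the steps above (5 and 6) can be sketched as a small script. This is only an illustration: the SSID, passphrase, and connection name are placeholders, and passing con-name explicitly is an addition of this sketch so that the second command knows which connection to modify. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
# Sketch of steps 5 and 6. All argument values are placeholders.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: setup_hotspot <ssid> <password> <connection-name>
setup_hotspot() {
    # Step 5: create the hotspot on the onboard WiFi (wlan0)
    run nmcli d wifi hotspot ifname wlan0 con-name "$3" ssid "$1" password "$2"
    # Step 6: bring the hotspot up automatically at boot
    run nmcli c modify "$3" connection.autoconnect yes
}
```

For example, setup_hotspot MySSID MyPassphrase MyHotspot prints the two nmcli invocations without touching the system.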

Device(s)

Raspberry Pi 3 Mod. B+

System

uname -a:

Linux telemetry 5.15.84-v8+ #1613 SMP PREEMPT Thu Jan 5 12:03:08 GMT 2023 aarch64 GNU/Linux

vcgencmd version:

Feb 22 2023 10:48:01
Copyright (c) 2012 Broadcom
version 74a4b109e7f5be465332a1f102649d34f8498d05 (clean) (release) (start)

cat /etc/rpi-issue:

Raspberry Pi reference 2023-02-21
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 25e2319effa91eb95edd9d9208eb9f8a584d67be, stage2

Logs

The panic stack trace itself is:

peppy@telemetry:~ $ [  148.747313] Unable to handle kernel paging request at virtual address 0000401f38c9ef2b 
[ 148.747529] Mem abort info:
[ 148.747611]   ESR = 0x96000004
[ 148.747750]   EC = 0x25: DABT (current EL), IL = 32 bits 
[ 148.747884]   SET = 0, FnV = 0
[ 148.747971]   EA = 0, S1PTW = 0
[ 148.748069]   FSC = 0x04: level 0 translation fault
[ 148.748190] Data abort info:
[ 148.748270]   ISV = 0, ISS = 0x00000004
[ 148.748370]   CM = 0, WnR = 0
[ 148.748455] [0000401f38c9ef2b] address between user and kernel address ranges
[ 148.748631] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 148.748761] Modules linked in: cmac algif_hash aes_arm64 aes_generic algif_skcipher af_alg bnep hci_uart btbcm bluetooth ecdh_generic ecc libaes nft_chain_nat xt_MASQUERADE xt_state xt_conntrack ipt_REJECT nf_reject_ipv4 nft_counter xt_tcpudp nft_compat nf_tables nfnetlink nf_nat_h323 nf_conntrack_h323 nf_nat_pptp nf_conntrack_pptp nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 option qmi_wwan usb_wwan cdc_wdm usbserial vc4 snd_soc_hdmi_codec cec drm_kms_helper snd_soc_core brcmfmac snd_compress brcmutil snd_pcm_dmaengine syscopyarea sysfillrect cfg80211 sysimgblt fb_sys_fops raspberrypi_hwmon bcm2835_v4l2(C) bcm2835_isp(C) bcm2835_codec(C) v4l2_mem2mem bcm2835_mmal_vchiq(C) i2c_bcm2835 videobuf2_dma_contig videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 rfkill videobuf2_common snd_bcm2835(C) snd_pcm videodev snd_timer snd mc vc_sm_cma(C) uio_pdrv_genirq uio
[ 148.749307] drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6
[ 148.751138] CPU: 0 PID: 453 Comm: in:imklog Tainted: G C 5.15.84-v8+ #1613
[ 148.751319] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
[ 148.751455] pstate: 40000005 (nZcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) 
[ 148.751611] pc : skb_release_data+0xa4/0x190
[ 148.751726] lr : skb_release_all+0x30/0x40
[ 148.751827] sp : ffffffc0089039c0
[ 148.751986] x29: ffffffc0089039c0 x28: ffffff803b189c50 x27: ffffff803b189b50
[ 148.752076] x26: 0000000000000001 x25: 00000000000000e0 x24: ffffff803b189c4c
[ 148.752241] x23: 0000000000000001 x22: ffffff8005cf33c0 x21: ffffff8003d2d500
[ 148.752411] x20: ffffff8005cf33f0 x19: 0000000000000000 x18: 0000000000000010 
[ 148.752579] x17: 667820676e696d6d x16: 6972743a3135373a x15: ffffff8007ac0400
[ 148.752747] x14: 6f635f726566785f x13: 65746174735f6272 x12: ffffffe01f5b6660
[ 148.759119] x11: 0000000000000003 x10: ffffffe01f59e620 x9 : ffffffe01eb639d8
[ 148.765502] x8 : 0000000000000000 x7 : 0000000000000040 x6 : ffffff8005cf2e40
[ 148.771858] x5 : ffffffa01bf2a000 x4 : 0000000000000101 x3 : ffffff8006f7bc80
[ 148.778196] x2 : ffffff8005cf33c0 x1 : ffffff8005cf2e00 x0 : 3100101f38c9ef23
[ 148.784535] Call trace:
[ 148.790858]  skb_release_data+0xa4/0x190 
[ 148.797218]  skb_release_all+0x30/0x40 
[ 148.803543]  kfree_skb_reason+0x60/0x120
[ 148.809898]  ip_rcv_core.isra.26+0x280/0x3b8
[ 148.816276]  ip_rcv+0x48/0x100
[ 148.822634]  __netif_receive_skb_one_core+0x60/0x88
[ 148.829024]  __netif_receive_skb+0x20/0x78
[ 148.835363]  process_backlog+0xbc/0x1a8
[ 148.841706]  __napi_poll+0x44/0x230
[ 148.848854]  net_rx_action+0x298/0x2e0
[ 148.854436]  __do_softirq+0x1a8/0x4ec
[ 148.860843]  do_softirq+0xcc/0xe0
[ 148.867245]  __local_bh_enable_ip+0x100/0x108
[ 148.873687]  fpsimd_restore_current_state+0x5c/0xc8
[ 148.880176]  do_notify_resume+0xcc/0x468
[ 148.886688]  el0_svc+0x58/0x60
[ 148.893285]  el0t_64_sync_handler+0x90/0xb8
[ 148.899757]  el0t_64_sync+0x1a0/0x1a4
[ 148.906271] Code: 91004294 6b13001f 5400028d f9400280 (f9400401)
[ 148.912811] ---[ end trace b713d99e7412a07f ]---
[ 148.919389] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[ 148.926091] SMP: stopping secondary CPUs
[ 148.932775] Kernel Offset: 0x2016200000 from 0xffffffc008000000
[ 148.939485] PHYS_OFFSET: 0x0
[ 148.946195] CPU features: 0x00003401,00000846
[ 148.952936] Memory Limit: none
[ 148.959661] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
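To help read the "Mem abort info" lines in the trace above, here is a minimal sketch that splits an arm64 ESR value into its fields. It assumes the standard ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts the fault status code in ISS bits 5:0 and WnR in ISS bit 6); decode_esr is a made-up helper name, not a kernel or distro tool.

```shell
# Split an arm64 ESR value into the fields the kernel prints in an oops.
# decode_esr is a hypothetical helper; field positions are assumed from
# the Arm architecture's ESR_ELx layout.
decode_esr() {
    esr_value=$(( $1 ))
    esr_ec=$(( ($esr_value >> 26) & 0x3f ))    # exception class
    esr_il=$(( ($esr_value >> 25) & 1 ))       # 1 = 32-bit instruction
    esr_iss=$(( $esr_value & 0x1ffffff ))      # instruction-specific syndrome
    esr_fsc=$(( $esr_iss & 0x3f ))             # fault status code (data abort)
    esr_wnr=$(( ($esr_iss >> 6) & 1 ))         # 1 = write, 0 = read
    printf 'EC=0x%02x IL=%d ISS=0x%08x FSC=0x%02x WnR=%d\n' \
        "$esr_ec" "$esr_il" "$esr_iss" "$esr_fsc" "$esr_wnr"
}
```

Running decode_esr 0x96000004 gives EC=0x25 (data abort at the current exception level), FSC=0x04 (level 0 translation fault), and WnR=0 (a read), which matches what the kernel printed: a read through a bad pointer while freeing an skb.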

dmesg | grep brcmf yields a number of the following errors:

brcmfmac: CONSOLE: 000000.096 wl0: unable to find iovar "rsdb_mode"
brcmfmac: CONSOLE: 000000.096 wl0: wlc_iovar_op: rsdb_mode BCME -23 (Unsupported)
brcmfmac: CONSOLE: 000004.597 wl0: wlc_phy_set_regtbl_on_femctrl: FIXME bt_coex
brcmfmac: CONSOLE: 000005.135 wl0: unable to find iovar "toe_ol"
brcmfmac: CONSOLE: 000005.135 wl0: wlc_iovar_op: toe_ol BCME -23 (Unsupported)
brcmfmac: CONSOLE: 000008.059 wl0: malformed chanspec 0x0
brcmfmac: CONSOLE: 000008.802 wl0: unable to find iovar "nd_hostip_clear"
brcmfmac: CONSOLE: 000008.802 wl0: wlc_iovar_op: nd_hostip_clear BCME -23 (Unsupported)

at various intervals. The nd_hostip_clear line is the last to be recorded prior to panic.
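To find the last firmware console message before a panic without scrolling through the whole log, something like the following could be used. It is only a sketch: last_brcmf_console_line is a made-up helper name, and it assumes the firmware console lines were captured in a log file such as /var/log/kern.log with brcmfmac.debug=0x100000 set.

```shell
# Print the last brcmfmac firmware console line found in a log file.
# last_brcmf_console_line is a hypothetical helper for this sketch.
# Usage: last_brcmf_console_line /var/log/kern.log
last_brcmf_console_line() {
    grep 'brcmfmac: CONSOLE:' "$1" | tail -n 1
}
```

If the log spans several boots, the file would first need to be trimmed to the boot that panicked, since this simply returns the final match in the file.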

Additional context

No response

fkelava commented 1 year ago

I've investigated further. The same exact board (with same exact HATs and power supply) does not exhibit any AP mode weirdness on Ubuntu Core 22. Once again, my methodology was the same; update the system to obtain the latest packages, set up NetworkManager and ModemManager, and enable the AP in the most straightforward way.

I'm happy to try and obtain further logs or traces for you to hopefully narrow this down, assuming you can reproduce it.

pelwell commented 1 year ago

It worked for me this evening on an updated older image with the exact same kernel on a 4B (same wireless chip and firmware), but I can test a new Lite image on a 3B+ tomorrow.

fkelava commented 1 year ago

You won't need to. I've pinpointed the defect to NetworkManager/ModemManager. I tried the latest RasPi OS 11 Lite again, using Quectel's own QMI-based connection tool for LTE and dhcpcd, dnsmasq and hostapd for the AP. Lo and behold, no issue in sight. I also eventually got Ubuntu Core 22 to panic in the exact same way with NM/MM; it just took a bit longer. Clearly there's something amiss in this specific combination. I leave it to more capable minds to discover what.

For posterity, these are my findings: if the LTE USB is attached after the system boots and the AP is already engaged, LTE will never connect and will permanently display a signal strength of 0%. If the LTE USB is attached at boot, it will connect, but then any client connecting to the AP will instantly result in a panic. Either works fine on its own; it's the combination that kills.

Feel free to reopen this issue if this is ever encountered in the wild again.