raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.17k stars 5.01k forks source link

Bluetooth unreliable on Linux 6.1 #5460

Open agners opened 1 year ago

agners commented 1 year ago

Describe the bug

In the latest Home Assistant OS release 10 we moved from the Raspberry Pi Linux kernel 5.15 to 6.1. Since the release 10 we got quite some reports about Bluetooth outages after a couple of hours runtime, specifically on Raspberry Pi devices. The outages on Raspberry Pi devices have the following kerne logs in common:

May 02 12:11:40 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:40 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:40 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:40 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:41 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:41 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Received unexpected HCI Event 0x00
May 02 12:11:41 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:47 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:47 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Received unexpected HCI Event 0x00
May 02 12:11:47 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:53 ha-shelf1-rpi3 kernel: Bluetooth: hci0: command 0x200c tx timeout
May 02 12:11:53 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Opcode 0x200c failed: -110
May 02 12:11:53 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Unable to disable scanning: -110
May 02 12:11:53 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 02 12:11:53 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Received unexpected HCI Event 0x00
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Opcode 0x200c failed: -110
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: command 0x200c tx timeout
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Unable to disable scanning: -110
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Unable to disable scanning: -110
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Opcode 0x200c failed: -16
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 02 12:11:55 ha-shelf1-rpi3 kernel: Bluetooth: hci0: stop background scanning failed: -16
May 02 12:12:05 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Opcode 0x200c failed: -16
May 02 12:12:05 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 02 12:12:05 ha-shelf1-rpi3 kernel: Bluetooth: hci0: Unable to disable scanning: -16

From this point onward it is no longer possible to use Bluetooth. A reboot recovers the system.

I've been able to reproduce the issue locally on the Raspberry Pi 3 Model B out of three Raspberry Pi devices (Raspberry Pi 3 Model B, 3+ Model B and 4). On the device which shows the problem, I've noticed that Bluetooth: hci0: Frame reassembly failed (-84) errors have been quite common even on Linux 5.15, but it seems that the older kernel was able to recover on its own (no successive command/Opcode etc. errors). On other devices, I've not seen the error even on Linux 5.15, so it seems this problem is device specific (only happens on some). However, from other people reports all device types seem affected (Pi 3/3+/Pi 4/CM4).

We are using the full UART for Bluetooth by default.

Since Home Assistant OS also updated other components (like BlueZ) I built a OS dev version using Linux 5.15.84 (tag 1.20230106), with all other components the latest version as used in HAOS 10.1. That build showed the reassembly errors too, but recovered just fine. So with this it is proven that the problem is a Linux 6.1 issue.

This is essentially a upstream report of our Home Assistant OS issue https://github.com/home-assistant/operating-system/issues/2485.

Steps to reproduce the behaviour

  1. Install Home Assistant 10.1
  2. Enable Bluetooth, probably also requires some Bluetooth BLE devices (e.g. the popular Xiaomi Mi Temperature and Humidity monitors).
  3. Wait for reassembly errors to occur and complete communication breakdown (typically 6-12h)

Device (s)

Raspberry Pi 3 Mod. B, Raspberry Pi 3 Mod. B+, Raspberry Pi 4 Mod. B, Raspberry Pi CM4, Raspberry Pi CM4 Lite

System

Home Assistant OS 10.1 Linux 6.1.21 (built from the 1.20230405 tag)

Logs

No response

Additional context

No response

pelwell commented 1 year ago

Bluetooth modem initialisation has changed recently and is now managed by the kernel, but from that point on it should be the same, so I'm surprised at the nature of the errors you and others are seeing.

I'm pleased to hear that you are using the recommended (and default) configuration of using UART0 (ttyAMA0) to drive the BT modem; it is the best option. The Pi 3 is the most troublesome because it doesn't connect up the flow-control signals (those pins were used for something else), and for that reason the baudrate is limited to 921600. However, that shouldn't have changed with the move to kernel modem initialisation.

Wait... tag 1.20230405 predates the big switchover, which went live three weeks later (2023-04-26), so something else must have changed between 5.15.84 and 6.1.21. In a way that's a relief, but it may make it harder to locate the breaking change.

agners commented 1 year ago

I'm pleased to hear that you are using the recommended (and default) configuration of using UART0 (ttyAMA0) to drive the BT modem; it is the best option. The Pi 3 is the most troublesome because it doesn't connect up the flow-control signals (those pins were used for something else), and for that reason the baudrate is limited to 921600. However, that shouldn't have changed with the move to kernel modem initialisation.

Just double checked, hci0 is indeed running on ttyAMA0. Also, we are using 460800 as baudrate. It seems we've added a patch a while ago to lower the baudrate, and there was also a reference to the https://github.com/raspberrypi/firmware/issues/1150 issue.

That said, this is my test instance. It could be that other folks which report the problem (especially on RPi4/CM4) actually use a different config.txt. I'll have to check.

pelwell commented 1 year ago

That's definitely using the old hciattach utility - if the kernel set up the modem there would be no /dev/ttyAMA0 to see.

Is the WiFi connected and in use?

agners commented 1 year ago

Is the WiFi connected and in use?

WiFi is not in use and disconnected.

pelwell commented 1 year ago

Thanks, that means my current test should be at least somewhat representative.

agners commented 1 year ago

Bluetooth modem initialisation has changed recently and is now managed by the kernel, but from that point on it should be the same, so I'm surprised at the nature of the errors you and others are seeing.

I see, this is the change you are referring to: https://github.com/raspberrypi/linux/commit/ec08e93837efa2e75e5cab4d80427fdf7ea23da1

Good to see we can push this out to the kernel, it will make userspace simpler. I'll definitely give this a try sometime in the future.

But I think the problem at hand is unrelated to the error we are seeing, as the HCI communication at this point is already under the kernels control. To me it looks like different parsing/error recovery behavior of the Bluetooth HCI handling logic in the kernel.

That said, I wonder why we even have these hci0: Frame reassembly failed. Is the low-level UART communication issues or is this higher level buffer overflows (lack of flow control)?

pelwell commented 1 year ago

But I think the problem at hand is unrelated to the error we are seeing, as the HCI communication at this point is already under the kernels control. To me it looks like different parsing/error recovery behavior of the Bluetooth HCI handling logic in the kernel.

Indeed.

That said, I wonder why we even have these hci0: Frame reassembly failed. Is the low-level UART communication issues or is this higher level buffer overflows (lack of flow control)?

If I knew that...

agners commented 1 year ago

That said, I wonder why we even have these hci0: Frame reassembly failed. Is the low-level UART communication issues or is this higher level buffer overflows (lack of flow control)?

If I knew that...

From what I can tell the driver checks the overrun flag of the FIFO (see https://github.com/raspberrypi/linux/blob/1.20230405/drivers/tty/serial/amba-pl011.c#L334-L335), so I should see overrun counts in /proc/tty/driver/ttyAMA if the latter is the case. Unfortunately, I just rebooted to try with 115200, so my system has cleared the stats again :cry: I'll check the value next time I encounter frame reassembly errors.

agners commented 1 year ago

So with 115200 I did not get any overruns for 24h+. Today I switched back to 921600, and got two events. And it seems those are buffer overruns:

# dmesg | grep hci0
[42686.118707] Bluetooth: hci0: Frame reassembly failed (-84)
[42686.118948] Bluetooth: hci0: Frame reassembly failed (-84)
[53440.602688] Bluetooth: hci0: Frame reassembly failed (-84)
[53440.602927] Bluetooth: hci0: Frame reassembly failed (-84)
# cat /proc/tty/driver/ttyAMA 
serinfo:1.0 driver revision:
0: uart:PL011 rev2 mmio:0x3F201000 irq:114 tx:197234 rx:3403475 oe:2 RTS|CTS|DTR

I see that the RPi config uses rather conservative values for the scheduler:

CONFIG_PREEMPT_VOLUNTARY=y
CONFIG_HZ=100

Maybe higher HZ and/or fully preemptive kernel helps here?

agners commented 1 year ago

That said, it is still interesting that the 5.15 was able to recover from these overruns. So ideally we'd have a Bluetooth stack which can handle the reassembly errors :sweat_smile:

agners commented 1 year ago

For testing I switched to a mainline kernel on the affected device. The mainline kernel has overruns as well, but it seems that the Bluetooth stack can recover in this situation:

# cat /proc/tty/driver/ttyAMA 
serinfo:1.0 driver revision:
0: uart:PL011 rev2 mmio:0x3F201000 irq:99 tx:57297 rx:487185 oe:10 RTS|CTS|DTR
# uname -a
Linux homeassistant 6.1.28-v8 #1 SMP PREEMPT Fri May 12 11:49:03 UTC 2023 aarch64 HAOS
# dmesg | grep hci0
[    9.028494] Bluetooth: hci0: BCM: chip id 94
[    9.029017] Bluetooth: hci0: BCM: features 0x2e
[    9.030785] Bluetooth: hci0: BCM43430A1
[    9.030825] Bluetooth: hci0: BCM43430A1 (001.002.009) build 0000
[    9.045975] Bluetooth: hci0: BCM43430A1 'brcm/BCM43430A1.hcd' Patch
[    9.668550] Bluetooth: hci0: BCM: features 0x2e
[    9.670367] Bluetooth: hci0: BCM43438A1 37.4MHz Raspberry Pi 3-0141
[    9.670403] Bluetooth: hci0: BCM43430A1 (001.002.009) build 0508
[   47.651404] Bluetooth: hci0: Opcode 0x c14 failed: -110
[   47.651404] Bluetooth: hci0: command 0x0c14 tx timeout
[   48.469660] Bluetooth: hci0: unexpected event for opcode 0x0c14
[   48.471036] Bluetooth: hci0: Frame reassembly failed (-84)
[   50.467406] Bluetooth: hci0: Opcode 0x c03 failed: -110
[  547.339672] Bluetooth: hci0: Frame reassembly failed (-84)
[ 1266.027436] Bluetooth: hci0: Frame reassembly failed (-90)
[ 1266.027576] Bluetooth: hci0: Frame reassembly failed (-84)
[ 3305.180717] Bluetooth: hci0: Frame reassembly failed (-84)
[ 3305.180838] Bluetooth: hci0: Frame reassembly failed (-84)
[ 5532.156662] Bluetooth: hci0: Frame reassembly failed (-84)
[ 5532.156774] Bluetooth: hci0: Frame reassembly failed (-84)
[ 6435.178340] Bluetooth: hci0: Frame reassembly failed (-84)
[ 6435.178495] Bluetooth: hci0: Frame reassembly failed (-84)
[ 6675.408159] Bluetooth: hci0: Frame reassembly failed (-84)
[ 6675.408308] Bluetooth: hci0: Frame reassembly failed (-84)
[ 9848.851944] Bluetooth: hci0: Frame reassembly failed (-84)

I've not yet reviewed/checked downstream changes, but either there is an upstream fix between 6.1.21 and 6.1.28 or there is something broken in the downstream kernel.

popcornmix commented 1 year ago

I've not yet reviewed/checked downstream changes, but either there is an upstream fix between 6.1.21 and 6.1.28 or there is something broken in the downstream kernel.

rpi-update is now on 6.1.28 so testing that may narrow things down.

agners commented 1 year ago

I've tested with the kernel tree at https://github.com/raspberrypi/linux/commit/cb9e651718d601c6f2ff1116b4c98c11063302f7. I still use the old initialization mechanism, however, since that is only initialization, it shouldn't matter really.

May 22 16:25:26 homeassistant kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 22 16:25:26 homeassistant kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 22 17:00:40 homeassistant kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 22 17:00:40 homeassistant kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 22 17:02:28 homeassistant kernel: Bluetooth: hci0: Frame reassembly failed (-84)
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: command 0x200c tx timeout
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Opcode 0x200c failed: -110
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -110
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Opcode 0x200c failed: -16
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Opcode 0x200c failed: -16
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 22 17:02:30 homeassistant kernel: Bluetooth: hci0: stop background scanning failed: -16
May 22 17:02:40 homeassistant kernel: Bluetooth: hci0: Opcode 0x200c failed: -16
May 22 17:02:40 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 22 17:02:40 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 22 17:02:40 homeassistant kernel: Bluetooth: hci0: Opcode 0x200c failed: -16
May 22 17:02:40 homeassistant kernel: Bluetooth: hci0: Unable to disable scanning: -16
May 22 17:02:40 homeassistant kernel: Bluetooth: hci0: stop background scanning failed: -16
pelwell commented 1 year ago

I agree that the initialisation method shouldn't affect the issue, but until I can reproduce and understand it I'm not going to rule it out.

Are you seeing overruns with those errors?

agners commented 1 year ago

Are you seeing overruns with those errors?

Yes, the overruns strongly correlate with the Frame assembly failed errors.