hbldh / bleak

A cross platform Bluetooth Low Energy Client for Python using asyncio
MIT License
1.6k stars 280 forks source link

Add support for removal of interfaces #625

Closed elupus closed 2 years ago

elupus commented 2 years ago

Support removal of interfaces (service, characteristics, descriptors) for bluez backends.

elupus commented 2 years ago

I've not tested this yet. And the test suite is quite miniscule :)

elupus commented 2 years ago

To do some type of verification i had to add some test cases. I'm not 100% sure i got everything correct. I also had to create a new path parser. I've not implemented it's use everywhere where handles are parsed from paths. Though that would make sense to do as a separate pull.

dlech commented 2 years ago

This approach was my first thought as well. However, after looking at what would be needed to handle this cross-platform, I think a different approach may be needed.

Backing up a bit, we need to understand why we are getting InterfacesRemoved in the first place. I recently encountered a bug with WebBluetooth on Linux that I think is basically the same cause as https://github.com/hbldh/bleak/issues/376 in Bleak. In my case, we added a PnP ID characteristic to the Device Information service. This apparently triggers something in BlueZ that causes it to remove the service momentarily before adding it back again.

IIRC, the sequence of events is like this: (EDIT: I did not recall correctly - marked out lined do not actually happen)

Now changing subjects for a moment to discuss the cross-platform aspect. An interesting observation is the change in the ServicesResolved property. The corresponding APIs on other platforms are the BluetoothLEDevice.GattServicesChanged Event on Windows and the CBPeripheralDelegate peripheral:didModifyServices: method on Mac. After both of these events, services must be manually re-enumerated. Mac gives a list of services that were invalidated, but Windows does not give any extra information. So the ServicesResolved: False signal in BlueZ is essentially the same event as the others just described. BlueZ just does the re-enumeration automatically instead of manually.

So here is my thinking...

The quick hack solution

Disable the handling of InterfacesAdded when the connect() method exits.

The connect() method in the BlueZ backend waits for ServicesResolved: True. We need to handle InterfacesAdded before this, but we can disable it after this. This basically gets us on par with the other platforms. If services really change, we don't handle this - but this is already case on other platforms as well (see solution below). If BlueZ does its quirky thing of removing and adding the same services back (with same characteristics and descriptors), then the existing BleakGATTServiceCollection is still valid.

One potential issue here is the window of time between ServicesResolved: False and ServicesResolved: True where D-Bus objects may be removed and not added back yet. If we try to write a characteristic (or enable notifications) during this time, we will get a D-Bus error that the object does not exist on the bus. We could potentially work around this by adding a check to every method that access these D-Bus objects to say "if ServicesResolved == False then await ServicesResolved == True" before trying to send the D-Bus message.

The cross-platform solution

Technically, BLE services can change dynamically after connecting. This is why the events describe above exist. However, I don't think this is used often (and I don't recall any Bleak issues requesting support for this) so we could probably live without this unless there is actually demand for it.

To implement this, basically we would need to add a callback registration similar to the disconnect callback to handle this event. When this event occurs, users would be expected to call get_services() to re-enumerate the services and discard any existing service/characteristic/descriptor objects since these are no longer valid.

elupus commented 2 years ago

Just fixing up the tests. I'm not entirely sure there is anything wrong with supporting the removal of interfaces like this pull does.

I think there could be a race problem with your above indicated semi solution. I think in my case i have race situation where an announce from the device occurs right when we are attempting to connect to it. I noticed i had setup timing in such a way so they lined up.

So i think the client registered for signals, then got the announce from the device (which added interfaces), then connected or something like that. That could be why the services list is different.

This race does seem to remain in the code. So maybe the client should not be listening to ServicesAdded/Removed at all. Only rely on the data it get from GetManagedObjects() on connect?

dlech commented 2 years ago

Can you share logs that show the exact sequence of events you are seeing (mainly interested in the D-Bus traffic)?

dlech commented 2 years ago

It could also be helpful to have bluetoothctl running in a terminal while the problem happens.

dlech commented 2 years ago

It could also be helpful to have bluetoothctl running in a terminal while the problem happens.

I was able to get a reproducible case for this issue. See https://github.com/hbldh/bleak/issues/628 for such a log.

I'm also curious if the workaround suggested in https://github.com/hbldh/bleak/issues/628#issuecomment-904181977 fixes the issue without changing Bleak.

I'm not entirely sure there is anything wrong with supporting the removal of interfaces like this pull does.

Assuming we are seeing the same issue, I mostly agree. However, since there is no similar mechanism to remove individual services/characteristics/descriptors on other platforms, I think there may be a similar alternative that doesn't involve adding del_* methods that will only apply to BlueZ.

This new suggestion is also motivated by the fact that GetManagedObjects has a major performance impact on constrained systems (like routers) and will hopefully be changed in the future anyway.

New suggestion

In scanner.py we basically make a cache of all D-Bus object properties received in a dictionary so that we don't have to call individual Get or GetAll to get the properties. We could do the same here (with the idea of eventually only calling GetManagedObjects once globally for Bleak instead of each time we connect or start scanning - so if the implementations are mostly the same now, there will be less work to do later).

So instead of creating and modifying the BleakGATTServiceCollection object when we call "GetManagedObjects" and received the "InterfacedAdded", "InterfacesRemoved" and "PropertiesChanged" signals, we could just maintain a simple dictionary with the cached object paths and properties. Then when we receive ServicesResolved: True, we populate the BleakGATTServiceCollection object all at once.

This would probably also reduce some duplicate code between https://github.com/hbldh/bleak/blob/a83283411729f7fa42a68f827efe3cf2285ae03a/bleak/backends/bluezdbus/client.py#L254-L286 and https://github.com/hbldh/bleak/blob/a83283411729f7fa42a68f827efe3cf2285ae03a/bleak/backends/bluezdbus/client.py#L1045-L1070.

dlech commented 2 years ago

As noted in https://github.com/bluez/bluez/issues/191#issuecomment-904712819, it seems that even if we take this fix, the new interfaced added by BlueZ will be wrong (assuming we are dealing with the same issue that triggers the D-Bus object removal). So, if that is the case, it might actually be better to not fix Bleak since as-is, users all get the same error message which they can google and find the solution (manually deleting the cache file). Whereas if we "fix" the issue in Bleak, users will likely get a variety of errors or strange behaviors and it will be much harder to find the solution.

elupus commented 2 years ago
[DEL] Characteristic (Handle 0x1dd1)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char000f
        00002a29-0000-1000-8000-00805f9b34fb
        Manufacturer Name String
[DEL] Characteristic (Handle 0x1dd1)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0011
        00002a24-0000-1000-8000-00805f9b34fb
        Model Number String
[DEL] Characteristic (Handle 0x1dd1)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0013
        00002a26-0000-1000-8000-00805f9b34fb
        Firmware Revision String
[DEL] Characteristic (Handle 0x1dd1)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0015
        00002a28-0000-1000-8000-00805f9b34fb
        Software Revision String
[DEL] Secondary Service (Handle 0x0027)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e
        0000180a-0000-1000-8000-00805f9b34fb
        Device Information
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001800-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001801-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 77a2bd49-1e5a-4961-bba1-21f34fa4bc7b
[NEW] Secondary Service (Handle 0xbef9)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e
        0000180a-0000-1000-8000-00805f9b34fb
        Device Information
[NEW] Characteristic (Handle 0x9149)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char000f
        00002a29-0000-1000-8000-00805f9b34fb
        Manufacturer Name String
[NEW] Characteristic (Handle 0x9149)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0011
        00002a24-0000-1000-8000-00805f9b34fb
        Model Number String
[NEW] Characteristic (Handle 0x9149)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0013
        00002a26-0000-1000-8000-00805f9b34fb
        Firmware Revision String
[NEW] Characteristic (Handle 0x9149)
        /org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0015
        00002a28-0000-1000-8000-00805f9b34fb
        Software Revision String
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001800-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001801-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 0000180a-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 77a2bd49-1e5a-4961-bba1-21f34fa4bc7b
[CHG] Device CC:F9:57:96:4B:7B ServicesResolved: yes

Hard to get those logs. I seem to get spammed with rssi updates :). Let me see if removing the cache drops the error.

elupus commented 2 years ago

The cache does not seem to include this address on this system.

elupus commented 2 years ago

Scratch that. Found the file. Missed the controller subdirectory. Removed the cache and indeed. It might actually solve the issue. Rather ugly workaround thou :)

dlech commented 2 years ago

The cache does not seem to include this address on this system.

Do you have

[GATT]
Cache = no

Set in /etc/bluetooth/*.conf?

Also, the cache file is in a protected directory, so you won't see the file unless you are root.


Is the updated layout correct (handle numbers and UUIDS)? On my device, it had new firmware with additional characteristics. The old characteristics were deleted ([DEL] messages), but only the same number of attributes were added back. The last handle was 0x0013 (/org/bluez/hci0/dev_90_84_2B_00_A8_8D/service0010/char0011/desc0013) but should have been 0x0019 (/org/bluez/hci0/dev_90_84_2B_00_A8_8D/service0014/char0017/desc0019).

elupus commented 2 years ago

image

Stupidly i forgot to grab copies of the file before i deleted it. I managed to get this screenshot. I'm just trying to see if i can spot a difference.

elupus commented 2 years ago

Issue re-occured. Let me see if i can grab the file now.

elupus commented 2 years ago

It feels like this devices might be doing weird announces from time to time. I've had it crash, and suddenly be announcing with a different name and different service set. Felt like some sort of fallback values.

I deleted the file. Connected, to the device and directly grabbed the cache. Waited a while for the announces to come in, and voila! Several additional attributes are suddenly added to the set.

image

So it seems it might be giving different data in announce and when connected?

dlech commented 2 years ago

So it seems it might be giving different data in announce and when connected?

When you say "announce" do you mean advertising data or the Service Changed characteristic?

elupus commented 2 years ago

I'm not sure what i mean to be honest. I think i mean advertising data. I'm not at all up to date on BLE protocols. I mean the services and characteristics found when there is only announces going on. But now that you question it, I think there is no known characteristics from just advertisements? They are only resolved once we connect right.

My device is on the limit of radio access. So i get these early disconnects when connecting that i opened the other issue on. Could bluez internally get a messed up cache when we get early disconnects?

dlech commented 2 years ago

The advertising data may contain a list of service UUIDs, but generally not characteristics. If you aren't using this, you could try setting

[GATT]
Cache = yes

in /etc/bluetooth/main.conf to disable caching for devices that don't pair (also called trusted or bonded). There is a different bug in BlueZ that breaks the service UUIDs in advertising data when this setting is set to yes or no (default is always), so this workaround isn't for everyone.

You can use Wireshark to capture Bluetooth traffic to see what exactly is going on. https://bleak.readthedocs.io/en/latest/troubleshooting.html#linux

elupus commented 2 years ago

In my case i could live without advertisement of services. It is detected on name. But i can't run wireshark here. It's an embedded system hassos for home assistant. Only linux system currently here. I suppose I could see if i can spin up something more standard somwhere to test out. Maybe i can try to grab logs on osx.

dlech commented 2 years ago

I think there are ways to grab the capture data from the command line (e.g. btmon) and open it on another computer too. Running Bleak on Mac and capturing that would not be as good since it doesn't have BlueZ and would likely handle certain things differently.

dlech commented 2 years ago

Re-open if I am incorrect, but it seems like we agree that this change is not currently needed.

elupus commented 2 years ago

Well it might be. But the bug remains here. I need advertising data, so can't really use it if it's another bug. We still some fix for the exceptions thrown.

elupus commented 2 years ago

Oh sorry. It was just the advertisment handling of service uuids that had bugs. So maybe i could till get the manufacture data. But i still think we need some fix for the lib throwing exceptions.

dlech commented 2 years ago

FYI, some of the BlueZ bugs will be fixed in 5.62 when it is released if upgrading BlueZ is a possibility.

elupus commented 2 years ago

@dlech here is a btmon dump from the device: https://drive.google.com/file/d/12-APcJs8P_pMMniwKw_eHCeQtkV0YHAU/view?usp=sharing

Let me know when you have it, so i can remove the share. I was trying to understand the data, but it's too much of an unknown territory for me.

You can wireshark filter on this address to get the device: bthci_evt.bd_addr == cc:f9:57:96:4b:7b

dlech commented 2 years ago

We will also need D-Bus logs from the same time period to see what BlueZ is doing in response to the Bluetooth packets.

dlech commented 2 years ago

I don't see anything in the btmon logs that would trigger re-enumerating services. The services and characteristics are only enumerated once per connection and there are no Services Changed indications sent. So either this is the same bug with the BlueZ cache that I was seeing or maybe Bleak isn't properly clearing the services list when the device disconnects (but I thought we fixed that already a while ago)?

dlech commented 2 years ago

bthci_evt.bd_addr == cc:f9:57:96:4b:7b

FWIW, I had better luck with this filter: (!(bthci_evt.code == 0x3e)) && !(frame.protocols == "bluetooth:hci_mon")

elupus commented 2 years ago

Ps. I managed to temporary hack hassos (bind mounted a main.conf onto the readonly file system :)) to run with Cache=yes, and the error goes away. That doesn't feel like a solution thou.

dlech commented 2 years ago

That doesn't feel like a solution thou.

Actually, it should be a good workaround for the BlueZ bug, but I found when doing this, it triggered a different BlueZ bug that prevented services UUIDs from being returned when scanning the device for a second time. But, if you don't need service UUIDs from advertising data, then this should work.

dlech commented 2 years ago

run with Cache=yes, and the error goes away

Based on this information, it sounds like we are both seeing the same problem. So I am still of the opinion that avoiding the exception it causes would make it harder for people to find a solution.

elupus commented 2 years ago

It feels bad to cause exceptions that is just propagating out to asyncio event loop, when the function actually work when they are not thrown. Warning seem more appropriate. By just warning, things work as expected. I get exceptions in my logs once every announce from the device with full backtraces. So the log get's fully filled by just that.

The user of the library is unable to catch them since they occur straight in the asyncio executor.

dlech commented 2 years ago

By just warning, things work as expected.

In my case though, they didn't work as expected. BlueZ gave the old characteristics instead of the new ones. So suppressing the exception would cause strange errors later, like enabling notifications not working because it would in effect write to the wrong handle.

In your case, I think you are just lucky that the old and new are the same. But there is no way for Bleak to know that this is the case that I can see.

dlech commented 2 years ago

Also, did you try the workaround of manually removing the cache file? If the attribute layout on the device is not expected to change in the future, this may be a better workaround than Cache=yes.

bluetoothctl -- remove XX:XX:XX:XX:XX:XX
# prior to BlueZ 5.62 you also need to manually delete the GATT cache
sudo rm "/var/lib/bluetooth/YY:YY:YY:YY:YY:YY/cache/XX:XX:XX:XX:XX:XX"
elupus commented 2 years ago

Won't help, the layout is different every time the device in question boots. It will always announce with one set on initial boot then change. Sometimes it crashes and revert to the initial layout.

This is some generic ble module from ublox which is configured with AT commands on a serial port. So at initial boot it is unconfigured. For some reason it seem to enable before getting an initial config.

Cache=yes works here. But... I will have hard time making that the default for the whole of home assistant.

dlech commented 2 years ago

Do the services in the initial setup include the GATT Database Hash characteristic (0x2B2A) or Service Changed characteristic (0x2A05)?

elupus commented 2 years ago

I had a look at the last log i shared with you. I can't find any database hash or service changed.

On the initial announce from the device where service data is just "passively" sent: we get this service identifier: 01d7e9014ff344e7838fe226b9e15624

Frame 33: 44 bytes on wire (352 bits), 44 bytes captured (352 bits)
Bluetooth
Bluetooth Linux Monitor Transport
Bluetooth HCI Event - LE Meta
    Event Code: LE Meta (0x3e)
    Parameter Total Length: 42
    Sub Event: LE Advertising Report (0x02)
    Num Reports: 1
    Event Type: Connectable Undirected Advertising (0x00)
    Peer Address Type: Public Device Address (0x00)
    BD_ADDR: cc:f9:57:96:4b:7b (cc:f9:57:96:4b:7b)
    Data Length: 30
    Advertising Data
        Flags
        Tx Power Level
        Slave Connection Interval Range: 30 - 50 msec
        Connection Interval Min: 24 (30 msec)
        Connection Interval Max: 40 (50 msec)
        128-bit Service Class UUIDs
            Length: 17
            Type: 128-bit Service Class UUIDs (0x07)
            Custom UUID: 01d7e9014ff344e7838fe226b9e15624 (Unknown)
    RSSI (dB): -78

Somewhat later, it announces 7bbca44ff321a1bb61495a1e49bda277

Frame 210: 35 bytes on wire (280 bits), 35 bytes captured (280 bits) Bluetooth Bluetooth Linux Monitor Transport Bluetooth HCI Event - LE Meta Event Code: LE Meta (0x3e) Parameter Total Length: 33 Sub Event: LE Advertising Report (0x02) Num Reports: 1 Event Type: Connectable Undirected Advertising (0x00) Peer Address Type: Public Device Address (0x00) BD_ADDR: cc:f9:57:96:4b:7b (cc:f9:57:96:4b:7b) Data Length: 21 Advertising Data Flags 128-bit Service Class UUIDs Length: 17 Type: 128-bit Service Class UUIDs (0x07) Custom UUID: 7bbca44ff321a1bb61495a1e49bda277 (Unknown) RSSI (dB): -81

On connection later on, we get the same 7bbca44ff321a1bb61495a1e49bda277 as second announce above.

Frame 3000: 30 bytes on wire (240 bits), 30 bytes captured (240 bits)
Bluetooth
Bluetooth Linux Monitor Transport
    [Adapter ID: 0]
    [Opcode: ACL Rx Packet (0x0005)]
Bluetooth HCI ACL Packet
Bluetooth L2CAP Protocol
Bluetooth Attribute Protocol
    Opcode: Read By Group Type Response (0x11)
    Length: 20
    Attribute Data, Handle: 0x0017, Group End Handle: 0xffff, UUID128: Unknown
        Handle: 0x0017 (Unknown)
        Group End Handle: 0xffff
        UUID: 7bbca44ff321a1bb61495a1e49bda277
    [UUID: GATT Primary Service Declaration (0x2800)]
    [Request in Frame: 2998]

So i'm not sure how the cache behaves in bluez here. Maybe it never prunes the initial service, and on every connect starts up by saying, oh.. this service doesn't exist, let's ask to remove it. on every connect.

dlech commented 2 years ago

The services in the advertising data are independent of the services that are enumerated after connecting. The advertising data just sends the UUID and nothing else.

I can't find any database hash or service changed.

I looked again and it looks like there is a service changed attribute (handle 0x000d). BlueZ enables indications on this when the device connects.

Another way to see all services/characteristics is to connect to the device with nRF Connect on Android.

Looking at the packet log again, I can see that only handles up to and not including 0x0013 are enumerated when the device connects, so this must be the first set of initial attributes. Then later, handle 0x0019 is written, which must be the custom attribute that is added later when the device is reconfigured. So, I think the device should be indicating on the service changed characteristic to indicate to BlueZ that things changed.


Looking at this from a different angle, would it be possible to make the device not advertise until after the services/characteristics are reconfigured? This should avoid confusing BlueZ in the first place.

elupus commented 2 years ago

I wish.. but these are likely in the wild, and are very unlikely to get firmware updates (kitchen fans with no way short of debugger connection to update). I will forward it to the company though.

But why would it indicate service changed? the services likely changed while blues was not connected.

dlech commented 2 years ago

It could very well be that the services change before connecting. In the packet logs though, for some reason, it gives a different response to the same request at different times. I've highlighted it in the screenshot below.

image

The first time, it claims there are attributes 0x0001-0x0012 but the second time it says there are attributes 0x0001-0x0016.

I think this is what is triggering the removal and re-adding of D-Bus interfaces.

dlech commented 2 years ago

At this point, I feel like I understand what is going on well enough to make a decision. I think Cache=yes would probably be the best "fix", but you mentioned this was difficult to implement (and it can trigger other BlueZ bugs), so I will consider merging #622 if that works for you.

dlech commented 2 years ago

And thanks for all of the feedback and persistence on this issue.

elupus commented 2 years ago

Yes #622 works fine. That way we keep indicating the oddness / bad behaviour of the device, but we don't bail out of additional message handling.

Yes, Cache=yes might be best for this specific device, but the ble client might have multiple other things active. Beacons, phones and what nots, and i would be unable to argue for making that the default on host level.

i had missed what you highlighted above. I will see if I can get contact with the SW devs, but my hopes are not too high. There is a middle man between the actual developers and me.