Closed elupus closed 2 years ago
I've not tested this yet. And the test suite is quite miniscule :)
To do some type of verification i had to add some test cases. I'm not 100% sure i got everything correct. I also had to create a new path parser. I've not implemented it's use everywhere where handles are parsed from paths. Though that would make sense to do as a separate pull.
This approach was my first thought as well. However, after looking at what would be needed to handle this cross-platform, I think a different approach may be needed.
Backing up a bit, we need to understand why we are getting InterfacesRemoved
in the first place. I recently encountered a bug with WebBluetooth on Linux that I think is basically the same cause as https://github.com/hbldh/bleak/issues/376 in Bleak. In my case, we added a PnP ID characteristic to the Device Information service. This apparently triggers something in BlueZ that causes it to remove the service momentarily before adding it back again.
IIRC, the sequence of events is like this: (EDIT: I did not recall correctly - marked out lined do not actually happen)
Now changing subjects for a moment to discuss the cross-platform aspect. An interesting observation is the change in the ServicesResolved property. The corresponding APIs on other platforms are the BluetoothLEDevice.GattServicesChanged Event on Windows and the CBPeripheralDelegate peripheral:didModifyServices: method on Mac. After both of these events, services must be manually re-enumerated. Mac gives a list of services that were invalidated, but Windows does not give any extra information. So the ServicesResolved: False signal in BlueZ is essentially the same event as the others just described. BlueZ just does the re-enumeration automatically instead of manually.
So here is my thinking...
Disable the handling of InterfacesAdded when the connect()
method exits.
The connect()
method in the BlueZ backend waits for ServicesResolved: True. We need to handle InterfacesAdded before this, but we can disable it after this. This basically gets us on par with the other platforms. If services really change, we don't handle this - but this is already case on other platforms as well (see solution below). If BlueZ does its quirky thing of removing and adding the same services back (with same characteristics and descriptors), then the existing BleakGATTServiceCollection
is still valid.
One potential issue here is the window of time between ServicesResolved: False and ServicesResolved: True where D-Bus objects may be removed and not added back yet. If we try to write a characteristic (or enable notifications) during this time, we will get a D-Bus error that the object does not exist on the bus. We could potentially work around this by adding a check to every method that access these D-Bus objects to say "if ServicesResolved == False then await ServicesResolved == True" before trying to send the D-Bus message.
Technically, BLE services can change dynamically after connecting. This is why the events describe above exist. However, I don't think this is used often (and I don't recall any Bleak issues requesting support for this) so we could probably live without this unless there is actually demand for it.
To implement this, basically we would need to add a callback registration similar to the disconnect callback to handle this event. When this event occurs, users would be expected to call get_services()
to re-enumerate the services and discard any existing service/characteristic/descriptor objects since these are no longer valid.
Just fixing up the tests. I'm not entirely sure there is anything wrong with supporting the removal of interfaces like this pull does.
I think there could be a race problem with your above indicated semi solution. I think in my case i have race situation where an announce from the device occurs right when we are attempting to connect to it. I noticed i had setup timing in such a way so they lined up.
So i think the client registered for signals, then got the announce from the device (which added interfaces), then connected or something like that. That could be why the services list is different.
This race does seem to remain in the code. So maybe the client should not be listening to ServicesAdded/Removed at all. Only rely on the data it get from GetManagedObjects() on connect?
Can you share logs that show the exact sequence of events you are seeing (mainly interested in the D-Bus traffic)?
It could also be helpful to have bluetoothctl
running in a terminal while the problem happens.
It could also be helpful to have
bluetoothctl
running in a terminal while the problem happens.
I was able to get a reproducible case for this issue. See https://github.com/hbldh/bleak/issues/628 for such a log.
I'm also curious if the workaround suggested in https://github.com/hbldh/bleak/issues/628#issuecomment-904181977 fixes the issue without changing Bleak.
I'm not entirely sure there is anything wrong with supporting the removal of interfaces like this pull does.
Assuming we are seeing the same issue, I mostly agree. However, since there is no similar mechanism to remove individual services/characteristics/descriptors on other platforms, I think there may be a similar alternative that doesn't involve adding del_*
methods that will only apply to BlueZ.
This new suggestion is also motivated by the fact that GetManagedObjects
has a major performance impact on constrained systems (like routers) and will hopefully be changed in the future anyway.
In scanner.py
we basically make a cache of all D-Bus object properties received in a dictionary so that we don't have to call individual Get
or GetAll
to get the properties. We could do the same here (with the idea of eventually only calling GetManagedObjects
once globally for Bleak instead of each time we connect or start scanning - so if the implementations are mostly the same now, there will be less work to do later).
So instead of creating and modifying the BleakGATTServiceCollection
object when we call "GetManagedObjects"
and received the "InterfacedAdded"
, "InterfacesRemoved"
and "PropertiesChanged"
signals, we could just maintain a simple dictionary with the cached object paths and properties. Then when we receive ServicesResolved: True, we populate the BleakGATTServiceCollection
object all at once.
This would probably also reduce some duplicate code between https://github.com/hbldh/bleak/blob/a83283411729f7fa42a68f827efe3cf2285ae03a/bleak/backends/bluezdbus/client.py#L254-L286 and https://github.com/hbldh/bleak/blob/a83283411729f7fa42a68f827efe3cf2285ae03a/bleak/backends/bluezdbus/client.py#L1045-L1070.
As noted in https://github.com/bluez/bluez/issues/191#issuecomment-904712819, it seems that even if we take this fix, the new interfaced added by BlueZ will be wrong (assuming we are dealing with the same issue that triggers the D-Bus object removal). So, if that is the case, it might actually be better to not fix Bleak since as-is, users all get the same error message which they can google and find the solution (manually deleting the cache file). Whereas if we "fix" the issue in Bleak, users will likely get a variety of errors or strange behaviors and it will be much harder to find the solution.
[DEL] Characteristic (Handle 0x1dd1)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char000f
00002a29-0000-1000-8000-00805f9b34fb
Manufacturer Name String
[DEL] Characteristic (Handle 0x1dd1)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0011
00002a24-0000-1000-8000-00805f9b34fb
Model Number String
[DEL] Characteristic (Handle 0x1dd1)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0013
00002a26-0000-1000-8000-00805f9b34fb
Firmware Revision String
[DEL] Characteristic (Handle 0x1dd1)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0015
00002a28-0000-1000-8000-00805f9b34fb
Software Revision String
[DEL] Secondary Service (Handle 0x0027)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e
0000180a-0000-1000-8000-00805f9b34fb
Device Information
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001800-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001801-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 77a2bd49-1e5a-4961-bba1-21f34fa4bc7b
[NEW] Secondary Service (Handle 0xbef9)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e
0000180a-0000-1000-8000-00805f9b34fb
Device Information
[NEW] Characteristic (Handle 0x9149)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char000f
00002a29-0000-1000-8000-00805f9b34fb
Manufacturer Name String
[NEW] Characteristic (Handle 0x9149)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0011
00002a24-0000-1000-8000-00805f9b34fb
Model Number String
[NEW] Characteristic (Handle 0x9149)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0013
00002a26-0000-1000-8000-00805f9b34fb
Firmware Revision String
[NEW] Characteristic (Handle 0x9149)
/org/bluez/hci0/dev_CC_F9_57_96_4B_7B/service000e/char0015
00002a28-0000-1000-8000-00805f9b34fb
Software Revision String
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001800-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 00001801-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 0000180a-0000-1000-8000-00805f9b34fb
[CHG] Device CC:F9:57:96:4B:7B UUIDs: 77a2bd49-1e5a-4961-bba1-21f34fa4bc7b
[CHG] Device CC:F9:57:96:4B:7B ServicesResolved: yes
Hard to get those logs. I seem to get spammed with rssi updates :). Let me see if removing the cache drops the error.
The cache does not seem to include this address on this system.
Scratch that. Found the file. Missed the controller subdirectory. Removed the cache and indeed. It might actually solve the issue. Rather ugly workaround thou :)
The cache does not seem to include this address on this system.
Do you have
[GATT]
Cache = no
Set in /etc/bluetooth/*.conf
?
Also, the cache file is in a protected directory, so you won't see the file unless you are root.
Is the updated layout correct (handle numbers and UUIDS)? On my device, it had new firmware with additional characteristics. The old characteristics were deleted ([DEL]
messages), but only the same number of attributes were added back. The last handle was 0x0013 (/org/bluez/hci0/dev_90_84_2B_00_A8_8D/service0010/char0011/desc0013
) but should have been 0x0019 (/org/bluez/hci0/dev_90_84_2B_00_A8_8D/service0014/char0017/desc0019
).
Stupidly i forgot to grab copies of the file before i deleted it. I managed to get this screenshot. I'm just trying to see if i can spot a difference.
Issue re-occured. Let me see if i can grab the file now.
It feels like this devices might be doing weird announces from time to time. I've had it crash, and suddenly be announcing with a different name and different service set. Felt like some sort of fallback values.
I deleted the file. Connected, to the device and directly grabbed the cache. Waited a while for the announces to come in, and voila! Several additional attributes are suddenly added to the set.
So it seems it might be giving different data in announce and when connected?
So it seems it might be giving different data in announce and when connected?
When you say "announce" do you mean advertising data or the Service Changed characteristic?
I'm not sure what i mean to be honest. I think i mean advertising data. I'm not at all up to date on BLE protocols. I mean the services and characteristics found when there is only announces going on. But now that you question it, I think there is no known characteristics from just advertisements? They are only resolved once we connect right.
My device is on the limit of radio access. So i get these early disconnects when connecting that i opened the other issue on. Could bluez internally get a messed up cache when we get early disconnects?
The advertising data may contain a list of service UUIDs, but generally not characteristics. If you aren't using this, you could try setting
[GATT]
Cache = yes
in /etc/bluetooth/main.conf
to disable caching for devices that don't pair (also called trusted or bonded). There is a different bug in BlueZ that breaks the service UUIDs in advertising data when this setting is set to yes
or no
(default is always
), so this workaround isn't for everyone.
You can use Wireshark to capture Bluetooth traffic to see what exactly is going on. https://bleak.readthedocs.io/en/latest/troubleshooting.html#linux
In my case i could live without advertisement of services. It is detected on name. But i can't run wireshark here. It's an embedded system hassos for home assistant. Only linux system currently here. I suppose I could see if i can spin up something more standard somwhere to test out. Maybe i can try to grab logs on osx.
I think there are ways to grab the capture data from the command line (e.g. btmon
) and open it on another computer too. Running Bleak on Mac and capturing that would not be as good since it doesn't have BlueZ and would likely handle certain things differently.
Re-open if I am incorrect, but it seems like we agree that this change is not currently needed.
Well it might be. But the bug remains here. I need advertising data, so can't really use it if it's another bug. We still some fix for the exceptions thrown.
Oh sorry. It was just the advertisment handling of service uuids that had bugs. So maybe i could till get the manufacture data. But i still think we need some fix for the lib throwing exceptions.
FYI, some of the BlueZ bugs will be fixed in 5.62 when it is released if upgrading BlueZ is a possibility.
@dlech here is a btmon dump from the device: https://drive.google.com/file/d/12-APcJs8P_pMMniwKw_eHCeQtkV0YHAU/view?usp=sharing
Let me know when you have it, so i can remove the share. I was trying to understand the data, but it's too much of an unknown territory for me.
You can wireshark filter on this address to get the device: bthci_evt.bd_addr == cc:f9:57:96:4b:7b
We will also need D-Bus logs from the same time period to see what BlueZ is doing in response to the Bluetooth packets.
I don't see anything in the btmon logs that would trigger re-enumerating services. The services and characteristics are only enumerated once per connection and there are no Services Changed indications sent. So either this is the same bug with the BlueZ cache that I was seeing or maybe Bleak isn't properly clearing the services list when the device disconnects (but I thought we fixed that already a while ago)?
bthci_evt.bd_addr == cc:f9:57:96:4b:7b
FWIW, I had better luck with this filter: (!(bthci_evt.code == 0x3e)) && !(frame.protocols == "bluetooth:hci_mon")
Ps. I managed to temporary hack hassos (bind mounted a main.conf onto the readonly file system :)) to run with Cache=yes, and the error goes away. That doesn't feel like a solution thou.
That doesn't feel like a solution thou.
Actually, it should be a good workaround for the BlueZ bug, but I found when doing this, it triggered a different BlueZ bug that prevented services UUIDs from being returned when scanning the device for a second time. But, if you don't need service UUIDs from advertising data, then this should work.
run with Cache=yes, and the error goes away
Based on this information, it sounds like we are both seeing the same problem. So I am still of the opinion that avoiding the exception it causes would make it harder for people to find a solution.
It feels bad to cause exceptions that is just propagating out to asyncio event loop, when the function actually work when they are not thrown. Warning seem more appropriate. By just warning, things work as expected. I get exceptions in my logs once every announce from the device with full backtraces. So the log get's fully filled by just that.
The user of the library is unable to catch them since they occur straight in the asyncio executor.
By just warning, things work as expected.
In my case though, they didn't work as expected. BlueZ gave the old characteristics instead of the new ones. So suppressing the exception would cause strange errors later, like enabling notifications not working because it would in effect write to the wrong handle.
In your case, I think you are just lucky that the old and new are the same. But there is no way for Bleak to know that this is the case that I can see.
Also, did you try the workaround of manually removing the cache file? If the attribute layout on the device is not expected to change in the future, this may be a better workaround than Cache=yes
.
bluetoothctl -- remove XX:XX:XX:XX:XX:XX
# prior to BlueZ 5.62 you also need to manually delete the GATT cache
sudo rm "/var/lib/bluetooth/YY:YY:YY:YY:YY:YY/cache/XX:XX:XX:XX:XX:XX"
Won't help, the layout is different every time the device in question boots. It will always announce with one set on initial boot then change. Sometimes it crashes and revert to the initial layout.
This is some generic ble module from ublox which is configured with AT commands on a serial port. So at initial boot it is unconfigured. For some reason it seem to enable before getting an initial config.
Cache=yes works here. But... I will have hard time making that the default for the whole of home assistant.
Do the services in the initial setup include the GATT Database Hash characteristic (0x2B2A) or Service Changed characteristic (0x2A05)?
I had a look at the last log i shared with you. I can't find any database hash or service changed.
On the initial announce from the device where service data is just "passively" sent: we get this service identifier: 01d7e9014ff344e7838fe226b9e15624
Frame 33: 44 bytes on wire (352 bits), 44 bytes captured (352 bits)
Bluetooth
Bluetooth Linux Monitor Transport
Bluetooth HCI Event - LE Meta
Event Code: LE Meta (0x3e)
Parameter Total Length: 42
Sub Event: LE Advertising Report (0x02)
Num Reports: 1
Event Type: Connectable Undirected Advertising (0x00)
Peer Address Type: Public Device Address (0x00)
BD_ADDR: cc:f9:57:96:4b:7b (cc:f9:57:96:4b:7b)
Data Length: 30
Advertising Data
Flags
Tx Power Level
Slave Connection Interval Range: 30 - 50 msec
Connection Interval Min: 24 (30 msec)
Connection Interval Max: 40 (50 msec)
128-bit Service Class UUIDs
Length: 17
Type: 128-bit Service Class UUIDs (0x07)
Custom UUID: 01d7e9014ff344e7838fe226b9e15624 (Unknown)
RSSI (dB): -78
Somewhat later, it announces 7bbca44ff321a1bb61495a1e49bda277
Frame 210: 35 bytes on wire (280 bits), 35 bytes captured (280 bits) Bluetooth Bluetooth Linux Monitor Transport Bluetooth HCI Event - LE Meta Event Code: LE Meta (0x3e) Parameter Total Length: 33 Sub Event: LE Advertising Report (0x02) Num Reports: 1 Event Type: Connectable Undirected Advertising (0x00) Peer Address Type: Public Device Address (0x00) BD_ADDR: cc:f9:57:96:4b:7b (cc:f9:57:96:4b:7b) Data Length: 21 Advertising Data Flags 128-bit Service Class UUIDs Length: 17 Type: 128-bit Service Class UUIDs (0x07) Custom UUID: 7bbca44ff321a1bb61495a1e49bda277 (Unknown) RSSI (dB): -81
On connection later on, we get the same 7bbca44ff321a1bb61495a1e49bda277 as second announce above.
Frame 3000: 30 bytes on wire (240 bits), 30 bytes captured (240 bits)
Bluetooth
Bluetooth Linux Monitor Transport
[Adapter ID: 0]
[Opcode: ACL Rx Packet (0x0005)]
Bluetooth HCI ACL Packet
Bluetooth L2CAP Protocol
Bluetooth Attribute Protocol
Opcode: Read By Group Type Response (0x11)
Length: 20
Attribute Data, Handle: 0x0017, Group End Handle: 0xffff, UUID128: Unknown
Handle: 0x0017 (Unknown)
Group End Handle: 0xffff
UUID: 7bbca44ff321a1bb61495a1e49bda277
[UUID: GATT Primary Service Declaration (0x2800)]
[Request in Frame: 2998]
So i'm not sure how the cache behaves in bluez here. Maybe it never prunes the initial service, and on every connect starts up by saying, oh.. this service doesn't exist, let's ask to remove it. on every connect.
The services in the advertising data are independent of the services that are enumerated after connecting. The advertising data just sends the UUID and nothing else.
I can't find any database hash or service changed.
I looked again and it looks like there is a service changed attribute (handle 0x000d). BlueZ enables indications on this when the device connects.
Another way to see all services/characteristics is to connect to the device with nRF Connect on Android.
Looking at the packet log again, I can see that only handles up to and not including 0x0013 are enumerated when the device connects, so this must be the first set of initial attributes. Then later, handle 0x0019 is written, which must be the custom attribute that is added later when the device is reconfigured. So, I think the device should be indicating on the service changed characteristic to indicate to BlueZ that things changed.
Looking at this from a different angle, would it be possible to make the device not advertise until after the services/characteristics are reconfigured? This should avoid confusing BlueZ in the first place.
I wish.. but these are likely in the wild, and are very unlikely to get firmware updates (kitchen fans with no way short of debugger connection to update). I will forward it to the company though.
But why would it indicate service changed? the services likely changed while blues was not connected.
It could very well be that the services change before connecting. In the packet logs though, for some reason, it gives a different response to the same request at different times. I've highlighted it in the screenshot below.
The first time, it claims there are attributes 0x0001-0x0012 but the second time it says there are attributes 0x0001-0x0016.
I think this is what is triggering the removal and re-adding of D-Bus interfaces.
At this point, I feel like I understand what is going on well enough to make a decision. I think Cache=yes
would probably be the best "fix", but you mentioned this was difficult to implement (and it can trigger other BlueZ bugs), so I will consider merging #622 if that works for you.
And thanks for all of the feedback and persistence on this issue.
Yes #622 works fine. That way we keep indicating the oddness / bad behaviour of the device, but we don't bail out of additional message handling.
Yes, Cache=yes might be best for this specific device, but the ble client might have multiple other things active. Beacons, phones and what nots, and i would be unable to argue for making that the default on host level.
i had missed what you highlighted above. I will see if I can get contact with the SW devs, but my hopes are not too high. There is a middle man between the actual developers and me.
Support removal of interfaces (service, characteristics, descriptors) for bluez backends.