ukBaz / BLE_GATT

Python package for using BlueZ D-Bus API to create a Client
MIT License
29 stars 5 forks source link

unit.char_read causing code to crash #11

Closed stekicar closed 1 year ago

stekicar commented 1 year ago

Hello,

I am running code on RPi Zero W that reads battery status over bluetooth and it crashes frequently at bytes(ubit.char_read(CHAR_UUID)) Daemon log is:

Feb 16 23:46:56 tsc python3[803]: CPU temp: 39.008
Feb 16 23:46:56 tsc python3[803]: Traceback (most recent call last):
Feb 16 23:46:56 tsc python3[803]:   File "/home/pi/test/test_code.py", line 458, in <module>
Feb 16 23:46:56 tsc python3[803]:     byte_data = bytes(ubit.char_read(CHAR_UUID))
Feb 16 23:46:56 tsc python3[803]:   File "/home/pi/.local/lib/python3.9/site-packages/BLE_GATT/central.py", line 81, in char_read
Feb 16 23:46:56 tsc python3[803]:     return self.chrcs[uuid.casefold()].ReadValue({})
Feb 16 23:46:56 tsc python3[803]:   File "/home/pi/.local/lib/python3.9/site-packages/pydbus/proxy_method.py", line 72, in __call__
Feb 16 23:46:56 tsc python3[803]:     ret = instance._bus.con.call_sync(
Feb 16 23:46:56 tsc python3[803]: gi.repository.GLib.Error: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Not connected (36)
Feb 16 23:46:57 tsc systemd[1]: test.service: Main process exited, code=exited, status=1/FAILURE
Feb 16 23:46:57 tsc systemd[1]: test.service: Failed with result 'exit-code'.
Feb 16 23:46:57 tsc systemd[1]: test.service: Consumed 43min 41.408s CPU time.
Feb 16 23:46:58 tsc systemd[1]: test.service: Scheduled restart job, restart counter is at 2.
Feb 16 23:46:58 tsc systemd[1]: Stopped test/Test Code.
Feb 16 23:46:58 tsc systemd[1]: test.service: Consumed 43min 41.408s CPU time.
Feb 16 23:46:58 tsc systemd[1]: Started test/Test Code.

I keep connection on and read data every 30 seconds. What could be a cause and solution for this? Is it possible that bluetooth connection is lost and hence the failure? Thank you!

ukBaz commented 1 year ago

You are correct, there does appear to be a Not Connected error happening. With such a big gap between reads it might worth checking that the device is still connected before doing the read.

Here is an example of how to check Connected

$ python 
Python 3.9.2 (default, Mar 12 2021, 04:06:34) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import BLE_GATT
>>> ubit_address = 'E1:4B:6C:22:56:F0'
>>> ubit = BLE_GATT.Central(ubit_address)
>>> ubit.device.Connected
False
stekicar commented 1 year ago

Thank you for reply. I will test it with 'while ubit.device.Connected == False:' loop.

stekicar commented 1 year ago

I created a loop with:

 while ubit.device.Connected == False:
      try:
         ubit.connect()

and then I try to read:

try:
   byte_data = bytes(ubit.char_read(CHAR_UUID))

And it still crashes my code occasionally:

Mar 29 21:14:16 tsc python3[4353]:     byte_data = bytes(ubit.char_read(CHAR_UUID))
Mar 29 21:14:16 tsc python3[4353]:   File "/home/pi/.local/lib/python3.9/site-packages/BLE_GATT/central.py", line 81, in char_read
Mar 29 21:14:16 tsc python3[4353]:     return self.chrcs[uuid.casefold()].ReadValue({})
Mar 29 21:14:16 tsc python3[4353]:   File "/home/pi/.local/lib/python3.9/site-packages/pydbus/proxy_method.py", line 72, in __call__
Mar 29 21:14:16 tsc python3[4353]:     ret = instance._bus.con.call_sync(
Mar 29 21:14:16 tsc python3[4353]: gi.repository.GLib.Error: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Operation failed with ATT error: 0x0e (36)

Error is now different type. Any idea why?

Searching error code I found:

#define | ATT_ERR_UNLIKELY   0x0e

The attribute request that was requested has encountered an error that was very unlikely, and therefore could not be completed as requested.

ukBaz commented 1 year ago

There is not enough information in your question for me to attempt to reproduce this. If I were to speculate, it would be that you are getting a different error message because it is a different type of failure.

There are various tools and logs you can look at to get more debug information (depending where the issue is).

Have the following running in different terminals at the same time as your script is running might reveal if more detailed errors are being reported in them.

For high-level reports of BlueZ activity there is:

For the OS level interactions there is:

The D-Bus interactions are in:

Low-level Bluetooth stack:

Can you quantify "occasionally"?

Does your try/except block not capture the exception? If it does, can you do some checks to see what the state of the connection etc is at that point? This might give some clues as to what is going wrong.

stekicar commented 1 year ago

Thank you for reply! "occasionally" is totally random. It can happen within and hour after start, or once a day, or once a week. I have daemon restart code when it crashes.

journalctl -f -u bluetooth returns:

-- Journal begins at Wed 2022-09-21 20:23:41 EDT. --
Mar 17 18:50:11 tsc bluetoothd[406]: sap-server: Operation not permitted (1)
Mar 17 18:50:11 tsc bluetoothd[406]: Failed to set privacy: Rejected (0x0b)
-- Boot d27a976b544545c0b7693bd7925ce4a6 --
Mar 26 17:23:08 tsc systemd[1]: Starting Bluetooth service...
Mar 26 17:23:08 tsc bluetoothd[404]: Bluetooth daemon 5.55
Mar 26 17:23:08 tsc bluetoothd[404]: Starting SDP server
Mar 26 17:23:09 tsc systemd[1]: Started Bluetooth service.
Mar 26 17:23:09 tsc bluetoothd[404]: Bluetooth management interface 1.21 initialized
Mar 26 17:23:10 tsc bluetoothd[404]: profiles/sap/server.c:sap_server_register() Sap driver initialization failed.
Mar 26 17:23:10 tsc bluetoothd[404]: sap-server: Operation not permitted (1)
Mar 26 17:23:10 tsc bluetoothd[404]: Failed to set privacy: Rejected (0x0b)

I will check other options you mentioned too.

stekicar commented 1 year ago

I run parallel process and these are error I get:

1. INFO:root:01-04-2023 18:54:08 - Controller: readBT error raised: 'UUID 7a95ce01-0ea8-1bcc-71a2-fc7539b81c9c not found'
2. INFO:root:01-04-2023 18:54:36 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Software caused connection abort (36)
3. INFO:root:01-04-2023 18:57:07 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Software caused connection abort (36)
4. INFO:root:01-04-2023 18:58:45 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Software caused connection abort (36)
5. INFO:root:01-04-2023 19:05:41 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Software caused connection abort (36)
6. INFO:root:01-04-2023 19:09:03 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Disconnected early (36)
7. INFO:root:01-04-2023 19:09:48 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Software caused connection abort (36)
8. INFO:root:01-04-2023 19:10:43 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Software caused connection abort (36)
9. INFO:root:01-04-2023 19:11:00 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Disconnected early (36)
10. INFO:root:01-04-2023 19:12:21 - Controller: readBT error raised: g-io-error-quark: GDBus.Error:org.bluez.Error.Failed: Disconnected early (36)
ukBaz commented 1 year ago

I'm not sure what you are showing me here. What is writing these logging messages? Are these just the error message or are you getting a full stack trace? Is this from one run? Are there successful reads in between these timestamps?

Are you still staying connected and reading every 30 seconds what the battery level is? How do you check the connection has not been lost before you do the read? For example, the device could have gone out of range. Does the peripheral go in to sleep or power saving mode? I'm assuming 7a95ce01... is the battery characteristic... Why is that not found on the device

Those error messages aren't messages that I recognise so I can give you any great insight based on just them and I am not able to reproduce the issue so I'm not sure how I can help.

This appears to be something specific in your setup (or I don't have enough information to reproduce the problem). The messages suggest the connection is going in to some unexpected state. It is question of breaking down the various parts to see if you can identify which bit that is.

Couple of suggestions for things to try.... Don't stay connected, Have a simple script that connects, reads the value, and then disconnects. Have the script called by a cron job. Does that work?

btmon output can be read by wireshark. Maybe some analysis of the low level information will reveal some insight on how to narrow down where the issue is.

stekicar commented 1 year ago

I run code with connect, read values, disconnect in the loop and in another parallel process I was catching errors and saved these errors in log file. Above text show errors I was getting while running code. I was looking for values every 30 seconds so I was getting data from bluetooth unit in between with no problems. Some of these errors crash my code, some do not. I am using try: except: to catch error but it looks like ble_gatt do not pass these errors to python so it can avoid code crashing.

ukBaz commented 1 year ago

BLE_GATT is controlling BlueZ at a fairly high level. A bit more research suggests those errors are coming from low-level BlueZ which is why they are not being caught by Python try:except blocks.

I'm going to close this issue due to lack of activity and this being a BlueZ issue not BLE_GATT.