sde1000 / python-dali

Library for controlling DALI lighting systems
Other
150 stars 71 forks source link

Lunatone HID: occasional stuck communication, and `assert current_command == None` in `dali/driver/hid.py:592` #124

Open jktjkt opened 1 year ago

jktjkt commented 1 year ago

Every now and then my DALI connection (with Lunatone USB module) gets stuck; sometimes it's once a few days, sometimes it's just after a few minutes. When it failed for the last time, the bus traffic right before that was the following:

2023-07-24 06:56:57,367 dali: INFO: OccupancyEvent(short_address=46, data=EventData(movement=False, occupied=True, repeat=True, sensor_type='movement'))
2023-07-24 06:57:27,398 movement-koupelna-vana: INFO: Dimming automatically
2023-07-24 06:57:27,419 dali: INFO: DTR0(40)
2023-07-24 06:57:27,477 dali: INFO: SetExtendedFadeTime(<address (control gear) 19>)
2023-07-24 06:57:27,507 dali: INFO: DTR0(0)
2023-07-24 06:57:27,565 dali: INFO: SetFadeTime(<address (control gear) 19>)
2023-07-24 06:57:27,596 dali: INFO: ArcPower(<address (control gear) 19>,1)
2023-07-24 06:57:27,625 dali: INFO: DTR0(19)

I've added some async task debugging inspired by from python/cpython#91048 and some abuse of the traceback module, so that upon a SIGUSR1 I get a dump of asyncio.all_tasks and what they are waiting for. Apart from the usual, boring stuff, I saw this sequence of awaits:

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 201, in Room.run
    await self.on_event(event)
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 144, in Room.on_event
    await self.do_auto_dim()
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 115, in Room.do_auto_dim
    await self.activate_quick_fade()
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/svetla-dali-mqtt/svetla/movement.py", line 90, in Room.activate_quick_fade
    await self.dev.send(SetExtendedFadeTime(self.target))
    self = <Room at 0x7fad22c99050 name=koupelna-vana state=auto-dim timeout=8 ignore=False>

  File "/home/jkt/work/prog/python-dali/dali/driver/hid.py", line 203, in hid.send
    response = await self._send_raw(command)
    self = <dali.driver.hid.tridonic object at 0x7fad22c9ae50>

  File "/home/jkt/work/prog/python-dali/dali/driver/hid.py", line 444, in tridonic._send_raw
    await event.wait()
    self = <dali.driver.hid.tridonic object at 0x7fad22c9ae50>

  File "/nix/store/k43ppvy4jxbvg65gwihsn48c10nnaali-python3-3.11.4/lib/python3.11/asyncio/locks.py", line 213, in Event.wait
    await fut
    self = <asyncio.locks.Event object at 0x7fad22ad63d0 [unset, waiters:1]>

I still have no idea why that happens, but then I killed the program via Ctrl-C, and the following was printed:

2023-07-24 07:07:02,917 asyncio: ERROR: Task exception was never retrieved
future: <Task finished name='Task-2' coro=<tridonic._bus_watch() done, defined at /home/jkt/work/prog/python-dali/dali/driver/hid.py:477> exception=AssertionError()>
Traceback (most recent call last):
  File "/home/jkt/work/prog/python-dali/dali/driver/hid.py", line 592, in _bus_watch
    assert current_command == None
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Apparently, the _bus_watch_task is created, but never await-ed. That looks like a bug, but I'm not familiar enough with asyncio and its use from the context of the HID driver to fix this. I also do not know if this is related to my original issue with locking up. As far as I can tell, the asyncio.Event that the _send_raw is waiting on is never touched from the _bus_watch_task, but perhaps there's some subtle interaction that I'm not aware of. Or, maybe, if there's an exception in this task it somehow affects the IO loop? That's probably unlikely because it's the same IO loop which handles my SIGUSR1, and that one worked just fine and produced the trace.

Does that make sense, and are these two issues perhaps related to each other?

Would you like me to get more debugging logs from the driver so that (maybe) we will know what exactly happened before that assertion was thrown? (I'll have to RTFM for logger selectivity because it's very, very verbose.)