gurumitts / pylutron-caseta

Apache License 2.0
153 stars 98 forks source link

Log filling up with exceptions from Lutron Caseta #126

Closed gwww closed 1 year ago

gwww commented 1 year ago

Moved from Home Assistant to here (ref: https://github.com/home-assistant/core/issues/80340)

This occurs when the Lutron Caseta integration starts up and the bridge is not available on the network (could be due to power, no ethernet, etc).

"Ideal" fix would be to catch the exception, log an error, and use some sort of back off / retry logic. I have implemented something similar in the ElkM1 library I wrote. https://github.com/gwww/elkm1/blob/main/elkm1_lib/connection.py#L60-L75

I can help if you wish. I can't make guarantees on when though.

Copy/Paste of the HA problem report below...

The problem

When my Lutron bridge goes offline there is a large stream of exceptions that get logged. On a low end system (rPi 4 with SD card) this is hard on the system.

The exception below is logged about every 3 seconds:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/pylutron_caseta/smartbridge.py", line 501, in _monitor_once
self._leap = await self._connect()
File "/usr/local/lib/python3.10/site-packages/pylutron_caseta/smartbridge.py", line 116, in _connect
res = await open_connection(
File "/usr/local/lib/python3.10/site-packages/pylutron_caseta/leap.py", line 182, in open_connection
reader, writer = await asyncio.open_connection(host, port, limit=limit, **kwds)
File "/usr/local/lib/python3.10/asyncio/streams.py", line 47, in open_connection
transport, _ = await loop.create_connection(
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1064, in create_connection
raise exceptions[0]
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1049, in create_connection
sock = await self._connect_sock(
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 960, in _connect_sock
await self.sock_connect(sock, address)
File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 500, in sock_connect
return await fut
File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 535, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
OSError: [Errno 113] Connect call failed ('192.168.xxx.xxx', 8081)
2022-10-11 16:37:33.581 WARNING (MainThread) [pylutron_caseta.smartbridge] Reconnecting...

What version of Home Assistant Core has the issue?

2022.10.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

lutron_caseta

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lutron_caseta/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

Catching the exception and logging an error would reduce the amount of noise in the logs by 90+%.

Backing off on retrying would be even better (but that's a feature request, so not part of this report).

danaues commented 1 year ago

Thanks for moving this issue over here.

danaues commented 1 year ago

The exception is supposed to be caught here, but it isn't.

https://github.com/gurumitts/pylutron-caseta/blob/a10efacbc56c4424ed7292604bb34c8946c7f998/pylutron_caseta/smartbridge.py#L497-L528

Maybe the async nature of this call is causing the exception to occur outside of the main loop

https://github.com/gurumitts/pylutron-caseta/blob/a10efacbc56c4424ed7292604bb34c8946c7f998/pylutron_caseta/smartbridge.py#L103

danaues commented 1 year ago

I'm able to catch the exception here no problem, but not sure how to handle it properly.

https://github.com/gurumitts/pylutron-caseta/blob/a10efacbc56c4424ed7292604bb34c8946c7f998/pylutron_caseta/leap.py#L178-L183

gwww commented 1 year ago

Would this work? https://github.com/gwww/elkm1/blob/main/elkm1_lib/connection.py#L66. You can ignore the retry logic and if you decide add the retry logic later.

gwww commented 1 year ago

I see that is more to it! Problem is that the code using create_tls is not error aware.

A library specific exception could be raised. The code using create_tls would have to catch that exception.

If backoff retry logic is implemented (and I believe that is the best solution) then no exception would be raised but the code would log the connection error and retry after a back off. Code using create_tls would still need to timeout and handle the timeout.

danaues commented 1 year ago

I've tried catching the exception in create_tls but it doesn't work. Something to do with the async nature. The exception occurs outside of the loop where we're watching for the exception.

Check out this commit: https://github.com/danaues/pylutron-caseta/commit/1ff14fe004cf092b8ec97e6366c97687d4cddeef

This works, but I'm not sure if it's the correct way to handle it. @mdonoughe am I on the right track here?

This is what my log looks like with this fix. It tries to connect 3 times, then skips and keeps trying in background. No exception raised, just a warning logged

2022-10-15 15:26:13.028 INFO (MainThread) [homeassistant.setup] Setting up switch
2022-10-15 15:26:13.029 INFO (MainThread) [homeassistant.setup] Setup of domain switch took 0.0 seconds
2022-10-15 15:26:13.032 INFO (MainThread) [homeassistant.components.switch] Setting up switch.lutron_caseta
2022-10-15 15:26:14.520 WARNING (MainThread) [pylutron_caseta.leap] Open Connection Failed - OSERROR
2022-10-15 15:26:27.576 WARNING (MainThread) [pylutron_caseta.leap] Open Connection Failed - OSERROR
2022-10-15 15:26:40.632 WARNING (MainThread) [pylutron_caseta.leap] Open Connection Failed - OSERROR
2022-10-15 15:26:46.456 INFO (MainThread) [pylutron_caseta.smartbridge] Processing Smartbridge.close() call
2022-10-15 15:26:46.456 WARNING (MainThread) [homeassistant.config_entries] Config entry '047b23ad' for lutron_caseta integration not ready yet: Timed out while trying to connect to 192.168.165.14; Retrying in background
2022-10-15 15:26:46.459 INFO (MainThread) [homeassistant.bootstrap] Home Assistant initialized in 36.14s
2022-10-15 15:26:46.459 INFO (MainThread) [homeassistant.core] Starting Home Assistant
2022-10-15 15:26:46.489 INFO (MainThread) [homeassistant.components.zeroconf] Starting Zeroconf broadcast

Then if I re-connect the bridge to the network, it's able to reconnect automatically

2022-10-15 15:27:05.398 INFO (SyncWorker_4) [homeassistant.loader] Loaded heos from homeassistant.components.heos
2022-10-15 15:27:13.547 DEBUG (MainThread) [homeassistant.components.lutron_caseta] Connected to Lutron Caseta bridge via LEAP at 192.168.165.14
2022-10-15 15:27:13.548 INFO (MainThread) [homeassistant.components.binary_sensor] Setting up binary_sensor.lutron_caseta
2022-10-15 15:27:13.548 INFO (MainThread) [homeassistant.components.cover] Setting up cover.lutron_caseta
2022-10-15 15:27:13.548 INFO (MainThread) [homeassistant.components.fan] Setting up fan.lutron_caseta
2022-10-15 15:27:13.549 INFO (MainThread) [homeassistant.components.light] Setting up light.lutron_caseta
2022-10-15 15:27:13.549 INFO (MainThread) [homeassistant.components.scene] Setting up scene.lutron_caseta
2022-10-15 15:27:13.549 INFO (MainThread) [homeassistant.components.switch] Setting up switch.lutron_caseta
2022-10-15 15:27:13.549 INFO (MainThread) [homeassistant.components.button] Setting up button.lutron_caseta
^C2022-10-15 15:27:52.080 INFO (Recorder) [homeassistant.components.recorder.pool.RecorderPool] Pool recreating
gwww commented 1 year ago

In your patch, I think that's the right spot to catch the exception. Seems to work well - I was not aware of the background connection checking handled by HA.

What do you think is missing?

danaues commented 1 year ago

I think there might be a larger issue to be solved around exception handling. This fix takes care of this one instance.

It doesn't seem like these exceptions will ever be caught, but I'll have to do some more testing. Maybe the initial connection is different than future disconnection/re-connection attempts.

https://github.com/gurumitts/pylutron-caseta/blob/a10efacbc56c4424ed7292604bb34c8946c7f998/pylutron_caseta/smartbridge.py#L519-L525

danaues commented 1 year ago

Also, I"m only catching OSError in leap.py open_connection() function.
Might have to watch for a few more as well, but maybe not.

mdonoughe commented 1 year ago

This works, but I'm not sure if it's the correct way to handle it. @mdonoughe am I on the right track here?

open_connection returning None instead of raising an exception is not right. There is already a retry loop which is catching the exception and retrying every two seconds as expected. The exception information is logged because it says _LOG.warning("Reconnecting...", exc_info=1). It'd be nice if Python didn't log the entire calling stack frame as part of the exception info.

mdonoughe commented 1 year ago

Actually, looking at the exception again, it is logging correctly. There's just an annoying number of stack frames from within asyncio.

danaues commented 1 year ago

Wow, you're right. It is catching the exception. I completely overlooked that due to the length of the warning message. Really looked like an unhandled exception.

Thanks!

gwww commented 1 year ago

What if the exceptions are caught in leap.py and logged with the type of the exception. Then the smartbridge.py exception handling can be changed up a bit.

So in leap.py

except ( 
     ValueError, 
     ConnectionError, 
     OSError, 
     asyncio.TimeoutError, 
) as exc: 
     _LOG.warning(f"Error connecting: {type(exc).__name__}")
    raise BridgeCantConnectError(...) # This is a new exception; define in __init__.py

Then in smartbridge.py:

except (
     ValueError, 
     ConnectionError, 
     OSError, 
     asyncio.TimeoutError, 
     BridgeDisconnectedError, 
) as exc: 
     _LOG.warning("Reconnecting...", exc_info=1)
     await asyncio.sleep(RECONNECT_DELAY)
# No changes in the above
except BridgeCantConnectError as exc:
     await asyncio.sleep(RECONNECT_DELAY)
    # No log message needed here

Anyway, the above is still not perfect. The problem is that the connect exceptions are intermixed with the run exceptions and I believe that a "correct" fix will mean addressing the mixing of those two concerns.

mdonoughe commented 1 year ago

All of the exceptions caught there should already be related to connection or protocol errors so we don't need a separate path to separate them. The leap module hands off the raw messages to the smartbridge module and errors cannot propagate back from smartbridge.