home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
74.21k stars 31.16k forks source link

Screenlogic can block startup because discovery does not timeout #49129

Closed bdraco closed 3 years ago

bdraco commented 3 years ago

The problem

Thread 341 (idle): "SyncWorker_1"
    discover (screenlogicpy/discovery.py:30)
    run (concurrent/futures/thread.py:57)
    _worker (concurrent/futures/thread.py:80)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
Screen Shot 2021-04-12 at 10 11 42 AM

What is version of Home Assistant Core has the issue?

dev

What was the last working version of Home Assistant Core?

n/a

What type of installation are you running?

Home Assistant Container

Integration causing the issue

screenlogic

Link to integration documentation on our website

https://www.home-assistant.io/integrations/screenlogic

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2021-04-12 20:11:03 DEBUG (MainThread) [homeassistant.bootstrap] Running timeout Zones: {'screenlogic': <screenlogic: 1 / 0>}
2021-04-12 20:11:03 DEBUG (MainThread) [homeassistant.bootstrap] Integration remaining: {'screenlogic': 185.628832}
2021-04-12 20:11:04 DEBUG (MainThread) [homeassistant.bootstrap] Running timeout Zones: {'screenlogic': <screenlogic: 1 / 0>}
2021-04-12 20:11:04 DEBUG (MainThread) [homeassistant.bootstrap] Integration remaining: {'screenlogic': 186.630341}
2021-04-12 20:11:05 DEBUG (MainThread) [homeassistant.bootstrap] Running timeout Zones: {'screenlogic': <screenlogic: 1 / 0>}
2021-04-12 20:11:05 DEBUG (MainThread) [homeassistant.bootstrap] Integration remaining: {'screenlogic': 187.632289}
2021-04-12 20:11:06 DEBUG (MainThread) [homeassistant.bootstrap] Running timeout Zones: {'screenlogic': <screenlogic: 1 / 0>}
2021-04-12 20:11:06 DEBUG (MainThread) [homeassistant.bootstrap] Integration remaining: {'screenlogic': 188.633663}
Thread 341 (idle): "SyncWorker_1"
    discover (screenlogicpy/discovery.py:30)
    run (concurrent/futures/thread.py:57)
    _worker (concurrent/futures/thread.py:80)
    run (threading.py:870)
    _bootstrap_inner (threading.py:932)
    _bootstrap (threading.py:890)
bdraco commented 3 years ago

cc @dieselrabbit

dieselrabbit commented 3 years ago

@bdraco I'm unable to repro this. Do you have steps? I tried to force it by disabling my ethernet adapter, but it failed gracefully.

This one is odd for me because screenlogicpy.discovery.discover() does set a timeout on the socket, and that bit of code hasn't changed since multi-gateway discovery support was added (prior to this integration).

Could this be related to changes in core? Is there something that needs to change integration-side?

2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.bootstrap] Running timeout Zones: {'screenlogic': <screenlogic: 1 / 0>}
2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.bootstrap] Integration remaining: {'screenlogic': 0.910826}
2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.components.screenlogic.config_flow] Discovered hosts: []
2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.components.screenlogic.config_flow] Discovered gateways: {}
2021-04-13 06:07:12 INFO (MainThread) [homeassistant.setup] Setup of domain screenlogic took 1.0 seconds
2021-04-13 06:07:12 WARNING (MainThread) [homeassistant.components.screenlogic] Gateway rediscovery failed
2021-04-13 06:07:12 ERROR (MainThread) [homeassistant.components.screenlogic] Error while connecting to the gateway {'name': 'Pentair: xx-xx-xx', 'ip': '192.168.xx.xx', 'port': 80}: Unable to connect to 192.168.xx.xx:80
2021-04-13 06:07:12 WARNING (MainThread) [homeassistant.config_entries] Config entry 'Pentair: xx-xx-xx' for screenlogic integration not ready yet: Unable to connect to 192.168.xx.xx:80; Retrying in background
2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event component_loaded[L]: component=screenlogic>
2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.bootstrap] Integration setup times: {'search': 0.000695, 'logger': 0.000885, 'ssdp': 0.000988, 'system_log': 0.00106, 'onboarding': 0.003751, 'frontend': 0.004812, 'person': 0.00783, 'binary_sensor': 0.010662, 'http': 0.011264, 'sun': 0.016065, 'analytics': 0.063443, 'automation': 0.110041, 'script': 0.136037, 'trace': 0.13852, 'zeroconf': 0.141465, 'recorder': 0.163198, 'dhcp': 0.164525, 'auth': 0.197194, 'webhook': 0.265481, 'image': 0.283718, 'tts': 0.38509, 'history': 0.438771, 'timer': 0.452726, 'map': 0.455878, 'blueprint': 0.495363, 'my': 0.500566, 'logbook': 0.597169, 'homeassistant': 0.624956, 'input_boolean': 0.705394, 'scene': 0.725341, 'counter': 0.727556, 'group': 0.733134, 'system_health': 0.740367, 'tag': 0.760273, 'updater': 0.770205, 'media_source': 0.771501, 'input_number': 0.8186, 'input_text': 0.862992, 'input_select': 0.949579, 'zone': 0.953362, 'input_datetime': 0.970986, 'screenlogic': 1.005644, 'websocket_api': 1.377772, 'api': 1.3819, 'device_automation': 1.413299, 'config': 1.638879, 'lovelace': 1.756734}
2021-04-13 06:07:12 DEBUG (MainThread) [homeassistant.bootstrap] Waiting for startup to wrap up
2021-04-13 06:07:12 INFO (MainThread) [homeassistant.bootstrap] Home Assistant initialized in 5.04s
2021-04-13 06:07:12 INFO (MainThread) [homeassistant.core] Starting Home Assistant
bdraco commented 3 years ago

Sadly, repro was only achieved when I was working on the shutdown code and restarting 100s of times.

From the stack, its definitely getting stuck at https://github.com/dieselrabbit/screenlogicpy/blob/master/screenlogicpy/discovery.py#L30

I re-read the code many times and I can't see how it isn't timing out with the udpSock.settimeout(1)

It seems easier to rewrite discovery to use create_datagram_endpoint (https://docs.python.org/3/library/asyncio-eventloop.html#asyncio.loop.create_datagram_endpoint) and be asyncio to avoid the need to set a timeout on the socket then try to replicate/fix it given the hard to replicate steps

bdraco commented 3 years ago

@dieselrabbit I can take a stab at writing an asyncio async_discover if you want.

dieselrabbit commented 3 years ago

@bdraco If you've got the time and inclination, that would be very much appreciated.

I've been trying to wrap my head around asyncio, but had some other changes in-flight that I wanted to get finished first. That said, I think it would still take me some time to even start to implement it.

Was also wondering if it was possible/advisable to further implement asyncio to better utilize the shared socket connection, maybe even going so far as keeping the socket connection alive while in use. Would be interested in your thoughts on that as well.

bdraco commented 3 years ago

@bdraco If you've got the time and inclination, that would be very much appreciated.

I'll take a stab at it if I can do it quickly.

I've been trying to wrap my head around asyncio, but had some other changes in-flight that I wanted to get finished first. That said, I think it would still take me some time to even start to implement it.

Was also wondering if it was possible/advisable to further implement asyncio to better utilize the shared socket connection, maybe even going so far as keeping the socket connection alive while in use. Would be interested in your thoughts on that as well.

Switching to asyncio would make a lot of sense here especially if we can keep the connection open and feed back the results as they come in. It looks like you already have the data processing segmented well in the library so switching out the actual I/O shouldn't be too bad.

bdraco commented 3 years ago

I implemented it in asyncio here https://github.com/dieselrabbit/screenlogicpy/pull/11