HA stops trying to reconnect to an ESPHome device that becomes "Unavailable" (is offline) for 19+ hours

xlfe commented 5 years ago

Home Assistant release with the issue: 0.87.0

Last working Home Assistant release (if known): N/A

Operating environment (Hass.io/Docker/Windows/etc.): Docker

Component/platform: ESPHome https://www.home-assistant.io/components/esphome/

Description of problem:

Device that has been offline for a while is no longer available when it comes back online - Home Assistant is no longer retrying the connection.

See the log below. Initially the connection is retried, but after (approx) 19 hours of trying, it no longer retries.

Rebooting HA fixes it and it is available again. Is this a problem with HA? I can't really grok https://github.com/home-assistant/home-assistant/blob/4a559cd4df7fec2e8cb6c0e629f30e0e985b2b94/homeassistant/components/esphome/__init__.py#L335 except it looks like it should keep retrying unless the component is no longer in hass.entity[DOMAIN] on https://github.com/home-assistant/home-assistant/blob/4a559cd4df7fec2e8cb6c0e629f30e0e985b2b94/homeassistant/components/esphome/__init__.py#L300

Problem-relevant configuration.yaml entries and (fill out even if it seems unimportant): N/A - node added automatically

esphomeyaml:
  name: temp_kitchen
  platform: ESP32
  board: lolin32

wifi:
  ssid: SSID
  password: PASSWORD
  manual_ip:
    static_ip: 10.1.1.109
    gateway: 10.1.1.1
    subnet: 255.255.255.0

logger:

api:

ota:

i2c:
  sda: 32
  scl: 31
  scan: False

sensor:
  - platform: bmp280
    temperature:
      oversampling: 16x
      name: kitchen_temp
    pressure:
      name: kitchen_pressure
    address: 0x76

Traceback (if applicable):

2019-02-10 10:27:24 DEBUG (MainThread) [aioesphomeapi.client] 10.1.1.109: Connecting to 10.1.1.109:6053 (('10.1.1.109', 6053))
2019-02-10 10:27:34 INFO (MainThread) [homeassistant.components.esphome] Can't connect to ESPHome API for 10.1.1.109: Error connecting to ('10.1.1.109', 6053): [Errno 113] Connect call failed ('10.1.1.109', 6053)
2019-02-10 10:27:34 INFO (MainThread) [homeassistant.components.esphome] Trying to reconnect in 2 seconds
...
2019-02-11 07:29:25 DEBUG (MainThread) [aioesphomeapi.client] 10.1.1.109: Connecting to 10.1.1.109:6053 (('10.1.1.109', 6053))
2019-02-11 07:29:28 INFO (MainThread) [homeassistant.components.esphome] Can't connect to ESPHome API for 10.1.1.109: Error connecting to ('10.1.1.109', 6053): [Errno 113] Connect call failed ('10.1.1.109', 6053)
2019-02-11 07:29:28 INFO (MainThread) [homeassistant.components.esphome] Trying to reconnect in 60 seconds
--
2019-02-11 07:30:28 DEBUG (MainThread) [aioesphomeapi.client] 10.1.1.109: Connecting to 10.1.1.109:6053 (('10.1.1.109', 6053))
2019-02-11 07:30:32 INFO (MainThread) [homeassistant.components.esphome] Can't connect to ESPHome API for 10.1.1.109: Error connecting to ('10.1.1.109', 6053): [Errno 113] Connect call failed ('10.1.1.109', 6053)
2019-02-11 07:30:32 INFO (MainThread) [homeassistant.components.esphome] Trying to reconnect in 60 seconds

2019-02-11 07:31:32 DEBUG (MainThread) [aioesphomeapi.client] 10.1.1.109: Connecting to 10.1.1.109:6053 (('10.1.1.109', 6053))
2019-02-11 07:31:35 INFO (MainThread) [homeassistant.components.esphome] Can't connect to ESPHome API for 10.1.1.109: Error connecting to ('10.1.1.109', 6053): [Errno 113] Connect call failed ('10.1.1.109', 6053)

Additional information:

Note the last Error message there is no "Trying to reconnect in 60 seconds" - this is the bug - why does it not continue to retry??

Ping @OttoWinter

OttoWinter commented 5 years ago

Yes I've been trying to get debug logs for this issue for quite some time now. I think it is a race condition somewhere (prob in aioesphomeapi), but I'm not sure.

One thing that seems weird to me is that "Can't connect to ESPHome API for 10.1.1.109: Error connecting to" would be the last message logged. If so, that would mean something really weird is going on. Can you see if there's maybe a Trying to reconnect in 60 seconds message right after?

OttoWinter commented 5 years ago

Oh just noticed one thing: When tries is great than 1208, the expression

https://github.com/home-assistant/home-assistant/blob/4a559cd4df7fec2e8cb6c0e629f30e0e985b2b94/homeassistant/components/esphome/__init__.py#L328

will yield

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Result too large')

However I think that would be logged in HA...

xlfe commented 5 years ago

Hmm that sounds like it might be it! 1208/60 = 20.33 hours which is approx how long before HA stopped trying to add the node (probably matches exactly as the first few retries are < 60 seconds)

amelchio commented 5 years ago

Fixed in #21014

home-assistant / core

HA stops trying to reconnect to an ESPHome device that becomes "Unavailable" (is offline) for 19+ hours #21005