IKEA hub looses connection

fribse commented 3 years ago

The problem

HA looses connectin to the IKEA hub quite fast. The IKEA system itself is working, but after less than a day, HA can't talk to it. if I remove/add the integration, it works again. Sometimes it helps just rebooting the HA. I only have my blinds on the IKEA hub, as that doesn't work well in deconz, all my lother zigbee stuff is in deconz/phoscon.

Environment

Home Assistant Core release with the issue: 0.115.2
Last working Home Assistant Core release (if known): 0.114.x
Operating environment (OS/Container/Supervised/Core): Supervised on VM in pve
Integration causing this issue: IKEA
Link to integration documentation on our website: https://www.home-assistant.io/integrations/tradfri/
zigbee integration foreverything else than blinds: deconz/phoscon

Problem-relevant `configuration.yaml`

Traceback/Error logs

Additional information

probot-home-assistant[bot] commented 3 years ago

Hey there @ggravlingen, mind taking a look at this issue as its been labeled with an integration (tradfri) you are listed as a codeowner for? Thanks! _{^{(message by CodeOwnersMention)}}

jasondefuria commented 3 years ago

Yes! Same issue. I thought it was just me. Have to reboot Home Assistant to get it reconnected!

Mariusthvdb commented 3 years ago

Can't you 'reload' the integration?

jasondefuria commented 3 years ago

Reloading doesn't seem to fix the problem.

Logger: netdisco.mdns
Source: /usr/local/lib/python3.8/site-packages/netdisco/mdns.py:55
First occurred: September 25, 2020, 8:10:27 AM (999 occurrences)
Last logged: 8:46:03 AM

    Failed to add service .
    Failed to add service ocal.
    Failed to add service 8f5-9local.
    Failed to add service 8f5-21local.
    Failed to add service ,(V.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/netdisco/mdns.py", line 53, in _service_update
    service.add_service(zeroconf, service_type, name)
  File "/usr/local/lib/python3.8/site-packages/netdisco/discoverables/__init__.py", line 109, in add_service
    service = zconf.get_service_info(typ, name)
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 2423, in get_service_info
    info = ServiceInfo(type_, name)
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 1773, in __init__
    if not type_.endswith(service_type_name(name, allow_underscores=True)):
  File "/usr/local/lib/python3.8/site-packages/zeroconf/__init__.py", line 273, in service_type_name
    raise BadTypeInNameException("Type '%s' must end with '._tcp.local.' or '._udp.local.'" % type_)
zeroconf.BadTypeInNameException: Type '8d7af2b58f5-12local.' must end with '._tcp.local.' or '._udp.local.'

Logger: coap
Source: /usr/local/lib/python3.8/site-packages/aiocoap/transports/udp6.py:391
First occurred: 8:17:49 AM (1 occurrences)
Last logged: 8:17:49 AM
Connection loss was not expected.

Logger: homeassistant.components.tradfri.base_class
Source: components/tradfri/base_class.py:24
Integration: IKEA TRÅDFRI (documentation, issues)
First occurred: 8:12:21 AM (11 occurrences)
Last logged: 8:12:21 AM

    Unable to execute command <Command put ['15001', 65549]: {'3311': [{'5850': 0}]}>: ('There was an error with the request.', CredentialsMissingError('No suitable credentials for coaps://192.168.4.147:5684/15001/65549'))
    Unable to execute command <Command put ['15001', 65555]: {'15015': [{'5536': 5}]}>: ('There was an error with the request.', CredentialsMissingError('No suitable credentials for coaps://192.168.4.147:5684/15001/65555'))
    Unable to execute command <Command put ['15001', 65551]: {'3311': [{'5851': 5}]}>: ('There was an error with the request.', CredentialsMissingError('No suitable credentials for coaps://192.168.4.147:5684/15001/65551'))
    Unable to execute command <Command put ['15001', 65549]: {'3311': [{'5851': 5}]}>: ('There was an error with the request.', CredentialsMissingError('No suitable credentials for coaps://192.168.4.147:5684/15001/65549'))
    Unable to execute command <Command put ['15001', 65550]: {'3311': [{'5851': 5}]}>: ('There was an error with the request.', CredentialsMissingError('No suitable credentials for coaps://192.168.4.147:5684/15001/65550'))

Logger: pytradfri.api.aiocoap_api
Source: /usr/local/lib/python3.8/site-packages/pytradfri/api/aiocoap_api.py:150
First occurred: 8:12:21 AM (2 occurrences)
Last logged: 8:12:21 AM
    Protocol is shutdown, cancelling command: 192.168.4.147 <Command put ['15001', 65551]: {'3311': [{'5851': 5}]}>
    Protocol is shutdown, cancelling command: 192.168.4.147 <Command put ['15001', 65555]: {'15015': [{'5536': 5}]}>

fribse commented 3 years ago

Great work there @jasondefuria !

jasondefuria commented 3 years ago

Hoping that @ggravlingen can look into this for us.

fribse commented 3 years ago

I see it loosing the IKEA hub within 6 hours, which makes it difficult to temporarily circumvent it with a reboot.

iamthew4lrus789 commented 3 years ago

Experiencing this too. Reboot seems to fix, but works for x hours and stops again. 0.115.5

Logs show hundreds of lines for:

2020-09-29 17:18:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:18:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:19:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:19:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:20:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:20:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:21:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:21:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:22:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:22:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:23:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:23:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:24:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:24:41 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30 2020-09-29 17:25:11 WARNING (MainThread) [homeassistant.components.light] Updating tradfri light took longer than the scheduled update interval 0:00:30

jasondefuria commented 3 years ago

Can anyone assign this to @ggravlingen to look into?

ggravlingen commented 3 years ago

I’m following the thread, but unfortunately there is not much I can do. I’ve used the integration myself for a few years on a RPI3 and it runs like a Swiss watch so it’s quite likely something in your setup.

Are you guys sending heaps of commands to your Tradfri devices? That’s been known to cause issues in the past.

jasondefuria commented 3 years ago

@fribse / @iamthew4lrus789 Are you running on RPI? I am running mine in a vm, wondering if that is also part of the problem.

jasondefuria commented 3 years ago

Not sure if this issue causes this: https://github.com/home-assistant/core/issues/40410, or the other way around. I have a sneaking suspicion that zeroconf cause this issue.

iamthew4lrus789 commented 3 years ago

@fribse / @iamthew4lrus789 Are you running on RPI? I am running mine in a vm, wondering if that is also part of the problem.

Hassio running in a VM (KVM). Had the same issue with it in Docker so set up the VM with the same result.

Both Docker and KVM sitting on top of an Unraid server.

ggravlingen commented 3 years ago

I have seen another issue here where there was a discussion around network settings and other config stuff. You might want to dig that one up and try the solution. (Might also have been a post in the forums.)

jasondefuria commented 3 years ago

@iamthew4lrus789 Are you seeing the zeroconf issues that @fribse and I are also seeing, prior to the tradfri issue?

fribse commented 3 years ago

I'm running the HA in PVE VM, running on a Intel NUC i5. I don't run a lot of commands to the IKEA hub, the only items I have on it are two of my IKEA blinds.

fribse commented 3 years ago

I have seen another issue here where there was a discussion around network settings and other config stuff. You might want to dig that one up and try the solution. (Might also have been a post in the forums.)

From that description that's a really tuff dig :-)

realjax commented 3 years ago

From that description that's a really tuff dig :-)

Maybe (the end) of this thread https://github.com/home-assistant/core/issues/14386#issuecomment-628844113 was meant ?

fribse commented 3 years ago

Could be, but that discusses docker. I've installed the HA with whiskerz007's script: https://github.com/whiskerz007/proxmox_hassos_install So it's in a VM

realjax commented 3 years ago

When I had it running in a VM I had the same problems, then turned to a docker based install and still had those problems. I never tried the macvlan solution discussed there because the docker setup was unclear to me. Only when I moved to a dedicated hassOS install on a raspberry pi4, my tradfri issues went away. Mostly then. It is still iffy, but far from as unstable as it was in the other setups.

fribse commented 3 years ago

I've used RPi4 before, with a SSD on it and all, it was simply too slow, and not stable enough generally, I've been running it as a VM for quite a long time now and it is rock-solid, but with the latest updates it has gone bad with the IKEA hub, and the luftdaten.info

jasondefuria commented 3 years ago

Okay, it looks like for me the issue may have been being in a VM, and with the network changes, there was some confusion in the VM. I shutdown the VM completely is esxi, and I have had no more weird zeroconf or tradfri issues, and everything is now working as expected.

fribse commented 3 years ago

I've tried doing a complete reboot of the pve host, no change.

realjax commented 3 years ago

I'm now having similar problems running hassio on a pi4. It seems to have worsened over the last weeks, not sure what caused it. Not seeing any error or warnings in the logs though.

thiemok commented 3 years ago

I have a similar problem. My network setup is a little more complex, involving home-assistant in a container and the tradfri gateway being on a different subnet than the container host. It also can't be overloading of the gateway, because I only have two bulbs connect with > 1h between updates to the bulbs.

I dug a bit through #14386 and parts of the code of this integration. The investigation in #14386 paints a pretty plausible picture of the UDP connection between home-assistant and the gateway being cut because of inactivity and the integration never properly reestablishing a cut connection. I went with a workaround mentioned in that issue and created an automation which updates the state of one of the bulbs every minute. So far this has helped very well and my gateway remains reachable from home-assistant.

automation:
  - alias: 'tradfri_keep_alive'
    trigger:
    - minutes: /1
      platform: time_pattern
    action:
    - service_template: light.turn_{{ states('light.tradfri_panel') }}
      entity_id: light.tradfri_panel

Now having said that I'm sill wondering if there isn't something the integration can do better here to avoid such workarounds and to handle dropped connections to the gateway more resiliently. Especially because we're talking about a UDP connection with potentially long periods between messages on top of the inherent unreliability of UDP connections. I'm willing to invest some time to improve the integration in this regard, but I'm lacking a lot of context on how the integration and communication with the gateway is supposed to work so maybe @ggravlingen is willing to come up with an approach to this problem with me?

Two options that come to mind are:

some for of periodic keep alive request to the gateway. Similar to what the above automation is doing, but on a lower level an ideally not involving additional devices like bulbs.
Better recovery upon detection of connection loss, like attempting to reestablishing observation on device data and so on. I'm not sure how much of this is done currently, it did not look like something like that was done at all. However I might just be missing it.

Please correct my if I'm wrong about any assumption here, I only had limited time to investigate the existing code and likely missed something

fribse commented 3 years ago

It fits quite well with my experience as I've moved more and more away from the IKEA hub, this it talks much less to it, I only have the blinds on the IKEA hub, and they are maybe operated twice a day. All my light is moved to deconz, and away from the IKEA. Brilliant input @thiemok ! It sounds like a clue.

tpihl commented 3 years ago

I had the same experience. And now I don't. What changed?

I turned off Avahi in pfSense.

Maybe that, or similar, may work for someone else

/T

jasondefuria commented 3 years ago

@tpihl I honestly did the same thing two hours ago- and no issues in two hours. Avahi appears to have been the culprit for me.

thiemok commented 3 years ago

Turning off avahi or messing with other low level network stuff should however not be an acceptable solution to the problem.

That is assuming you did not have a misconfigured avahi setup, which I don't think is the case

jasondefuria commented 3 years ago

@thiemok I actually think it is a misconfiguration between avahi, homekit bridge on Hassio, and zeroconf.

Avahi was (more like became) redundant in my setup, so I am confirming that its removal has fixed it for me, BUT you are correct, it should be further analyzed for what changes in Home Assistant suddenly caused this to become an issue.

thiemok commented 3 years ago

I'm not convinced this problem just surfaced recently in all setups, the issue I linked is about two years old and talks about the same thing.

One of my suspicions is that is that connections to the tradfri gateway are not fault tolerant at all. I quickly tried to support this by powercycling my gateway, which caused the tradfri integration to fail and only reconnect after I reloaded it. Maybe you can try the same to verify its not just my setup that is causing this behaviour @jasondefuria

jasondefuria commented 3 years ago

@thiemok I will verify this on current stable when I get home, but I suspect you are correct from what you said in the past- home assistant requires a restart (or reload of Tradfri) to get it to talk again.

Perhaps there is a way, instead of your method of essentially udp-keepalives, to reload the integration if not reachable?

thiemok commented 3 years ago

Yes, I think resetting the connection is the first thing to do. I also think that there might be a few different issues all essentialy resulting in a dropped connection.

However i think a keep alive or maybe better called hearbeat might still be required, because the device state is tracked through observation from what I could see. (I might be wrong there) Without a heartbeat a disconnection is only detected when homeassistant tries to do an update.

fribse commented 3 years ago

But mdns shouldn't cause this as far as I can see, I have Unifi Dream Machine, and I've had mDNS running for 8 months, it's only the last few weeks I've seen this problem creap up. Disabling it will criple the esphome, so I'm not for that.

iamthew4lrus789 commented 3 years ago

@iamthew4lrus789 Are you seeing the zeroconf issues that @fribse and I are also seeing, prior to the tradfri issue?

@jasondefuria Not seeing any zeroconf errors in my log, no. Apologies for slow response.

realjax commented 3 years ago

automation:
  - alias: 'tradfri_keep_alive'
    trigger:
    - minutes: /1
      platform: time_pattern
    action:
    - service_template: light.turn_{{ states('light.tradfri_panel') }}
      entity_id: light.tradfri_panel

This did fix my tradfri connection problems on a hassio install. I am now running a docker variant to see if it fixes the problems I had there as well.

Edit: I forgot how poor things run in a docker environment. Too many problems to fix first before being able to properly test this. I did see an error regarding this automation though, not sure what it was exactly but I think it couldnt find the entity light.tradfri_panel in my docker setup.

fribse commented 3 years ago

Turning back to the zeroconf, it complains about the formatting of mdns entries. Mine looks like this:

A lot of these devices haven't changed. I can't currently implement the workaround, as all my light is on deconz, but I'll try and move a light from deconz and to the IKA hub,

fribse commented 3 years ago

Ok, so I've added a new LED driver to the IKEA hub in preparation for halloween, I'm going to drive a UV light through that, I hope it works with the workaround.

darksid3r commented 3 years ago

Similar problem here: Supervised Hassio installation on a Raspberry Pi 4. Everything is up to date. Devices connected to the Tradfri HUB continue to appear as available but cannot be controlled anymore, nor there is status update on these devices: E.g. Tradfri light shown as off, click the switch on the UI to turn it on, it turns on in the UI (but not in reality) and back off in UI after a couple of seconds. Restarting HA or reloading the Tradfri integration solves the issue for a few hours.

realjax commented 3 years ago

I agree this needs some attention. The workaround provided by @thiemok works but has it own problems, like stopping after midnight or randomly even..

darksid3r commented 3 years ago

But mdns shouldn't cause this as far as I can see, I have Unifi Dream Machine, and I've had mDNS running for 8 months, it's only the last few weeks I've seen this problem creap up. Disabling it will criple the esphome, so I'm not for that.

@fribse It seems that, at least in my case, mDNS is the issue. I use Unifi SGW as router for my network. A couple of weeks ago I enabled the mDNS reflector on the SGW to ensure broadcasts are replicated across subnets (I have a separate VLAN for IoT devices). Initially I did not think this would be related, but the Tradfri Integration would stop responding after a few hours. Checking the log, I saw a lot of Zeroconf errors which I did not initially relate to the Tradfri integration issue. However, yesterday I've disabled mDNS on SGW and had no more Zeroconf errors and Tradfri integration working flawlessly for more than 24h.

fribse commented 3 years ago

I've been running with a UDM Pro with mDNS setup (I also have IOT and NOT networks), it has worked for 8 months (I had a USG Pro before that). It's only in the latest versions it has started failing. As I see it mDNS is not the cause, but a symptom of a problem. The zeroconf errors also indicate that for some reason it requires some specific mDNS keys, which are not present (and never was).

darksid3r commented 3 years ago

It could very well be the case. However, the Zeroconf errors seem to not be related with the Tradfri integration in particular, but to HA in general. I wonder if HA has a problem that needs fixing to which thr Tradfri integration is particularly "sensitive" to...

fribse commented 3 years ago

The odd thing is that ESPHome works (it relies quite heavily on mDNS for devices without static IP).

samuelbenzaquen commented 3 years ago

Having the same issue, reloading the integration work. I know it's not a long term solution but is there an integration reload service that I could automate to run every couple of hour?

fribse commented 3 years ago

It's going to be interesting to see if the #41784 also fixes this

kpmkpm commented 3 years ago

Also got this error, for couple of days now on my 116.x installation (HA core). Any chance it will be fixed or should I just move the rest of my +50 ikea bulbs to zigbee2mqtt? Somehow my older HA core installation (still 115.6) is working. Edit: the 115.6 seems to be broken too.

I had similar problem back in 0.9x and got it fixed - if I remember right, by disabling the firewall totally on my when CentOS. It was some udp ports which was used in communication between HA and Ikea hub. However, did not work with my current ubuntu 20 installation.

Just asking, if someone is investigating this, if not I see no other way to just transfer everything to z2m.

Edit: I made some changes on my LAN before these issues started again. I used to have DHCP run on my TP-link Deco M9 Mesh network, which both HA cores (ubuntu servers) and Ikea GW were connected. Because the Deco M9 DHCP only supports limited (64?) "static" dhcp addresses I changed my network to get dhcp directly from TP Link R470T+ Load Balance router and changed the Deco M9 mesh into acces point mode. However the Ikea GW and Ubuntu server are currently wired to the same Deco M9 and the network traffic should go directly from one ethernet port of the deco m9 to the other.

Edit: After changing my R470T to use IGMP v3 instead of v2 and also turned the IGMP snooping on the integration has been stable and still work after 9 hours. I also turned on the application optimized balancing but that should not affect since it is used to wan connections only.

Edit: Unfortunately, the symptoms are back, so the IGMP parameters seems not to correct this, as I thought. So, as for now, I think the best I can do is to transfer everything from Ikea GW to zigbee2mqtt...

fribse commented 3 years ago

I've applied 0.116.4 earlier today, and the IKEA still works, and I havn't see any problems with mDNS entries or zeroconf in the logs yet.

fribse commented 3 years ago

It has worked for 24 hrs now, the 0.116.4 has solved this!

fribse commented 3 years ago

Sadly the problem has reappeared, I've lost the IKEA hub, but I don't see any zeroconf errors in the logs now.

home-assistant / core