home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.66k stars 29.95k forks source link

KNX entities switching constantly from available to unavailable #59170

Closed arcano81 closed 2 years ago

arcano81 commented 2 years ago

The problem

After upgrading to HA last version 2021.11.1 the KNX entities continue switching from availble to Not avalable image

What version of Home Assistant Core has the issue?

2021.11.1

What was the last working version of Home Assistant Core?

2021.10.9

What type of installation are you running?

Home Assistant OS

Integration causing the issue

KNX

Link to integration documentation on our website

https://www.home-assistant.io/integrations/knx/

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2021-11-05 15:28:01 WARNING (MainThread) [xknx.log] Received DisconnectRequest from tunnelling sever.
2021-11-05 15:28:06 WARNING (MainThread) [xknx.log] Heartbeat to KNX bus failed. No active communication channel.

Additional information

No response

farmio commented 2 years ago

@josiasmontag Don't know what causes this, but every Frame is listed twice in your capture. It's most probably some capturing configuration issue so I'll ignore this 🤷‍♂️

The first disconnect is different, the other 2 seem also to be caused by a blocked loop. This picture is showing the 2nd disconnect.

Bildschirmfoto 2022-01-10 um 11 44 58

You can see the Telegram 33 was ACKed in 0.000696 seconds. Telegram 34 didn't get an ACK in time (~1 second) and is repeated by the KNX interface. After an additional second the KNX interface sends a DisconnectRequest - as it should. After additional 0.2 seconds xknx is unblocked again and works its queue - it ACKs both Telegrams and answers the DisconnectRequest (which both is pointless since we don't have a connection anymore anyway).

What happened between 33 and 34 remains a mystery 🤷

Julius2342 commented 2 years ago

I see the same problem - interestingly only with one device (kitchen window constantly switching from off to unavailable and back).

josiasmontag commented 2 years ago

Just captured a disconnect with the KNX-only HA instance. It looks different:

image

knx-only-instance.pcapng.zip

farmio commented 2 years ago

You could try to checkout and run https://github.com/home-assistant/core/pull/63766 and see if it logs anything. This should find integrations that block the event loop by using time.sleep (my assumption is that some integrations lib does this from a discovery flow or something like that - I could be very wrong though).

@josiasmontag Yes it looks like the first disconnect of the other capture. I have no idea why this occurs.

I'd try to find out why you see all packets twice.

farmio commented 2 years ago

https://github.com/XKNX/xknx/pull/847 would also be worth a try - its a more or less experimental version of xknx which puts its connection (acking etc.) logic in a separate thread. It works fine on my system (but then again the current release version also works fine here 🤷).

arcano81 commented 2 years ago

XKNX/xknx#847 would also be worth a try - its a more or less experimental version of xknx which puts its connection (acking etc.) logic in a separate thread. It works fine on my system (but then again the current release version also works fine here 🤷).

This is a great news! I would like to try it on my test and production system. @farmio Can you explain me how to install the experimental version?

farmio commented 2 years ago

@arcano81 see https://xknx.io/home_assistant.html#running-ha-with-local-xknx-library It should be enough to git clone, checkout the PR and set the environment variable.

It will probably only work with a venv installation - Container will probably be more complicated and HaOS I have no idea.

farmio commented 2 years ago

63766 was just merged to dev. It's easier to just try current HA dev branch and see if that logs any warnings.

See https://developers.home-assistant.io/docs/development_environment

farmio commented 2 years ago

Small update for the experimental xknx version: We decided to ship this, so we disabled threading by default again. If you want to try manually set threaded=True in your local clone here https://github.com/XKNX/xknx/blob/main/xknx/io/connection.py

arcano81 commented 2 years ago

I switched my production installation to dev and updated the Core and the Supervisor. HA Update The system is up and running since 10 hours and received 7 Disconnect Request from tunneling server but no warning for usage of time.sleep blocking the main event loop in the register.

farmio commented 2 years ago

Hm... it is not included in 20220111 - was merged after that build - it should be in todays build though.

That said there are other blocking calls than time.sleep which are not detected by this - it's just picking low-hanging fruit.

mag2bue commented 2 years ago

Hello all,

since I upgraded my NAS with RAM, the KNX interface is running stable.

-> my NAS no longer needs virtual memory -> when taking a snapshot from my VM I still lose the connection -> If you have a NAS, do not use the shared memory.

Stable with HA 2021.12.9 and 2021.12.10

Julius2342 commented 2 years ago

My problem disappeared when commenting our DWD app. (They were blocking the event loop, causing disconnects which then made some entities to be flaky.)

farmio commented 2 years ago

So a new beta has started yesterday. Please try it and see if the connection problems with KNX disappear - or if there are any log messages regarding "timer_out_of_sync" (homeassistant.core debug) or similar.

arcano81 commented 2 years ago

Great news !Thankyou. How can I get this beta version for a try?

farmio commented 2 years ago

That depends on your installation method. A quick search at Google or HA community forum will tell you.

arcano81 commented 2 years ago

I updated my production instance of HA to last beta versione (2022.2.0b3). The problem with knx disconnection became even worse than before.With this beta version diconnections happens every 2/3 seconds. I attach here the log from HA. Log KNX beta version (2022.2.0b3).log

marvin-w commented 2 years ago

@arcano81: Can you please add the KNX diagnostics to the issue? You can download it on the integrations page after clicking on the three dots on the KNX integration.

Also, please send another log with

debugpy:

active in your configuration.yaml and add debug logging also to homeassistant.core please.

farmio commented 2 years ago

@arcano81 thanks for these logs, we found a bug in the route_back logic. Try disabling route_back - you probably don't need it as your HA device and interface are on the same network anyway.

arcano81 commented 2 years ago

@farmio Disabling the route back solved the problem of disconnection every 2/3 seconds. Still present the usual behaviour with random diconnections. @marvin-w I attach here the KNX diagnostics config_entry-knx-e0f52016e11797923387412866023651.zip and here the HA log with debugpy and homeassistant.core. (The logs start from bootstrap of HA and ends 10 minutes after KNX tunneling disconnection). You can find a disconnection @ 2022-01-30 22:44:46 Ha Log (2022.2.0b3).zip

Last questions: How do I use the individual address in KNX configuration?Which address do I have to use? image Thanx

farmio commented 2 years ago

@arcano81

How do I use the individual address in KNX configuration?Which address do I have to use?

You don't need to configure any address here. A tunnelling server should assign an address for a client at connection. Some (rare) don't, yours does: "current_address": "1.0.6" If you would be using routing you'd need to set it.

What is this entity: entity_id=sensor.ntc_temperature_01 which integration does trigger it and do you have any automations or anything using it? I don't know if this is coincidence, but it changes state near every timer_out_of_sync event or DisconnectRequest.

It doesn't seem debugpy is activated in these logs. It is not a logger but an integration, you'd need to add debugpy: to your configuration.yaml - see https://www.home-assistant.io/integrations/debugpy/

arcano81 commented 2 years ago

The entity_id=sensor.ntc_temperature_01 comes frome ESPHome integration. It's a temperature sensor to acquire data from the hot water boiler. It's not used in any automation, just for showing up its value in the dashboard. I correctly added debugpy:in configuration.yaml and restarted... I'll let you have some new logs tomorrow ! Thankyou

arcano81 commented 2 years ago

@farmio 2022-01-31 00:29:39 WARNING (MainThread) [homeassistant.util.async_] Detected blocking call inside the event loop. This is causing stability issues. Please report issue for debugpy doing blocking calls at homeassistant/components/debugpy/__init__.py, line 49: debugpy.listen((conf[CONF_HOST], conf[CONF_PORT])) The logs is showing this line and HA has become instable, very slow at bootstrap. Try to disable debugpy: and see what happens.

farmio commented 2 years ago

This has just been fixed in beta 4 (#65252). But I think it's expected that HA could run noticeably slower when the debugger is attached.

arcano81 commented 2 years ago

Upgraded to HA 2022.2.0b4 Now the system boots correctly. It's a bit slower but everything is up and running. The KNX disconnection now happens again every 10 seconds even if rout back is disabled. Here you can find logs from HA and diagnostic form KNX integration. config_entry-knx-e0f52016e11797923387412866023651.json.zip home-assistant (2022.2.0b4)+debugpy.zip

I will roll back to HA 2022.2.0b3 which has a tolerable behaviour with KNX disconnection while waiting for a fix. Thanx

farmio commented 2 years ago

@arcano81 to me it seems your HA instance is just a tick too big for the Raspberry, but I could be totally wrong there. There are just so many timeout occasions and asyncio warnings on different parts of the system.

I'd like to try to enable threading for xknx on your HA to see if it fixes at least knx, if you are interested feel free to ping me at Discord farmio#5918 (or at xknx discord)

Blade0001 commented 2 years ago

Dear Folks,

I have the same issue.

My environment:

HA 32bit Core Version: 2021.12.10 on Raspi 4 4 GB LAN Connection MDT SCN-IP000.03 KNX IP Interface FW Version 3.03. Application Programm Vers. 2.1

Till yesterday this error occurs nearly every 5 -6 minutes After farmio gives me the hint I investigated for a couple of hours.

I Found the following issues:

Node-Red I use it as workflow engine in homeassistant.

  1. I found a couple of problems here not sure what impact these single issues had.

Normally I use the ultimate knx plugin which creates on own knx connection next to homeassistant. Unfortunately, you can not set a tunneling group address as well as for xknx

  1. I had a misconfiguration in one node that sent an unexpected payload to a group address xknx.exceptions.exception.CouldNotParseTelegram: <CouldNotParseTelegram description="payload invalid" destination_address="3/1/5" device_name="LED WHZ Voute Außen" feature_name="State" payload="<GroupValueWrite value="<DPTArray value="[0x3]" />" />" source_address="1.1.12"/>

After that happens the connection got lost.

2022-02-02 22:33:43 WARNING (MainThread) [xknx.log] Received DisconnectRequest from tunnelling sever.

  1. I had a second knx integration that was not completely deleted, I deleted it.

Integrations

I've had a couple of buggy integrations that keep trying to connect to a network endpoint and fail.

2022-02-02 22:33:46 WARNING (MainThread) [homeassistant.components.onvif] Couldn't connect to camera 'test', but will retry later. Error: All connection attempts failed 2022-02-02 22:33:51 ERROR (MainThread) [xknx.log] Unexpected xknx error while processing incoming telegram <Telegram direction="Incoming" source_address="1.1.12" destination_address="3/1/5" payload="<GroupValueWrite value="<DPTArray value="[0x3]" />" />" /> homeassistant.exceptions.HomeAssistantError: Error when calling async_set_default for bulb Yeelight Ceiling10 0x652d1fb at 192.168.1.33: {'code': -5000, 'message': 'general error'}

I fixed that or deactivated the integrations.

For now, it looks much better. The last disconnection was 15 hours ago

--

Blade0001 commented 2 years ago

Does it make sense to add hysteresis to the function that reports connection aborted? That this function does not report every single micro disconnect?

farmio commented 2 years ago

@Blade0001

xknx.exceptions.exception.CouldNotParseTelegram: <CouldNotParseTelegram description="payload invalid" destination_address="3/1/5" device_name="LED WHZ Voute Außen" feature_name="State" payload="<GroupValueWrite value="" />" source_address="1.1.12"/>

This looks bad, but actually doesn't cause anything other than this log message:

2022-02-02 22:33:51 ERROR (MainThread) [xknx.log] Unexpected xknx error while processing incoming telegram <Telegram direction="Incoming" source_address="1.1.12" destination_address="3/1/5" payload="<GroupValueWrite value="" />" />

Granted, it should be handled in a less alarming way, but other than that it is not a cause for a disconnect. I'd recommend to update to 2022.02 and see if you get disconnections again.

I had a second knx integration that was not completely deleted, I deleted it.

This however can really mess with your connection.

h4nnes commented 2 years ago

I have (had?!) the same problem and after switching off the support of "slow tunneling connections" in my MDT IP Interface, I don't see any errors anymore regarding the disconnect in my log file. Can anyone confirm this?

Edit: Yesterday I turned on the "slow tunneling connections" again and I got the errors with the disconnect again

@farmio: What is the recommendation for this?