home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

OpenTherm Gateway serial connection issue #101518

Closed TomEkk33 closed 11 months ago

TomEkk33 commented 1 year ago

The problem

Hello, I see the following error in the log while the OpenTherm Gateway integration is communicating with the hardware, when settings are read or written. While this happens, all OpenTherm sensors become "unavailable", and after a few seconds they come back to life. The hardware is connected to HA using socket://localip:25238 over WiFi.

Generally the OpenTherm Gateway works, but this log entry is generated from time to time.

What version of Home Assistant Core has the issue?

core-2023.10.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

OpenTherm Gateway

Link to integration documentation on our website

https://www.home-assistant.io/integrations/opentherm_gw/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Logger: homeassistant
Source: util/async_.py:121
First occurred: 09:35:11 (1 occurrences)
Last logged: 09:35:11

Error doing job: Exception in callback SerialTransport._call_connection_lost(None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.11/site-packages/serial_asyncio/__init__.py", line 417, in _call_connection_lost
    self._serial.close()
  File "/usr/local/lib/python3.11/site-packages/serial/urlhandler/protocol_socket.py", line 104, in close
    time.sleep(0.3)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 164, in protected_loop_func
    check_loop(func, strict=strict)
  File "/usr/src/homeassistant/homeassistant/util/async_.py", line 121, in check_loop
    raise RuntimeError(
RuntimeError: Detected blocking call to sleep inside the event loop. Use `await hass.async_add_executor_job()`; This is causing stability issues. Please report issue

Additional information

No response

home-assistant[bot] commented 1 year ago

Hey there @mvn23, mind taking a look at this issue as it has been labeled with an integration (opentherm_gw) you are listed as a code owner for? Thanks!


mvn23 commented 1 year ago

Connection drops are common on the WiFi version of the gateway. Unfortunately we never identified a real root cause. The log message is a known bug in pyserial-asyncio that gets triggered upon disconnect. Since the gateway comes back online right away it's safe to say that the reconnect logic is working properly, so it looks like both Home Assistant and pyotgw are doing what they should.

You could try to flash different firmware on your WiFi chip (I assume an ESP8266 variant?) to see if it helps. As long as the firmware can pass through the raw serial data to the network it should work, but there have been some problems with specific releases in the past.
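
To illustrate the kind of problem the log message describes: a synchronous `time.sleep()` called from a callback stalls every other task on the asyncio event loop, while the same call handed to an executor does not. The sketch below is not the pyserial-asyncio or Home Assistant code; `blocking_close`, `heartbeat`, and `max_gap` are hypothetical names, and the 0.3 s delay simply mirrors the `time.sleep(0.3)` seen in the traceback.

```python
import asyncio
import time


def blocking_close(delay: float = 0.3) -> str:
    """Hypothetical stand-in for a close() that sleeps synchronously."""
    time.sleep(delay)
    return "closed"


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a timestamp every 50 ms; stalls in the loop show up as gaps."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.05)


async def max_gap(call_blocking_directly: bool) -> float:
    """Return the largest gap between heartbeat ticks around the close call."""
    ticks: list = []
    stop = asyncio.Event()
    task = asyncio.create_task(heartbeat(ticks, stop))
    await asyncio.sleep(0.1)
    if call_blocking_directly:
        # Runs on the event loop thread: nothing else can run meanwhile.
        blocking_close()
    else:
        # Runs in the default thread pool: the loop keeps serving other tasks.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, blocking_close)
    await asyncio.sleep(0.1)
    stop.set()
    await task
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("direct call, largest gap:", round(asyncio.run(max_gap(True)), 2))
    print("executor,    largest gap:", round(asyncio.run(max_gap(False)), 2))
```

With the direct call, the heartbeat pauses for at least the full sleep; with the executor, it should keep ticking at roughly its normal interval.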

TomEkk33 commented 1 year ago

Thank you for your answer. Yes, I have the OTGW from NodoShop with a WiFi WeMos D1 Mini running its firmware dated 20-03-2023 (from NodoShop). The WiFi router is 100 cm away from the hardware.

The OTGW firmware is the latest one.

I'm not sure if this is the right place to report this issue, so let me know where I should submit it.

Occasionally I also see the following logs:

2023-10-06 06:13:00.903 WARNING (MainThread) [pyotgw.commandprocessor] Unknown message in command queue: SC: 06:13/5
2023-10-06 06:13:00.903 WARNING (MainThread) [pyotgw.commandprocessor] Command PR failed with SC: 06:13/5, retrying...
2023-10-06 06:32:00.972 WARNING (MainThread) [pyotgw.commandprocessor] Unknown message in command queue: SC: 06:32/5
2023-10-06 06:32:00.972 WARNING (MainThread) [pyotgw.commandprocessor] Command PR failed with SC: 06:32/5, retrying...

when I try to change the target temperature from HA to start the boiler heating. The flame doesn't start.

Gateway overwriting is also enabled.

mvn23 commented 1 year ago

Don't connect multiple things to the gateway at the same time. See also the warning in the docs.

TomEkk33 commented 1 year ago

Thanks, yes, I know this. Only HA is connected, nothing else.

mvn23 commented 1 year ago

You may want to double check that. Something other than Home Assistant is trying to set the clock on your gateway. Otmonitor, for example, does this by default, I believe.

TomEkk33 commented 1 year ago

Thank you, I checked all other sources and nothing else is trying to use OTGW.

What else should be enabled to allow Climate.Dietrch (the entity name in my case) to start the flame in the Dietrich boiler by raising the target temperature in HA?

When I do this using the physical thermostat (SALUS WQ610 RF) it works, and Climate.Dietrch in HA reflects the exact setting of that thermostat (a target temperature change on the WQ610 RF is visible instantly in Climate.Dietrch). The flame starts.

TomEkk33 commented 11 months ago

I tried to find a pattern with

Command PR failed with SC: 06:13/5, retrying...

however it happens 5 to 8 times per 24h with random SC values. Nothing else is connected to OTGW, just HA. Maybe it happens when Wifi is reconnected.
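
One way to look for a pattern is to compare each SC value with the timestamp of the log line that reports it. In the two warnings quoted earlier, the SC value equals the logged hour and minute plus the ISO weekday (2023-10-06 was a Friday, weekday 5). The sketch below automates that comparison; it assumes the SC value has the form `HH:MM/weekday` with Monday as 1, and `check_sc_against_clock` is a hypothetical helper, not part of pyotgw or Home Assistant.

```python
import re
from datetime import datetime

# Matches the "Command PR failed" warnings in the format shown in this thread.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ WARNING .*"
    r"Command PR failed with SC: (?P<sc>\d{2}:\d{2}/\d), retrying"
)


def check_sc_against_clock(lines):
    """Return (timestamp, sc_value, matches_clock) for each failed PR warning."""
    results = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        logged = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
        # Assumed SC format: hour:minute/ISO weekday (Monday = 1).
        expected = f"{logged:%H:%M}/{logged.isoweekday()}"
        results.append((match["ts"], match["sc"], match["sc"] == expected))
    return results


if __name__ == "__main__":
    sample = [
        "2023-10-06 06:13:00.903 WARNING (MainThread) [pyotgw.commandprocessor] "
        "Unknown message in command queue: SC: 06:13/5",
        "2023-10-06 06:13:00.903 WARNING (MainThread) [pyotgw.commandprocessor] "
        "Command PR failed with SC: 06:13/5, retrying...",
        "2023-10-06 06:32:00.972 WARNING (MainThread) [pyotgw.commandprocessor] "
        "Command PR failed with SC: 06:32/5, retrying...",
    ]
    for row in check_sc_against_clock(sample):
        print(row)
```

If most rows come back as matching the clock, the SC values are likely clock-setting commands rather than random data, which would point at a second sender on the serial link.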

mvn23 commented 11 months ago

> I tried to find a pattern with
>
> Command PR failed with SC: 06:13/5, retrying...
>
> however it happens 5 to 8 times per 24h with random SC values. Nothing else is connected to OTGW, just HA. Maybe it happens when Wifi is reconnected.

This may be caused by the firmware you're using. Nodo-shop refers to this firmware on their website, which also interprets the protocol and sends some commands to the gateway (the changelog refers to SC among others). Despite what they state on their wiki, this makes it incompatible with the native OpenTherm Gateway integration as it relies on exclusive access to the serial protocol.

TomEkk33 commented 11 months ago

Thank you for your reply. Can you tell me which other firmware (if any) should be OK for this OTGW from Nodo-shop? Currently I use the original firmware version as shipped from the shop, and the HA OpenTherm integration connects to port 25238 using socket://ip:25238. I do not use MQTT.

rotilho commented 11 months ago

Edit: I found the issue. The root cause was me disallowing egress traffic from the gateway so it could not access the NTP server.


I suspect I may be affected by this issue. I've also been using OTGW from Nodo Shop since last year, and it worked flawlessly through fall and winter.

Yesterday, I realized the rooms were not heating up, so I began to investigate (I'm using it in Standalone mode). I haven't been able to pinpoint the issue yet, but I'm starting to believe it may be related to socket disconnections. In my case, the entities don't reach the point where they become unavailable, but the CH setpoint resets to 0 after a minute.

I have the latest firmware now, but I only upgraded when I noticed the resets.


mvn23 commented 11 months ago

> Thank you for your reply. Can you tell me which other firmware (if any) should be OK for this OTGW from Nodo-shop? Currently I use the original firmware version as shipped from the shop, and the HA OpenTherm integration connects to port 25238 using socket://ip:25238. I do not use MQTT.

I haven't tested any specific firmware, but anything that passes the serial data without altering or adding any data should work.

> the CH setpoint resets to 0 after a minute.

This is a feature, not a bug. It was introduced in OpenTherm Gateway firmware 5.2 as a safety precaution. See also the changelog and the command documentation.

> Limit the validity of a CS command. A remotely set control setpoint expires after just over a minute. This is a safety feature to prevent runaway heating when the controlling program loses its connection, or crashes.

> CS=temperature Control Setpoint - Manipulate the control setpoint being sent to the boiler. Set to 0 to pass along the value specified by the thermostat. To stop the boiler heating the house, set the control setpoint to some low value and clear the CH enable bit using the CH command. A CS command with a value of 8 or higher must be repeated at least every minute as long as adjustment is needed. This is a vigilance check to prevent runaway heating in case the controlling program loses its connection, or crashes. Warning: manipulating these values may severely impact the control algorithm of the thermostat, which may cause it to start heating much too early or too aggressively when it is actually in control. Examples: CS=45.8, CS=0

If you want the CH setpoint to persist, please create a timer to repeat the command every minute (or less).
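
As a rough illustration of such a timer, the sketch below resends a setpoint on a fixed interval that is shorter than the one-minute expiry. It is not tied to any real client library: `send_setpoint` is a hypothetical coroutine you would supply (for example, a wrapper around whatever issues the `CS` command in your setup), and the 30-second default interval and the `repeats` parameter are assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def keep_setpoint_alive(
    send_setpoint: Callable[[float], Awaitable[None]],
    temperature: float,
    interval: float = 30.0,
    repeats: Optional[int] = None,
) -> int:
    """Resend the control setpoint every `interval` seconds.

    `repeats` limits the number of sends (None means run until cancelled).
    Returns how many times the setpoint was sent.
    """
    sent = 0
    while repeats is None or sent < repeats:
        await send_setpoint(temperature)
        sent += 1
        # Only wait if another send is still due.
        if repeats is None or sent < repeats:
            await asyncio.sleep(interval)
    return sent


async def _demo() -> None:
    async def fake_send(temperature: float) -> None:
        # Placeholder for the real command; prints instead of talking to hardware.
        print(f"CS={temperature}")

    await keep_setpoint_alive(fake_send, 45.8, interval=0.01, repeats=3)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Cancelling the task (or letting the program exit) stops the repeats, after which the gateway's own expiry takes over, which is the point of the safety feature.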

rotilho commented 11 months ago

@mvn23, I fixed the problem by allowing egress traffic from the gateway (to access the NTP server). However, I wasn't aware that you need to repeat the setpoint in standalone mode; I thought it would maintain the setpoint as long as the socket is open. I believe the Nodo firmware abstracts this for Home Assistant, since I don't need to repeat the setpoint myself.

TomEkk33 commented 11 months ago

> @mvn23 I haven't tested any specific firmware, but anything that passes the serial data without altering or adding any data should work.

Can you tell me what firmware works on your hardware?

> @rotilho I fixed the problem by allowing egress traffic from the gateway (to access ntp server).

A connection to my local NTP server (instead of the default external one) is enabled now. This change seems to have reduced the number of failed SC commands a bit in my case, but the "Command PR failed with SC:" warning still occurs.

Can you point me to how to override the thermostat temperature (WQ610 physical device) to start the boiler heating from HA? https://github.com/home-assistant/core/issues/101518#issuecomment-1750286212 Should I switch to MQTT to be able to do this?

> Error doing job: Exception in callback SerialTransport._call_connection_lost(None)

Has anyone used the dev version of the WeMos D1 Mini firmware from https://github.com/rvdbreemen/OTGW-firmware/tree/dev (last updated 12-Sep-2023)? Does it help with any of these issues?

I have version 0.10.2.

In reboot_log.txt of "FSexplorer ESP" I found:

2023-10-16 14:49:28 - reboot cause: Exception (2) - Access to invalid address (29)
ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
2023-10-12 12:14:58 - reboot cause: Exception (2) - Access to invalid address (29)
ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
2023-10-10 05:07:34 - reboot cause: Hardware Watchdog (1) 
ESP register contents: epc1=0x40103f2a, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
2023-10-10 00:40:56 - reboot cause: Hardware Watchdog (1) 
ESP register contents: epc1=0x40104281, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
2023-10-07 21:55:59 - reboot cause: Exception (2) - Access to invalid address (29)
ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
2023-10-07 13:31:16 - reboot cause: Exception (2) - Access to invalid address (29)
ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

Maybe this is a hint for the developer...

rotilho commented 11 months ago

I believe you can't use standalone mode if you have a thermostat plugged in.

In the standalone mode I just call set_control_setpoint.


  - conditions:
      - condition: state
        entity_id: binary_sensor.house_heating
        state: "on"
    sequence:
      - service: opentherm_gw.set_control_setpoint
        data:
          gateway_id: house
          temperature: >-

mvn23 commented 11 months ago

> > @mvn23 I haven't tested any specific firmware, but anything that passes the serial data without altering or adding any data should work.
>
> Can you tell what firmware works in your HW?

I have a gateway with a wired ethernet module, so no ESP firmware. Something like ESPEasy or esp-link should work but there have been issues with some versions in the past, mostly related to an unstable network connection.

> Can you point me how to override the thermostat temperature (WQ610 physical device) to start boiler heating using HA? #101518 (comment) Should I switch to MQTT to be able to do this?

I don't see that particular thermostat in the OpenTherm Gateway equipment matrix. To override the room setpoint on the thermostat it needs to support message ID 9. You can use otmonitor to check that by looking in the Log tab for something like this:

16:46:08.785025 T00090000   Read-Data   Remote override room setpoint: 0.00

Where the timestamp will of course be different. The rest of the line should be the same. Using otmonitor while Home Assistant is connected to the OpenTherm Gateway may trigger some errors in your Home Assistant log, please ignore those. If you see those messages but it does not override the setpoint, you may want to confirm that your OpenTherm Gateway is set to gateway mode.
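
For readers who want to check such a line without otmonitor, here is a minimal sketch of how the `T00090000` part can be decoded. It assumes the standard OpenTherm frame layout (one parity bit, three message-type bits, four spare bits, an 8-bit data ID, and a 16-bit data value) and treats the value as signed f8.8 fixed point, which fits message ID 9 but not every data ID. `decode_frame` and the `SOURCES` labels are hypothetical names for this example, not part of otmonitor or pyotgw.

```python
# Message types as encoded in bits 30..28 of an OpenTherm frame.
MSG_TYPES = {
    0: "Read-Data",
    1: "Write-Data",
    2: "Invalid-Data",
    3: "Reserved",
    4: "Read-Ack",
    5: "Write-Ack",
    6: "Data-Invalid",
    7: "Unknown-DataId",
}

# Leading letter used by the gateway to mark the direction of a message
# (labels are this example's own wording).
SOURCES = {"T": "thermostat", "B": "boiler", "R": "gateway request", "A": "gateway answer"}


def decode_frame(message: str) -> dict:
    """Decode a gateway message such as 'T00090000' into its fields."""
    source, payload = message[0], message[1:9]
    raw = int(payload, 16)
    value = raw & 0xFFFF
    if value >= 0x8000:  # interpret the 16-bit value as signed
        value -= 0x10000
    return {
        "source": SOURCES.get(source, "unknown"),
        "msg_type": MSG_TYPES[(raw >> 28) & 0x7],
        "data_id": (raw >> 16) & 0xFF,
        "value_f8_8": value / 256,  # assumption: value is f8.8 fixed point
    }


if __name__ == "__main__":
    print(decode_frame("T00090000"))
```

For the line quoted above, this yields a Read-Data message from the thermostat for data ID 9 with value 0.0, matching the "Remote override room setpoint: 0.00" text that otmonitor prints.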

> In reboot_log.txt of "FSexplorer ESP" I found:
>
> 2023-10-16 14:49:28 - reboot cause: Exception (2) - Access to invalid address (29)
> ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
> 2023-10-12 12:14:58 - reboot cause: Exception (2) - Access to invalid address (29)
> ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
> 2023-10-10 05:07:34 - reboot cause: Hardware Watchdog (1)
> ESP register contents: epc1=0x40103f2a, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
> 2023-10-10 00:40:56 - reboot cause: Hardware Watchdog (1)
> ESP register contents: epc1=0x40104281, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
> 2023-10-07 21:55:59 - reboot cause: Exception (2) - Access to invalid address (29)
> ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
> 2023-10-07 13:31:16 - reboot cause: Exception (2) - Access to invalid address (29)
> ESP register contents: epc1=0x4000df64, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000
>
> Maybe this is some hint for developer ...

This is not related to Home Assistant. If you experience problems with unexpected reboots of your gateway with that firmware, please report it to https://github.com/rvdbreemen/OTGW-firmware

TomEkk33 commented 11 months ago

Thanks for your answers. The WQ610RF is not officially supported. The temperature set through the OTGW holds only for a short while (1 minute at most), then the WQ610RF thermostat overwrites it.

To summarize, there is no known fix for:

  1. "Error doing job: Exception in callback SerialTransport._call_connection_lost(None)"
  2. "Command PR failed with SC:" warnings