OTGW disconnecting again?

andriej commented 5 years ago

Is 0.5b0 used in 0.100.2? Since this version I've started to have disconnect issues again (so OTGW seems connected, but no updates)

It is even more strange that I've been using OTGW without issues for a quite long time. Also, I have LAN gateway so no wifi issues

mvn23 commented 5 years ago

Reviewing the changes in HA between 0.99.3 and 0.100.2 I don't see anything that could cause this. pyotgw 0.5b0 will be in the next HA version (probably 0.101) but the only change there is an added disconnect() function. Does your HA log provide more info? Please enable debug logging in HA for the following modules:

homeassistant.components.opentherm_gw
pyotgw.pyotgw
pyotgw.protocol

andriej commented 5 years ago

So far I try to find situation which may cause issues. So far everything seems stable (still), will turn on debug after I find something may cause it.

andriej commented 5 years ago

@mvn23 I've been able to reproduce it 'easily', just changing value via UI. I've redacted the quite-long log

after startup
normal operation (T/B messages received)
change via UI - where T/B doesn't show anymore
hard reset of power - back to operation

Here is the log - if there's anything to check, let me know.

andriej commented 5 years ago

During the outage HA showed OTGW as available and even the readings as you can see:

andriej commented 5 years ago

Last time I've been using iSense workaround fix - as it was no-heating time of year, I'm right now not sure if it got migrated into the final code.

I've tried to reproduce issue, so far without success.

mvn23 commented 5 years ago

The iSense fix was integrated in v0.4b3 and I can tell from the logs that it is active in your setup, but thanks for mentioning it as it may be related. In the log you provided, there seem to be 7 minutes missing around line 145. During that time, the temperature override was activated - seemingly without triggering the issue - while later on changing the override does trigger the issue. How did you activate the override the first time?

andriej commented 5 years ago

Via HA UI

mvn23 commented 5 years ago

Could you include that part of the log as well?

andriej commented 5 years ago

I misunderstood, here's the missing part

mvn23 commented 5 years ago

Are you making any calls to the opentherm_gw.set_outside_temperature service? Maybe in an automation? Can you confirm that this is breaking stuff when called manually?

andriej commented 5 years ago

I'm making indeed - I don't have physical outside temp. sensor connected to boiler so I emulate it with my Aqara zigbee one. Everytime it change it's state, it's reported to boiler (and showed properly on thermostat too).

- alias: "OTGW: set outside"
  initial_state: true
  trigger: 
    platform: state
    entity_id: sensor.temperature_zewnatrz
  action:
    service: opentherm_gw.set_outside_temperature
    data_template:
      temperature: "{{ states('sensor.temperature_zewnatrz') }}"
      gateway_id: "termostat"

I have just tested it by calling it from 'Service' part of HA and - funny part - it showed the -2.5 value (-2.3 reported) on thermostat, but other parts didn't update (like temp change). In log it seems it got stuck 10 minutes at least before my test, as I was restarting HA. here's the log

mvn23 commented 5 years ago

Thanks for the log. Also here it seems that the issue is triggered by the outside temperature command. Can you run some tests with the automation disabled? Just to see if manually running the command causes the issue? I can't reproduce it locally so anything that can narrow down the root cause will help a lot.

andriej commented 5 years ago

One more test-pattern I've run. Here is the log and here are the steps: 22:28 - HA restart 22:29 - OTGW loads and propagate values 22:30 - 20.5 propagates (set before restart) 22:30 - changed to 21.5 (21 is default on thermostat) 22:31 - changed to 'back to program' on thermostat -> 21 -> HA shows the change 22:33 - on thermostat i set 20 it seems dead now, no further messages (B/T in debug)

I will disable the automation right now and restart (power on/off) the gateway

andriej commented 5 years ago

So here it is with only one outside report quite long before dead-lock. Log file pattern: 22:36 -> 20 -> 21 on thermostat (21 default, "continue program") 22:38 -> 21 -> 22 on thermostat [propagated to HA] 22:39 -> 22 -> 20.5 on HA UI 22:40 -> set outside for -2.3 (default from example) 22:41 -> 20.5 to 21 on thermostat 22:42 -> 21 to 21.5 in HA UI connection dead (no B/T messages)

andriej commented 5 years ago

I'd like to come back with any new findings, but since 2019-10-21 02:50:38 (last restart of HA) connectivity is working. I've restarted HA just right now to get logfile lower (it got like >20MB in size). Also, what I can't see in log are reports of temperature outside, even tho automation is enabled. It may be because I was changing some things without restarts, will check.

andriej commented 5 years ago

Just random finding. Next day without 'issue', except - temperature changed first on thermostat, then i UI gives 'O' command all the time in the log. Except, it still works. logfile maybe it will broaden search for issue

mvn23 commented 5 years ago

That would be the iSense fix doing its thing.

andriej commented 5 years ago

Ok, glad it's working then. :-) Will look in the log for any issues if they appear.

andriej commented 5 years ago

Got it! It 'stopped' working exactly the moment that scheduled (on thermostat) value kicked in for night. log

there's gap in time at the end of log after 23:01:18

andriej commented 5 years ago

I've tried to 'reset gateway' with service from HA. It looks in log like it still can see the values previously, even tho there's no data incoming. Maybe it's worth checking if the B/T messages are there? Just an idea. log of service reset

what has helped (just wanted to test) was pulling out and putting back in the RJ45 LAN cable. watchdog triggered and otgw has reconnected logfile from the moment - I guess it confirms it's somehow worth checking the B/T messages

mvn23 commented 5 years ago

The fact that commands are still working and the watchdog is still active leads me to believe that the connection with the gateway remains intact, but the message processing task gets stopped/blocked somehow. Of course I did not build in enough debugging output to confirm this, so we'll have to modify some files first. If I provide you with some .patch files, can you modify your existing pyotgw files or would you prefer I create a custom component?

andriej commented 5 years ago

I can do both, but patch files seems easier :-) I have it in venv so will be easy.

andriej commented 5 years ago

One more situation when it was all hanged and re-plug of LAN cable made it reconnect. log

andriej commented 5 years ago

Just informational: I've moved to 0.101 (if it make any difference for patch files)

mvn23 commented 5 years ago

Sorry for the delay, I will have time to look at this later this weekend.

andriej commented 5 years ago

Not a problem, I will also be able to start debugging since sunday minimum. :-) Good thing is, that untouched - it just works.

mvn23 commented 5 years ago

Can you reproduce the issue with the following patch applied to protocol.py and post the debug log? logging.diff

andriej commented 5 years ago

First try: thermostat -> to 22.5 HA -> to 22 thermostat -> to 22.5 thermostat -> to 21 [dead? didn't update in HA] (tried) HA -> to 23 [temperature changed on thermostat] [22:29] thermostat -> to 20.5 (again) [no update in HA] [22:30] HA -> to 22.5 [updated on thermostat... what is happening?] [22:31] HA -> to 22 [ok] [22:33] HA -> to 21 [22:33] thermostat -> to 20.5 [not updated in HA, HA is on 21] [22:35] thermostat -> to 20 [not updated in HA, HA is on 21] [22:36] HA -> to 19 [thermostat updated to 19...] [22:36] HA -> to 18.5 [ok] [22:37] thermostat to 18 [no updated in HA, stayed on 18.5] [22:38] on thermostat I cancel override and back to scheduled 21 HA still on 18.5 [22:39] thermostat -> 21.5 [HA still 18.5] [22:40] HA -> 20 thermostat updated So far I can't reproduce total connection loss which worries me somehow, as patch is the only LOGGER lines added. logfile

andriej commented 5 years ago

[22:56] HA -> 20.5 [22:57] HA -> 20 [22:58] TSTAT -> 19 [22:59] TSTAT -> 20 (not visible in HA, HA still on 19) [23:00] TSTAT -> cancel override, back to 21 from schedule automatically (HA stay still and show 19) [23:01] TSTAT -> 21.5 (so back to overriding scheduled value; HA still on 19) [23:02] HA -> 22 (ok, value changed after a while on thermostat, first 'override' hand icon then the value propagates) [23:03] TSTAT -> back to 19 (HA still on 22) [23:04] TSTAT -> cancel override, back to 21 (HA still on 22)

So instead of 'dying' now I have more like a one-way solution. logfile

mvn23 commented 5 years ago

Looks like my suspicion is being confirmed: The standard T/B/R/A messages stop being processed for some reason and the message queue keeps growing. On both logs you provided, the last message being processed shows a line like this: Processing: A xx 09 xx xx (this is exactly where the iSense quirk is handled) Can you try a few times and just confirm that the last processed message before the command queue starts growing is one of those? In that case it's definitely a bug in the iSense quirk handling and I will know where to look. This will also explain why I'm not seeing the issue at home.

mvn23 commented 5 years ago

Can you try this possible fix? maybe_fix.diff

andriej commented 5 years ago

[16:32] TSTAT -> 28.5 [16:40] HA -> 28 (didn't change on termostat so...) [16:43] HA -> 27.5

so no effect on iSense while message counter 'works', doesn't go up. logfile

mvn23 commented 5 years ago

I can tell from the log that your thermostat is not receiving the updated temperatures. The problem is that the thermostat is supposed to request these updates from the OpenTherm Gateway (message T00090000). In the last log you posted (with fix) this does not happen at all. As the initial problem was likely triggered by the gateway's response to these messages, there's no way to tell if it's fixed from this log. Do you have an option to reset your thermostat? If it's powered by the gateway this can be achieved by unplugging and re-plugging the gateway.

andriej commented 5 years ago

Sorry I was resetting my iSense to default and could turn turn off remote control. Ignore last log, will re-check properly all settings before next try!

andriej commented 5 years ago

[18:31] HA -> 21 [18:31] TSTAT -> 21.5 [18:33] HA -> 22 (for a moment HA showed 20 as setpoint and then proper 22 - I know it's as it should be, just notifying) [18:34] TSTAT -> 22.5 [18:35] HA -> 23 (again, 20 and updated to 23 seconds after) [18:35] TSAT -> 23.5 [18:36] HA -> 24 (again, 20 -> 24) [18:37] TSTAT -> cancel override back to default 20) [18:37] HA -> 21

ALL of above commands worked properly, problem seems to be fixed! I'm once again sorry for the reset of iSense and forgot the option of 'remote access' not turned on. I've seen it as F200 error - no external temperature sensor while I was setting whole thing. :-)

Logfile just to confirm it works.

mvn23 commented 5 years ago

Fixed in 6962e5ad9158646452d90f27a3ad2841fb608c03

mvn23 / pyotgw

OTGW disconnecting again? #10