Debugging disconnects - Githubissues

andriej commented 4 years ago

Since few days I encounter "Disconnected: none" error in HA log from pyotgw.protocol. I'm trying to debug what is happening, today I've enabled logger: debug on this.

Here is the log from the moment it first happened after restart. Is it possible to say that it's rather connectivity problem (LAN-LAN to OTGW) or could be also HA issue? Or - as there is introduction in the log, the gateway did restart itself?

mvn23 commented 4 years ago

Looks like a hiccup on the connection. As long as it doesn't happen too often and it recovers like it did I wouldn't worry too much about it.

andriej commented 4 years ago

How about 5 times during like 2-4 hours? And it didn't happen some time ago at all (I mean like not a single time per day)

mvn23 commented 4 years ago

That seems a bit much indeed. If HA is really busy when this happens, it could happen that the watchdog hits its timeout while the connection is still alive (i.e. _inform_watchdog() does not run for 5 seconds - the default timeout). This will depend on the system you are using to run HA and all other integrations and processes on that machine. To test this, you can try to increase the inactivity_timeout on line 46 of pyotgw.py to give HA more time to complete other tasks before pyotgw will reset the connection.

andriej commented 4 years ago

New log, I've left it for some time to run and I've got 60 disconnects. The 16:11 bugs me.

The machine is not that busy, it's J3455 intel with 8GB RAM. HA instance is quite big but anyway, didn't see that much errors earlier:

I'm not that keen on increasing the time as I think it would be better to hunt down what is making machine/HA busy so much it can't take a lot more.

mvn23 commented 4 years ago

It looks like increasing the timeout won't work anyway. The reconnect is triggered after exactly 5 seconds every time, so it's unlikely that something is blocking or delaying the message processing. Do you see the same behavior when connecting the gateway via USB (if possible)? This will rule out or confirm that it's a network issue.

andriej commented 4 years ago

Unfortunately it's 2 floors away so can't USB-confirm the problem. During the timeouts pings are delivered (1 packet lost and >4000 sent from 2 LAN machines) so it's not 'wire' problem so far.

Since yesterday I've turned off logging, today tho in logs I see one new event:

and Disconnected happend almost 60 times:

mvn23 commented 4 years ago

Latest master should give some more insight in the state of the serial connection when a reconnect is triggered.

andriej commented 4 years ago

Will be checking it now. It would be nice to somehow align the data with unavailability of other platforms (if there are any) - I hope there will be issue https://github.com/home-assistant/home-assistant/issues/30167 in HA sometime fixed for such situations. Maybe before next winter season :)

andriej commented 4 years ago

Log from last night (with updated both .py files from master) - there has been disconenction without further reconnect, whole data in HA frozen since that moment. TCP adapter works (it puts up interface normally).

Strange is that even restarting gateway doesn't make HA reconnect anyway - it stopped it's retries, as it can be seen in log, had to restart right now.

andriej commented 4 years ago

And from this morning - connection dead, again, no restarts taken by HA/pyotgw?

andriej commented 4 years ago

I had to revert to latest release - with master HA couldn't get any stable connection to OTGW.

mvn23 commented 4 years ago

Oops, that's what I get for testing with python 3.8 and not reading the docs properly. Fixed in 9157766

edit: The fact that we're not seeing any error message in the log for this (I had to manually catch and log the exception to see where it went wrong) may be an indication that the error message for the connection drops is suppressed as well.

andriej commented 4 years ago

That's why... I was wondering how adding log outputs could possibly destroy communication. ;-) Anyway after fix here are some logs - disconnection still happens (usually soon after OT command but it's not a rule I think) and here is log from all of them with few seconds before and after the reconnection. I was away from home so no other jobs or LAN-cables handling in meantime.

edit: and some more logs

mvn23 commented 4 years ago

It really looks like the data just stops coming in at some point. There is no data left in the input buffer - both the pyserial input buffer and the pyotgw line buffer are empty - and the connection is not actively closed before it happens.

At this point I'm out of ideas. Short of analyzing network traffic I really don't know where to go from here. If you could post a dump of the network traffic on your HA machine along with another log then we can 100% rule out or confirm that it's a network issue. You can dump traffic with the following command:

tcpdump -w otgw.pcap -i <interface> host <ip> and port <port>

Where is the network interface you're using to communicate with the gateway, the gateway ip and the gateway port for serial traffic. The dump will be placed in your working directory as otgw.pcap. Note that this dump will contain the ip addresses and ports used for the communication. Normally they will be local addresses which are unreachable from the internet anyway so it's completely safe to include them, but since you've redacted them in your logs so far I thought I would let you know. If this is a problem we'll have to find another way to send them over.

andriej commented 4 years ago

I had an lucky moment to catch the error 3 minutes after start of dump. Here is the link for it (I'm aware of IP-related data in it, it's on my dropbox which I will remove later on).

mvn23 commented 4 years ago

Great! Can you upload the accompanying log as well?

andriej commented 4 years ago

Sure, here is the log.

andriej commented 4 years ago

The problems started to be seen after I've encountered some other problems (with esphome api integration). I didn't do anything special lately, just hass.io and some dockers on machine. Updated Debian Buster to 10.3, kernel bump for like 1 subversion, no kern.log errors or dumps (or anything error-related during the day).

mvn23 commented 4 years ago

Thanks! Looks like it's definitely an external issue, either on the network or on the gateway. There is no traffic at all on the line between 20:38:19.495623 and 20:38:24.512225. You can check this with tcpdump -r otgw.cap and look for the timestamps. I don't know why this just started happening now while it was working fine before. Either way, it's not something we can fix in pyotgw apart from detecting and reconnecting, which is what's already happening. This is why I'm closing this issue.

What I did notice is that your gateway is sending the data byte by byte, whereas my gateway is sending the data line by line. This is not very efficient and definitely not necessary in this case. You can try looking for an option in the firmware of your gateway's serial to network adapter. An update of this firmware could possibly help as well. However, this should not cause the observed connection drops as the data will be merged on the receiving side before processing. The only weird thing is that all logged disconnects since this afternoon's commit happened after a newline \r\n which may indicate that the problem is not network-related (would drop the connection anywhere in the middle of a message as well), but rather exists on the gateway itself or (more likely) on its serial to network converter.

andriej commented 4 years ago

Could you point which adapter do you have exactly?
Those are the settings I can make at the gateway's LAN interface: (TCP Server/Client and UDP Server/Client which are not the case for socket I guess)

If you could point which data to mangle:

Is the disconnection making sensors unavailable on the roadmap for OTGW? Could be easy to detect when OTGW is offline/hang up.

andriej commented 4 years ago

@mvn23 I guess the "data packing size" is option for me, just don't know what's the best value to set it at. Typical message is 9 bytes, right?

Anyway, you've directed me to the point of dropping packets without any sense and I started googling my LAN drivers from the hardware. As far as I remember problems started soon after last dist-upgrade, so I've found open and closed-sourced network drivers for my card. Checking the closed one now, I hope it was it. Later on I will investigate dropping managed switch in between.

It doesn't seem to be kernel issue so now I'm debugging network-switch. And this is how cable disconnection looked like.

mvn23 commented 4 years ago

I have a USRIOT USR-TCP232-T2 with firmware V4017. It's not the best module on the market - password length is limited to 5 characters for example :-1: - but I have no issues with connection stability.

Your settings seem fine, but you could indeed try to adjust the Data Packing options. The regular messages are 11 bytes in size (9 bytes data + 2 bytes \r\n) and get sent 2-4 times per second depending on the gateway status and message types. Command responses have a more irregular character. If you set the Data Packing Size to a relatively low number you should be fine. If you specify a higher size (say more than 88) then you should make sure to specify a time to avoid triggering a timeout, i.e. maximum 45 (4.5 seconds) to keep a bit of a margin as well. I would personally start with something like 11 or 22 bytes and a timeout of 5 (0.5 seconds).

Edit: unavailable entities is planned, but I can't give an ETA. I will have to implement it in pyotgw first. This is planned for 0.6.

andriej commented 4 years ago

@mvn23 thanks for the ETA.

I've found the issue - it was one of my non-important APs. It was failing (throwing kernel panics) and during those failures it was doing kind of packet-loops and influencing switches on the network. Turning it off solved the whole issue, no further Discconect for 24 hours so far - so situation came back to normal (except I have gateway sending 11 bytes instead of 1 byte every packet ;-))

I would find it fast because there was no tcpdump trace of issue - except for the switch blinking it's 1gbit LED on the AP's link.

mvn23 / pyotgw

Debugging disconnects #13