Stability of RS485 Connection

dehsgr commented 2 years ago

If you're using a RS485 over LAN Adapter there might be some connection interruption. If this occurs TWC Manager seems to not being able to handle this:

07:56:09 ⛽ Manager  20 Unhandled Exception:Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/serial/urlhandler/protocol_socket.py", line 171, in read
    raise SerialException('socket disconnected')
serial.serialutil.SerialException: socket disconnected

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/TWCManager/lib/TWCManager/TWCManager.py", line 806, in <module>
    data = master.getInterfaceModule().read(dataLen)
  File "/usr/src/TWCManager/lib/TWCManager/Interface/RS485.py", line 65, in read
    return self.ser.read(len)
  File "/usr/local/lib/python3.7/dist-packages/serial/urlhandler/protocol_socket.py", line 178, in read
    raise SerialException('read failed: {}'.format(e))
serial.serialutil.SerialException: read failed: socket disconnected

It would be great if we are able to handle such interruptions and perform a full re-connect to the serial interface. I'm running on docker since a few days and restarting the container solves the issue immediatly. If I can assist with more information let me know.

ngardiner commented 2 years ago

Thanks for the output - it should be straightforward for us to catch that exception and reconnect the socket. I'll let you know when there is something ready for you to test

dehsgr commented 2 years ago

@ngardiner is there anything I can assist with to speed-up a possible fix?

ngardiner commented 2 years ago

@dehsgr my apologies for the delay - can I check how you're running TWCManager, is it via Docker or is it a manual install?

I've pushed some code which should fix this for you but wasn't able to test it myself yet, but once I know how you're running it I could help with how you could give it a try

dehsgr commented 2 years ago

@ngardiner no problem. This is open source and I must not expect owners are always free of time for providing fixes!

I‘m running on docker. Changing some files in it is no matter for me.

dehsgr commented 2 years ago

@ngardiner I saw your post in changelog but couldn't find any changes in source code. Am I looking at the wrong location in the repository?

ngardiner commented 2 years ago

Sorry @dehsgr, totally my fault. I have updated it now.

If you'd like to give it a try, I've published a docker image under tag twcmanager:testing which has the fix included. I didn't publish it to twcmanager:latest yet because I also changed the underlying Debian version and didn't want to break the latest image in the process.

If you get a chance to test it and if it's working fine, I'll then update the latest docker image.

dehsgr commented 2 years ago

@ngardiner I just checked out. Your changes seems to fix the crash now. But if serial connection fails (e.g. caused by network latency or network interruption to RS485 adapter) TWCManager logs the following:

07:48:47 ⛽ Manager 20 WARNING: We haven't heard from slave 2118 for over 26 seconds. Stop sending them heartbeat messages.

Is there any way to give heartbeat e.g. 3 retries within 30 seconds? Otherwise I would have to restart docker container (or TWCManager) too on this issue, regardless if network issue was gone away or not.

dehsgr commented 2 years ago

@ngardiner I'm sorry. This was my fault. I completely removed power plug from RS485 adapter. If I just interrupt network connectivity for around 5 secs TWCManager is stable now!

Thank you very much!

I think you might merge this fix into twcmanager:latest branch now.

ngardiner commented 2 years ago

Great to hear the reconnect is working. One thing I am thinking, maybe all I need to do is to reset the last heartbeat counter when we reconnect the serial interface, as TWCManager is continuing to increment the counter while the serial connection is disconnected, during which time the slaves obviously couldn't respond, so it is invalid behaviour and is an easy fix.

That said, while I am at it, I can simply add a parameter for slave heartbeat timeout and you could then override the behaviour with whatever value works best in your environment if need be?

Give me an hour or two and I'll have an updated :testing image you could try with.

dehsgr commented 2 years ago

@ngardiner that sounds comprehensible. I will test immediately when you offer a new image. Thanks a lot! 👏🏻🫡👍🏻

ngardiner commented 2 years ago

No worries @dehsgr, I appreciate the testing (and patience while I got around to getting it fixed), I've updated the :testing image, if you could try a pull of the updated image when you have the chance, it would be great to get your feedback.

The updated code both adds an override for the timeout value and resets the Slave TWC timeout value while we try reconnecting. It would be great to see if it works with the full power down/up without making any config changes. If you do need to override the timeout, the new parameter is:

dehsgr commented 2 years ago

@ngardiner I'm sorry but same as above:

Manager 20 WARNING: We haven't heard from slave 2118 for over 26 seconds. Stop sending them heartbeat messages.

I can only resolve it with a container restart.

reklame33 commented 2 years ago

@dehsgr, You could be right that my issue #473 might be duplicate.

I have pulled the testing image, changed timeout to 60 second, but still my log show that it times out on 26 seconds. Attached is a full 2 minute charge session after restart of the container, start charging to the interrupted charging occure when slave not responded within 26 second. twcmanager log.txt

ngardiner / TWCManager

Stability of RS485 Connection #461