vapourismo / knx-go

KNX clients and protocol implementation in Go
MIT License

Problem in Telegraf that connection loss from KNX-Router is not recognized #70

Closed stonie2oo4 closed 1 year ago

stonie2oo4 commented 1 year ago

Hello, I have an issue in Telegraf, which uses your code. When the KNX router loses its connection (for example during an update of the switch or router), Telegraf doesn't recognize it. After the connection to the KNX router is re-established, Telegraf doesn't notice that either. It only works again after I restart Telegraf manually.

I already opened a report in the Telegraf repo: https://github.com/influxdata/telegraf/issues/13632

There, powersj asked me to file an upstream issue here: is there a way to detect via an error when the server goes down?

vapourismo commented 1 year ago

Unfortunately, KNX routing doesn't have a connection concept. Therefore a disappearing router should not be a problem - you simply don't get any events during its absence.

Are you sure you're using a KNX router and not a gateway/tunnel?

stonie2oo4 commented 1 year ago

Thank you for your fast reply. Yeah, I'm pretty sure 😅. It's a KNX router from Enertex: https://www.enertex.de/d-katalog-router.html

But I'm using the tunnel connection, because the KNX router and Telegraf are in two different subnets.

When I open a telnet connection to the KNX router, I see the tunnels that are in use. Tunnels 1 and 2 are Logicmachine and Visu, in the same subnet as the KNX router. Tunnel 3 is Telegraf. After the connection is lost and I reconnect to the KNX router via telnet, tunnels 1 and 2 are back again, but tunnel 3 is not. Only after I restart Telegraf is the connection on tunnel 3 back.

It's as if the tunnel connection must be opened again for it to work properly. I'm no coder, and sorry if my question is a little dumb 😅, but is it possible that a tunnel is opened when the "knx-plugin" starts and not again during its runtime? Which would normally be a good thing ☺️.

And if nothing checks whether the connection is up (again), the tunnel does not get reopened?

Or is it more likely that the problem is my setup? During normal operation the plugin works really, really smoothly 👍.

By the way, sorry for my bad English; I hope you understand what I'm trying to explain 😅.

vapourismo commented 1 year ago

That device may be called "router" but it is actually a 2-in-1 router+gateway. You're using it as a gateway (i.e. in tunnelling mode).

In this case it would be good to have access to the knx-go logs.

Normally the knx-go tunnel will try to reconnect if the gateway requested a disconnection or when the gateway fails the heartbeat check. Seemingly this doesn't work in your case.

stonie2oo4 commented 1 year ago

Is it possible to get this log with only the Telegraf installation, or must I install the knx-go package separately?

vapourismo commented 1 year ago

I would hope the former is possible but I don't know telegraf at all.

Alternatively, you could adapt the group tunnel example from this project's README.

stonie2oo4 commented 1 year ago

OK, thank you very much for your help. I have now tested the connection with multicast; it was a bit of a hassle to get it working across VLANs. But what can I say, now it works as it should :). After the connection is lost and comes back, Telegraf reconnects automatically. Is it possible that the heartbeat is only sent over multicast?

For me this solution is nearly perfect. But if my assumption about the heartbeat is correct, maybe you could include this info in the README? Thank you once more for your time :).

vapourismo commented 1 year ago

The opposite is actually true, routing does not use heartbeats at all whereas tunnelling does.

And I need to stress this, KNX routing has no concept of a connection. Hence no connection or reconnection will take place ever. This is exactly why it works for you, the absence of the KNX router is of no concern to knx-go. It simply doesn't receive any routing indications while the KNX router is offline. Once it comes back online, the routing indications are multicast again and this package will receive and process them again.

It would still be useful to have logs of the issue with your tunnelling connection in order to establish what knx-go might be doing incorrectly.

stonie2oo4 commented 1 year ago

That makes sense 😅. I'll try to install the knx-go package separately to get the right logs, but probably not before the end of the week, maybe on the weekend.

WolfgangD commented 1 year ago

I have the same original problem, although I'm using a Gira X1, which only supports tunnelling. Whenever there are connection problems, I need to restart Telegraf. The tunnel is not automatically reconnected.

I've tried UDP and TCP, without success.

Here is some context of what the telegraf knx plugin uses to connect at start-up of the plugin:

    case "tunnel_tcp":
        tunnelconfig := knx.DefaultTunnelConfig
        tunnelconfig.UseTCP = true
        c, err := knx.NewGroupTunnel(kl.ServiceAddress, tunnelconfig)
        if err != nil {
            return err
        }
        kl.client = &c
    case "router":
        c, err := knx.NewGroupRouter(kl.ServiceAddress, knx.DefaultRouterConfig)
        if err != nil {
            return err
        }
        kl.client = &c

    // ...

    // Listen to the KNX bus
    kl.wg.Add(1)
    go func() {
        kl.wg.Done()
        kl.listen()
    }()

In kl.listen() it calls kl.client.Inbound() to receive the KNX messages.

I guess the questions are: 1) Is the reconnect transparently done by knx-go and should therefore work as done by the telegraf plugin OR does the telegraf plugin need to listen for a disconnect and try reconnecting by itself? 2) If there is a transparent automatic reconnect, is the setup code missing something in order to enable this?

I'm happy to make a PR for the necessary changes to the telegraf plugin, so I'd appreciate your help.

vapourismo commented 1 year ago

I've partially answered these already above but let me rephrase/summarise the expected behaviour:

As long as the Inbound channel is open, this package will maintain communication with the gateway (in the case of tunnelling). This means that from Telegraf's perspective, if it can pull from the channel, there is nothing for it to do: knx-go will handle heartbeat timeouts and disconnections. However, there is the response timeout that knx-go adheres to. If it tries to reconnect but doesn't get a response in time, it will give up, which results in a closed Inbound channel.

I think there is an opportunity to add another option which could configure knx-go to re-try connection requests for an extended period of time. At the moment it will only try to do so once. I believe this to be the culprit given Telegraf might not handle the closed Inbound channel gracefully.

To assess whether this would address the issue, I need some logs. Are you able to extract those and post here?
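
Editorial sketch: the closed Inbound channel described above can be observed from the consumer side using ordinary Go channel semantics. This is not Telegraf's or knx-go's code. `event` and `consume` are hypothetical names, and a plain channel stands in for the one returned by `Inbound()`. The point is only that a `range` loop over the channel ends when the channel is closed, which is the signal that a new client has to be created.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a KNX group event.
type event struct {
	Destination string
	Data        []byte
}

// consume drains inbound until it is closed and reports how many events it
// handled. Returning from this function means the channel was closed, so the
// caller has to create a new client if it wants to keep receiving.
func consume(inbound <-chan event, handle func(event)) int {
	n := 0
	for ev := range inbound { // the loop ends only when the channel is closed
		handle(ev)
		n++
	}
	return n
}

func main() {
	inbound := make(chan event, 2)
	inbound <- event{Destination: "1/2/3", Data: []byte{0x01}}
	inbound <- event{Destination: "1/2/4", Data: []byte{0x00}}
	close(inbound) // simulates the library giving up after a failed reconnect

	n := consume(inbound, func(ev event) {
		fmt.Printf("%s: %x\n", ev.Destination, ev.Data)
	})
	fmt.Printf("inbound closed after %d events; a reconnect is needed\n", n)
}
```

A consumer that ignores the end of this loop will simply stop receiving data without reporting an error, which may match the symptom reported in this thread.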

WolfgangD commented 1 year ago

Got it. I'm not sure knx-go would be the right place then. It only makes sense if this would be needed by other "users" of it.

As far as the Telegraf plugin goes, it should be straightforward to implement retries using exponential backoff. I'll do my research on this and, once I find time to implement it, open a PR for the Telegraf plugin.

Thanks for your help and this great library!

userwithoutpassword commented 8 months ago

Today I had the same problem. I'm using a KNX IP interface. My switch rebooted at 3 at night, and now all data since that time is missing...

vapourismo commented 8 months ago

> Today I had the same problem. I'm using a KNX IP interface. My switch rebooted at 3 at night, and now all data since that time is missing...

Not sure what you expect, this library does not store anything. It only provides access to the data.