Open adietrich-ussignal opened 1 year ago
If a target goes down you don't see log entries for retrying a subscription recreation because the underlying gRPC client connection needs to be reestablished first. Once the gRPC client is reestablished, you should see attempts at creating the subscriptions again which should eventually succeed.
I just tested this and it seems to work. I created a subscription, restarted the router, once up gNMIc was able to reestablish the connection.
Was your conclusion based on logs alone or you didn't see any client connected and no subscription attempts on the router side ?
When the target comes back up, I only see a few re-attempts, but then nothing beyond that. Once that occurs, the grpc connection never re-establishes to the target, and I no longer get any data collected for that target in prometheus.
I have faced the same problem - if I drop sessions from gNMIc server to target device, occasionally there is data disruption in prometheus. I managed to replicate issue by dropping sessions on firewall - after some time I see sessions re-established but without data received in prometheus. I am fixing this by deleting sessions for problematic target or by deleting target using gNMIc REST API, but of course. I am looking for solution, that doesn't include manual work. :) Also I have observed, that problem occurs with targets, that generates more data.
@ddidziz this sounds like a different issue from the original one. You do see the connection getting reestablished, but no data being available on the output. Do you see data being sent from the router ? How long was the session down before re-establishment and what's the subscription sample-inverval ?
Sorry for long time it took me to reply. I did a packet capture to examine this case. Sample interval is 10s, with updates-only: true and heartbeat-interval: 1m One occasion was when I terminated session, router kept sending packets till successful re-connection, it took 1 minute. Gnmic logs says:
[gnmic] target "router1": subscription bgp rcv error: rpc error: code = Unavailable desc = error reading from server: read tcp 172.21.28.196:36184->10.10.10.25:6030: read: connection timed out
[gnmic] target "router1": subscription bgp rcv error: retrying in 10s
Another occasion was when I terminated session and router stopped sending packets after some random time. This time re-connection didn't happen. Gnmic log entry:
[gnmic] target "router1": subscription bgp rcv error: rpc error: code = Internal desc = unexpected EOF
[gnmic] target "router1": subscription bgp rcv error: retrying in 10s
I noticed when unexpected EOF message is received, connection isn't reestablished.
I did a lot of looking at this over the last few days, for sample
subscriptions (updates-only
is false and there's no heartbeat timeout, or at least I'm not configuring either).
I think that if a client connection has made it to the point of adding subscriptions, it can lose its connection and it'll recover; at least, if I gNMI to something via a ssh tunnel and kill the tunnel once things are up, it'll reestablish.
It doesn't log when it does reestablish the connection or when it resends the subscriptions, and it might be good if it did that. It's easy to jump to the conclusion that the connection dropped and didn't come back up because one sees the "it broke" logs but never any corresponding "it's ok again" logs.
But I am pretty sure that there's a race. You can't just create a Target
and then subscribe without connecting explicitly, because target.Client
is nil but that gets dereferenced really early on. So you have to use, or it seemed like you had to use, target.CreateGNMICLient()
to get connected, then make your subscription requests.
However, if the connection fails between getting connected and making the subscription requests (say, because you have a bunch of consumer threads for different things, and they all have a startup delay, just to pick an example, totally at random 😄 ), the connection -- and, thus, the subscriptions -- will never be reestablished.
How all of this would change for an on-change subscription, or some other subscription that isn't getting data routinely, I am not certain.
I've noticed when devices become unavailable, or an unexpected RPC error occurs, there is no attempt to continually re-attempt the subscriptions. This becomes particularly concerning that gnmic can't recover in a situation where device is down for planned maintenance, or just can't get an individual RPC call out for some reason.
Error messages where this occurs can be seen below:
The configuration appears as follows: