gnmic not re-attempting targets when unavailable

adietrich-ussignal commented 1 year ago

I've noticed when devices become unavailable, or an unexpected RPC error occurs, there is no attempt to continually re-attempt the subscriptions. This becomes particularly concerning that gnmic can't recover in a situation where device is down for planned maintenance, or just can't get an individual RPC call out for some reason.

Error messages where this occurs can be seen below:

2023/03/27 15:04:52.876435 [gnmic] target "Switch01": subscription intf_counters rcv error: retrying in 10s
2023/03/27 15:04:52.876469 [gnmic] target "Switch01": subscription cpu rcv error: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings"

The configuration appears as follows:

  Switch01:
    address: Switch01.example.com:6030
    timeout: 120s
subscriptions:
  cpu:
    paths:
      - /components/component/cpu/
    mode: STREAM
    stream-mode: sample
    sample-interval: 10s
    encoding: bytes
  intf_counters:
    paths:
      - /interfaces/interface/state/
    mode: STREAM
    stream-mode: sample
    sample-interval: 10s
    encoding: bytes
  bgp_stats:
    paths:
      - /network-instances/network-instance[name=*]/protocols/protocol[identifier=BGP][name=BGP]/bgp/neighbors/neighbor[neighbor-address=*]/afi-safis/afi-safi[afi-safi-name=IPV4_UNICAST]
    mode: STREAM
    stream-mode: sample
    sample-interval: 10s
outputs:
  prometheus:
    type: prometheus
    listen: 0.0.0.0:9804
    path: /metrics
    expiration: 60s
    timeout: 30s

karimra commented 1 year ago

If a target goes down you don't see log entries for retrying a subscription recreation because the underlying gRPC client connection needs to be reestablished first. Once the gRPC client is reestablished, you should see attempts at creating the subscriptions again which should eventually succeed.

I just tested this and it seems to work. I created a subscription, restarted the router, once up gNMIc was able to reestablish the connection.

Was your conclusion based on logs alone or you didn't see any client connected and no subscription attempts on the router side ?

adietrich-ussignal commented 1 year ago

When the target comes back up, I only see a few re-attempts, but then nothing beyond that. Once that occurs, the grpc connection never re-establishes to the target, and I no longer get any data collected for that target in prometheus.

ddidziz commented 1 year ago

I have faced the same problem - if I drop sessions from gNMIc server to target device, occasionally there is data disruption in prometheus. I managed to replicate issue by dropping sessions on firewall - after some time I see sessions re-established but without data received in prometheus. I am fixing this by deleting sessions for problematic target or by deleting target using gNMIc REST API, but of course. I am looking for solution, that doesn't include manual work. :) Also I have observed, that problem occurs with targets, that generates more data.

karimra commented 1 year ago

@ddidziz this sounds like a different issue from the original one. You do see the connection getting reestablished, but no data being available on the output. Do you see data being sent from the router ? How long was the session down before re-establishment and what's the subscription sample-inverval ?

ddidziz commented 1 year ago

Sorry for long time it took me to reply. I did a packet capture to examine this case. Sample interval is 10s, with updates-only: true and heartbeat-interval: 1m One occasion was when I terminated session, router kept sending packets till successful re-connection, it took 1 minute. Gnmic logs says:

[gnmic] target "router1": subscription bgp rcv error: rpc error: code = Unavailable desc = error reading from server: read tcp 172.21.28.196:36184->10.10.10.25:6030: read: connection timed out
[gnmic] target "router1": subscription bgp rcv error: retrying in 10s

Another occasion was when I terminated session and router stopped sending packets after some random time. This time re-connection didn't happen. Gnmic log entry:

[gnmic] target "router1": subscription bgp rcv error: rpc error: code = Internal desc = unexpected EOF
[gnmic] target "router1": subscription bgp rcv error: retrying in 10s

I noticed when unexpected EOF message is received, connection isn't reestablished.

stevemil00 commented 10 months ago

I did a lot of looking at this over the last few days, for sample subscriptions (updates-only is false and there's no heartbeat timeout, or at least I'm not configuring either).

I think that if a client connection has made it to the point of adding subscriptions, it can lose its connection and it'll recover; at least, if I gNMI to something via a ssh tunnel and kill the tunnel once things are up, it'll reestablish.

It doesn't log when it does reestablish the connection or when it resends the subscriptions, and it might be good if it did that. It's easy to jump to the conclusion that the connection dropped and didn't come back up because one sees the "it broke" logs but never any corresponding "it's ok again" logs.

But I am pretty sure that there's a race. You can't just create a Target and then subscribe without connecting explicitly, because target.Client is nil but that gets dereferenced really early on. So you have to use, or it seemed like you had to use, target.CreateGNMICLient() to get connected, then make your subscription requests.

However, if the connection fails between getting connected and making the subscription requests (say, because you have a bunch of consumer threads for different things, and they all have a startup delay, just to pick an example, totally at random 😄 ), the connection -- and, thus, the subscriptions -- will never be reestablished.

How all of this would change for an on-change subscription, or some other subscription that isn't getting data routinely, I am not certain.

openconfig / gnmic

gnmic not re-attempting targets when unavailable #82