openconfig / gnmic

gNMIc is a gNMI CLI client and collector
https://gnmic.openconfig.net
Apache License 2.0
170 stars 55 forks source link

Session reestablishment issues #383

Closed LDTips closed 6 months ago

LDTips commented 6 months ago

During testing of gNMIc's behavior when the subscription device (router) was turned off, it was noticed that not always a session is reestablished. The behavior regarding session reestablishment differs for different version - for some gNMIc versions, restarting the router causes the gNMIc to stop receiving data, as it does not try establishing a new session. This is an issue for the continuity of the session - router restart should not require to restart gNMIc as well

Below attached see logs showing behavior for different versions. For older 0.34.x version the behavior is correct, but not for the newest 0.36.x

For 0.34.x

$ docker run -it --rm -p 7890:7890 -v $(pwd):/rpc --net host ghcr.io/openconfig/gnmic:0.34.3 subscribe --insecure -a 192.168.122.101 -u admin -p admin --path /state/system/cpu[sample-period=300]/summary/usage --log
WARNING: Published ports are discarded when using host network mode
2024/02/22 08:51:21.389641 [gnmic] version=0.34.3, commit=d461fa9, date=2023-11-14T02:13:11Z, gitURL=https://github.com/openconfig/gnmic, docs=https://gnmic.openconfig.net
2024/02/22 08:51:21.389708 [gnmic] using config file ""
!! {"name":"default-1708591881","paths":["/state/system/cpu[sample-period=300]/summary/usage"],"mode":"stream"}
2024/02/22 08:51:21.391192 [gnmic] starting output type file
2024/02/22 08:51:21.391522 [file_output:default-stdout] initialized file output: {"FileName":"","FileType":"stdout","Format":"json","Multiline":true,"Indent":"  ","Separator":"\n","SplitEvents":false,"OverrideTimestamps":false,"AddTarget":"","TargetTemplate":"","EventProcessors":null,"MsgTemplate":"","ConcurrencyLimit":1,"EnableMetrics":false,"Debug":false,"CalculateLatency":false}
2024/02/22 08:51:21.391740 [gnmic] queuing target "192.168.122.101"
2024/02/22 08:51:21.391770 [gnmic] subscribing to target: "192.168.122.101"
2024/02/22 08:51:21.391857 [gnmic] starting target "192.168.122.101" listener
2024/02/22 08:51:21.395693 [gnmic] target "192.168.122.101" gNMI client created
2024/02/22 08:51:21.395718 [gnmic] sending gNMI SubscribeRequest: subscribe='subscribe:{subscription:{path:{elem:{name:"state"} elem:{name:"system"} elem:{name:"cpu" key:{key:"sample-period" value:"300"}} elem:{name:"summary"} elem:{name:"usage"}}}}', mode='STREAM', encoding='JSON', to 192.168.122.101
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591881",
  "timestamp": 1708591879927314768,
  "time": "2024-02-22T08:51:19.927314768Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14615884
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14615884"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.43"
      }
    }
  ]
}

{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591881",
  "timestamp": 1708591889918090740,
  "time": "2024-02-22T08:51:29.91809074Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14733407
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14733407"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.45"
      }
    }
  ]
}

[ At this point, the router is rebooted ]

2024/02/22 08:53:47.478755 [gnmic] target "192.168.122.101": subscription default-1708591881 rcv error: rpc error: code = Unavailable desc = error reading from server: read tcp 192.168.122.1:42704->192.168.122.101:57400: read: connection reset by peer
2024/02/22 08:53:47.478901 [gnmic] target "192.168.122.101": subscription default-1708591881 rcv error: retrying in 10s
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591881",
  "timestamp": 1708592036407277039,
  "time": "2024-02-22T08:53:56.407277039Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14701125
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14701125"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "33.44"
      }
    }
  ]
}

And now for 0.36.x the session does not get established after router restarts

$ docker run -it --rm -p 7890:7890 -v $(pwd):/rpc --net host ghcr.io/openconfig/gnmic:0.36.0 subscribe --insecure -a 192.168.122.101 -u admin -p admin --path /state/system/cpu[sample-period=300]/summary/usage --log
WARNING: Published ports are discarded when using host network mode
2024/02/22 08:51:16.841513 [gnmic] version=0.36.0, commit=83d62ed, date=2024-02-13T19:29:11Z, gitURL=https://github.com/openconfig/gnmic, docs=https://gnmic.openconfig.net
2024/02/22 08:51:16.841566 [gnmic] using config file ""
2024/02/22 08:51:16.841774 [gnmic] starting output type file
2024/02/22 08:51:16.842098 [file_output:default-stdout] initialized file output: {"FileName":"","FileType":"stdout","Format":"json","Multiline":true,"Indent":"  ","Separator":"\n","SplitEvents":false,"OverrideTimestamps":false,"AddTarget":"","TargetTemplate":"","EventProcessors":null,"MsgTemplate":"","ConcurrencyLimit":1,"EnableMetrics":false,"Debug":false,"CalculateLatency":false}
2024/02/22 08:51:16.842242 [gnmic] queuing target "192.168.122.101"
2024/02/22 08:51:16.842264 [gnmic] subscribing to target: "192.168.122.101"
2024/02/22 08:51:16.842293 [gnmic] starting target "192.168.122.101" listener
2024/02/22 08:51:16.846160 [gnmic] target "192.168.122.101" gNMI client created
2024/02/22 08:51:16.846192 [gnmic] sending gNMI SubscribeRequest: subscribe='subscribe:{subscription:{path:{elem:{name:"state"} elem:{name:"system"} elem:{name:"cpu" key:{key:"sample-period" value:"300"}} elem:{name:"summary"} elem:{name:"usage"}}}}', mode='STREAM', encoding='JSON', to 192.168.122.101
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591876",
  "timestamp": 1708591875378015875,
  "time": "2024-02-22T08:51:15.378015875Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14517980
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14517980"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.41"
      }
    }
  ]
}

{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591876",
  "timestamp": 1708591885368041968,
  "time": "2024-02-22T08:51:25.368041968Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14719334
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14719334"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.44"
      }
    }
  ]
}
{
  "source": "192.168.122.101",
  "subscription-name": "default-1708591876",
  "timestamp": 1708591895368374507,
  "time": "2024-02-22T08:51:35.368374507Z",
  "prefix": "state/system/cpu[sample-period=300]/summary/usage",
  "updates": [
    {
      "Path": "cpu-time",
      "values": {
        "cpu-time": 14742859
      }
    },
    {
      "Path": "time-used",
      "values": {
        "time-used": "14742859"
      }
    },
    {
      "Path": "cpu-usage",
      "values": {
        "cpu-usage": "2.45"
      }
    }
  ]
}

[ At this point, the router is rebooted ]
[ And nothing more happens since then ]
karimra commented 6 months ago

Thanks for reporting this, I was able to reproduce and it's essentially due to this change (https://github.com/grpc/grpc-go/pull/6834) in grpc-go v1.60.0. (gNMIc v0.34.3 uses grpc-go v1.59.0)

In short, the change in grpc-go v1.60.0 makes the gRPC client use the OS TCP keepalive defaults instead of the Go stdlib ones. Most linux dist use:

tcp_keepalive_time = 7200s
tcp_keepalive_intvl = 75s
tcp_keepalive_probes =  9

While Go stdlib uses 15s, 15s and 10. This results in a broken TCP connection being reset after about 15 + 10 * 15 = 165 seconds

I tested setting the OS TCP keep alive using with the below commands, gNMIc reconnects successfully once the TCP connection is reset and the router is back up.

sysctl -w \
  net.ipv4.tcp_keepalive_time=15 \
  net.ipv4.tcp_keepalive_intvl=15 \
  net.ipv4.tcp_keepalive_probes=10

I will have to think a bit about the best way to fix this, ideally I can enable gRPC keepalives by default providing most gNMI servers out there support it. Or use a custom TCP Dialer to avoid grpc-go creating one with the OS defaults. I will keep you posted.

LDTips commented 6 months ago

Alright thanks for the confirmation. Do you know how could I fix this issue with these sysctl rules for the containerised gNMIc version? Or is just the better solution for the time being to use 0.34 version instead?

karimra commented 6 months ago

Depends how you are running the container, docker run has a --sysctl flag, so you can do something like this:

docker run \
--sysctl net.ipv4.tcp_keepalive_time=15 \
--sysctl net.ipv4.tcp_keepalive_intvl=15 \
--sysctl net.ipv4.tcp_keepalive_probes=10 \
-it --rm -p 7890:7890 -v XXXXXX

That typically is not allowed with --net host, so you might want to run it on its own netns. Or modify the host values if that doesn't impact anything else. Docker compose has similar options for sysctl.

karimra commented 6 months ago

v0.36.1 has a default TCP keepalive of 15s, please check it out.

LDTips commented 6 months ago

Quick test shows that the fix works. Thank you for the fix! I will let you know if this issue arises again