LDTips closed this 8 months ago
Thanks for reporting this. I was able to reproduce it, and it is essentially due to this change (https://github.com/grpc/grpc-go/pull/6834) in grpc-go v1.60.0 (gNMIc v0.34.3 uses grpc-go v1.59.0).
In short, the change in grpc-go v1.60.0 makes the gRPC client use the OS TCP keepalive defaults instead of the Go stdlib ones. Most Linux distributions use:
tcp_keepalive_time = 7200s
tcp_keepalive_intvl = 75s
tcp_keepalive_probes = 9
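While the Go stdlib uses 15s, 15s and 10. With the Go values, a broken TCP connection is reset after about 15 + 10 * 15 = 165 seconds. With the OS defaults above, the same detection takes about 7200 + 9 * 75 = 7875 seconds (a bit over two hours), which is why gNMIc appears to never reconnect.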
I tested setting the OS TCP keepalive values with the commands below; gNMIc reconnects successfully once the TCP connection is reset and the router is back up.
sysctl -w \
net.ipv4.tcp_keepalive_time=15 \
net.ipv4.tcp_keepalive_intvl=15 \
net.ipv4.tcp_keepalive_probes=10
I will have to think a bit about the best way to fix this. Ideally I can enable gRPC keepalives by default, provided most gNMI servers out there support it. Or I can use a custom TCP dialer to avoid grpc-go creating one with the OS defaults. I will keep you posted.
Alright, thanks for the confirmation. Do you know how I could apply these sysctl settings to the containerised gNMIc version? Or is the better solution for the time being to use version 0.34 instead?
Depends how you are running the container, docker run has a --sysctl flag, so you can do something like this:
docker run \
--sysctl net.ipv4.tcp_keepalive_time=15 \
--sysctl net.ipv4.tcp_keepalive_intvl=15 \
--sysctl net.ipv4.tcp_keepalive_probes=10 \
-it --rm -p 7890:7890 -v XXXXXX
That typically is not allowed with --net host, so you might want to run it on its own netns. Or modify the host values if that doesn't impact anything else.
Docker compose has similar options for sysctl.
v0.36.1 has a default TCP keepalive of 15s, please check it out.
Quick test shows that the fix works. Thank you for the fix! I will let you know if this issue arises again
While testing gNMIc's behavior when the subscribed device (router) was turned off, I noticed that the session is not always reestablished. The behavior differs between versions: for some gNMIc versions, restarting the router causes gNMIc to stop receiving data, because it does not try to establish a new session. This breaks the continuity of the session; a router restart should not require restarting gNMIc as well.
The logs attached below show the behavior for different versions. For the older 0.34.x version the behavior is correct, but not for the newest 0.36.x.
For 0.34.x
And now for 0.36.x the session does not get established after router restarts