mailgun / gubernator

High Performance Rate Limiting MicroService and Library
Apache License 2.0

Peer list update bug in K8s cluster #189

Open MAXEE998 opened 1 year ago

MAXEE998 commented 1 year ago

We ran a three-replica gubernator setup in our K8s cluster. When K8s gracefully shut down one pod, one of the remaining pods (only one, not all of them) kept reporting

level=error msg="Error in client.GetPeerRateLimits" batchTimeout=500ms category=gubernator error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" queueLen=2

in the log.

Apparently, it didn't update its peer list after the shutdown. What could be causing this?

MAXEE998 commented 1 year ago

According to the log, the affected pod keeps requesting rate limits from the peer that was shut down:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 100.103.255.29:81: i/o timeout"

thrawn01 commented 12 months ago

I don't run a k8s cluster, so I really don't have a way to test this. I rely on the community to provide support for k8s.

miparnisari commented 10 months ago

FYI, this isn't limited to k8s. We run on ECS and see something similar. These logs seem to coincide with our deployments.

time="2023-10-25T23:47:56Z" 
level=error msg="error sending global hits to '10.0.37.143:9990'" 
category=gubernator 
error="Error in client.GetPeerRateLimits: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.37.143:9990: connect: connection refused\""

I need to do some research on my end to see whether it's a bug in our service or in this library.