olivierboucher opened this issue 8 months ago (status: Open)
We are getting the same issue in one of our production environments and I have been able to track down the reason.
First, we are using Redis Sentinel with a master-replica setup in HA, running as a StatefulSet in Kubernetes installed via the bitnami/redis Helm chart. We have configured useHostnames: false when the chart is deployed, so that IPs are used instead of hostnames. I have since read that it may be better to use hostnames when running services inside Kubernetes, but I haven't had time to try that out yet.
Every now and then, the Redis pod inside Kubernetes is rescheduled onto a new worker node, or the Redis deployment is updated so the pod is recreated. Sometimes the pod gets the same IP back and everything works, but often it gets a new one. Sentinel keeps track of which IP is active and should be used, but it also keeps a list of old IPs that are no longer active (https://github.com/bitnami/charts/issues/5418). IConnectionMultiplexer.GetEndPoints() returns a list of all known endpoints, even the ones we can no longer connect to, and when IConnectionMultiplexer.GetServer(endPoint) is executed for one of those stale endpoints it throws an ArgumentException. This blows up and makes the lock throw an exception instead of returning a real result.
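For illustration only, here is a minimal sketch (the method name is mine, not the library's or RedLock's actual code) of how iterating the endpoints and guarding GetServer plays out; it mirrors the try/catch approach from the linked PR:

```csharp
using System;
using StackExchange.Redis;

public static class EndpointDiagnostics
{
    // Dumps every endpoint the multiplexer knows about. GetServer can throw
    // ArgumentException for stale endpoints that no longer have a matching
    // server entry (e.g. after a sentinel failover onto a new pod IP).
    public static void DumpEndpoints(IConnectionMultiplexer multiplexer)
    {
        foreach (var endPoint in multiplexer.GetEndPoints())
        {
            try
            {
                var server = multiplexer.GetServer(endPoint);
                Console.WriteLine($"{endPoint}: type={server.ServerType}, connected={server.IsConnected}");
            }
            catch (ArgumentException)
            {
                // Stale endpoint left over from a previous pod IP.
                Console.WriteLine($"{endPoint}: can't connect to host");
            }
        }
    }
}
```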
I copied the GetHost method into a controller that writes out the resulting information, and after I added the try/catch as in the linked PR the result looks like this:
10.42.12.167:6379 (Cant connect to host), 10.42.12.208:6379 (Cant connect to host), 10.42.34.52:6379 (Cant connect to host), 10.42.34.67:6379 (Cant connect to host), 10.42.12.208:6379 (Cant connect to host), 10.42.12.67:6379 (master, disconnected), 10.42.34.52:6379 (Cant connect to host), 10.42.12.167:6379 (Cant connect to host), 10.42.34.128:6379 (master, disconnected), 10.42.34.52:6379 (Cant connect to host), 10.42.34.67:6379 (Cant connect to host), 10.42.12.67:6379 (master, disconnected), 10.42.34.165:6379 (master), 10.42.34.78:6379 (slave), 10.42.34.128:6379 (master, disconnected), 10.42.12.60:6379 (slave), 10.42.12.67:6379 (master, disconnected), 10.42.34.165:6379 (master), 10.42.34.78:6379 (slave), 10.42.34.128:6379 (master, disconnected), 10.42.12.60:6379 (slave), 10.42.12.67:6379 (master, disconnected)
In general I think this logging can be improved (or cached) to give a better result; as it is now, the connection information is not clear, and it costs resources on every execution that uses the GetHost method.
@samcook Do you have time to look into this and the linked PR? Maybe a patch release with this?
Our workaround of using hostnames instead of IPs did not work because we ran into another issue that put Redis Sentinel into TILT mode, so we had to revert that change.
Hi @Tasteful,
I've had a look and managed to reproduce the issue - it seems like there's a problem with StackExchange.Redis when it loses its current sentinel connection and your Redis instances are on ephemeral IPs (as in Kubernetes): after reconnecting, it still retains some endpoints that it doesn't have matching 'server' entries for.
Anyway, the proposed PR looks like it's probably a reasonable solution. I'll take a look at getting that merged in and pushing out a new release tomorrow.
As an aside though, looking at the StackExchange.Redis behaviour, if you can solve the TILT mode problem it's probably best to use hostname mode, as over time StackExchange.Redis seems to end up with more and more of these phantom connections in its list of endpoints.
Thanks!
Yes, the TILT mode issue actually exists as another report (https://github.com/bitnami/charts/issues/9689), and the solution there is to use IP addresses instead of hostnames :)
Earlier today I created my own build of the RedLock DLL and injected it into the deployment pipeline, and since then we have had no logs about distributed lock exceptions.
FWIW, we were getting this same error, but it came down to the simple issue of us giving existing ConnectionMultiplexer connections to the RedLockFactory and then later disposing those connections while continuing to use the RedLockFactory instance. :facepalm:
I just figured I'd mention this just in case it saves someone else some time.
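To illustrate that pitfall, here is a hedged sketch (placeholder connection string, not the commenter's actual code) of how the factory ends up depending on a disposed multiplexer, assuming the usual RedLockNet.SERedis setup:

```csharp
using System;
using System.Collections.Generic;
using RedLockNet.SERedis;
using RedLockNet.SERedis.Configuration;
using StackExchange.Redis;

// Connect with an existing multiplexer and hand it to the factory.
var connection = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
var factory = RedLockFactory.Create(new List<RedLockMultiplexer> { connection });

// BUG: the factory still holds a reference to this connection.
connection.Dispose();

// Later lock attempts fail because the underlying connection is gone,
// producing the same symptoms as this issue.
using var redLock = await factory.CreateLockAsync("my-resource", TimeSpan.FromSeconds(30));
Console.WriteLine($"Acquired: {redLock.IsAcquired}");
```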
Hi,
We're running into this issue every now and then in our production environment:
It seems to happen randomly and sometimes months go by without encountering the issue.
Here is how we initialize the factory:
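A representative RedLock.net factory initialization, with placeholder endpoints rather than the actual production configuration, might look like this:

```csharp
using System.Collections.Generic;
using System.Net;
using RedLockNet.SERedis;
using RedLockNet.SERedis.Configuration;

// Placeholder endpoints; the real deployment points at the sentinel-managed Redis nodes.
var endPoints = new List<RedLockEndPoint>
{
    new DnsEndPoint("redis1.example.local", 6379),
    new DnsEndPoint("redis2.example.local", 6379),
    new DnsEndPoint("redis3.example.local", 6379)
};

// The factory should be created once and reused for the application's lifetime.
var redLockFactory = RedLockFactory.Create(endPoints);
```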
The code creating the lock:
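Lock acquisition typically follows the using pattern from the RedLock.net README; the resource name and timings below are illustrative placeholders, not the actual values in use:

```csharp
using System;
using System.Threading.Tasks;
using RedLockNet;

public static class OrderProcessor
{
    // Uses the factory created above; resource name and timings are placeholders.
    public static async Task ProcessWithLockAsync(IDistributedLockFactory redLockFactory)
    {
        var resource = "orders:12345";
        var expiry = TimeSpan.FromSeconds(30);
        var wait = TimeSpan.FromSeconds(10);
        var retry = TimeSpan.FromSeconds(1);

        // Blocks until the lock is acquired or the wait time elapses.
        using (var redLock = await redLockFactory.CreateLockAsync(resource, expiry, wait, retry))
        {
            if (redLock.IsAcquired)
            {
                // Do the work that requires the distributed lock.
            }
        }
    }
}
```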
Any idea what the root cause of this sporadic issue could be, or how to mitigate it?