Recover from bad redis connection

j3k0 commented 7 years ago

Scenario:

Master redis host goes down and the slave automatically failed-over (promoted to slave by sentinels).
When old master is back, it is made a slave of the new master.
But for some reason, a ganomede-invitation instance didn't close the connection with the original redis master server. So it's now connected with the slave.
All write operations fail (see error below), service is still marked as "healthy" (replies to ping), but never recover from this status

    failed to EXPIRE redis key `jeko` Error: READONLY You can't write against a read only slave.
        at ReplyParser.<anonymous> (/home/app/code/node_modules/redis/index.js:317:31)
...

Possible solutions:

restart the worker in case of any redis error (throwing the error should do just that)
run INFO REPLICATION on all linked redis databases from /ping/, to see if we're connected to master instances. Fail if not.
- This probably a general issue in all services. A clean change to the ping router (parametrized with all redis connection to check) should make it relatively to backport.

Maybe 1 is good enough and easy to put in place, since db requests are pretty centralized?

elmigranto commented 7 years ago

I'm wondering whether 1 would work. Wouldn't it connect to same instance given same env vars?

j3k0 commented 7 years ago

Starting a new instance with same env var work. It just needs to close the tcp connection with the old master

elmigranto commented 7 years ago

Are you sure? Redis docs mention a command for obtaining address of new master, and broadcasting configuration in case failover occurs. This package's readme also has something like that mentioned.

Or is that part handled by docker image and we simply reconnect to the same host:port? Just want to make sure :)

j3k0 commented 7 years ago

Yes, it should just reconnect to the same host:port, it's supposedly transparent to the client. In prod we connect clients to a so-called "ambassador", an haproxy instance that transmits redis TCP connections to the master host (in priority) and falls back to a slave if master is not responding, so read operations are still handled.

When I think about it, the problem is probably that haproxy doesn't close opened TCP connections when the master changes. This might be interesting to test and fix (if possible), eventually clients won't need any change in this case.

elmigranto commented 7 years ago

and falls back to a slave if master is not responding

Yeah, but If I understand correctly, we want to reconnect to a new master selected by sentinel failover, so that writes succeed as well. But if we, say, listen for a new master with sentinel pub/sub or query for info replication (or switch to ioredis which has same API for issuing commands but handles sentinels automatically), we might receive different host and/or port, which I'm not sure is available from our docker container, I assume we only see haproxy from there.

elmigranto commented 7 years ago

Anyways, what I started doing is to wrap all redis commands with our own class, and monitor errors. If something like in your original post occurs, we'll discard current client and create a new one.

And later, we can add a check to ping API as well.

elmigranto commented 7 years ago

Okay, here's first pass, should work almost as drop-in replacement for redis (at least for what we use and config object not being root level constructor option). It wraps redis client and for each command executed checks whether we should reconnect with passed in function (that tests for READONLY by default). Differences:

multi() does not support nested transactions (so no redis.multi().multi() allowed);
multi() only works via chaining (no args accepted, node-redis docs mention some optional thingy, not sure what that is);
batch() is not supported.

Please check it out, let me know what you think. I'm not quite comfortable judging it only by myself (unless you solved this one some other way already). Overall, this sounds okay to me, but if it's possible to switch to ioredis given Docker situation, I'd rather we do that :)

redis-fix.zip

j3k0 commented 7 years ago

I'll check it out in a few days (I'm on the move those days).

Please for the next task, make sure you check waffle: there are other things more urgent to take care of. This one was an issue that occurred just once in the last couple of years, nothing I'd classify as high priority.

j3k0 / ganomede-invitations

Recover from bad redis connection #13