j3k0 / ganomede-invitations

Ganomede Invitations

Recover from bad redis connection #13

Open j3k0 opened 7 years ago

j3k0 commented 7 years ago

Scenario:

    failed to EXPIRE redis key `jeko` Error: READONLY You can't write against a read only slave.
        at ReplyParser.<anonymous> (/home/app/code/node_modules/redis/index.js:317:31)
...

Possible solutions:

  1. restart the worker on any redis error (throwing the error should do just that)
  2. run INFO REPLICATION on all linked redis databases from /ping/, to check that we're connected to master instances. Fail if not.
    • This is probably a general issue in all services. A clean change to the ping router (parameterized with all redis connections to check) should make it relatively easy to backport.

Maybe 1 is good enough and easy to put in place, since db requests are pretty centralized?
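A minimal sketch of option 1, assuming the worker runs under a supervisor (e.g. a Docker restart policy) that restarts it on exit; `isFatalRedisError` and `onRedisError` are hypothetical names, not existing code:

```javascript
// Sketch: treat a READONLY redis error as fatal so the supervisor restarts
// the worker with a fresh TCP connection (hopefully to the new master).

function isFatalRedisError(err) {
  // READONLY means we are talking to a read-only slave; a reconnect is needed.
  return /^READONLY/.test((err && err.message) || '');
}

function onRedisError(err) {
  if (isFatalRedisError(err)) {
    console.error('Fatal redis error, exiting for restart:', err.message);
    process.exit(1); // the supervisor is assumed to restart the worker
  }
}
```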

elmigranto commented 7 years ago

I'm wondering whether 1 would work. Wouldn't it connect to the same instance, given the same env vars?

j3k0 commented 7 years ago

Starting a new instance with the same env vars works. It just needs to close the TCP connection with the old master.

elmigranto commented 7 years ago

Are you sure? Redis docs mention a command for obtaining the address of the new master, and broadcasting the configuration in case a failover occurs. This package's readme also mentions something like that.

Or is that part handled by docker image and we simply reconnect to the same host:port? Just want to make sure :)

j3k0 commented 7 years ago

Yes, it should just reconnect to the same host:port; it's supposedly transparent to the client. In prod we connect clients to a so-called "ambassador", an haproxy instance that forwards redis TCP connections to the master host (in priority) and falls back to a slave if the master is not responding, so read operations are still handled.

Thinking about it, the problem is probably that haproxy doesn't close open TCP connections when the master changes. This might be interesting to test and fix (if possible); that way clients wouldn't need any change in this case.
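For what it's worth, haproxy can be told to health-check for the master role and to drop live sessions when a server is marked down, which addresses the stale-connection problem described above. A hypothetical ambassador config sketch (addresses are placeholders, not our actual setup):

```
# Route only to the redis instance reporting role:master, and close live
# client sessions when a server goes down, forcing clients to reconnect.
backend redis_master
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis1 10.0.0.1:6379 check inter 1s on-marked-down shutdown-sessions
    server redis2 10.0.0.2:6379 check inter 1s backup on-marked-down shutdown-sessions
```

The `on-marked-down shutdown-sessions` option is what closes the stale TCP connections; without it, clients keep talking to the demoted slave and get READONLY errors.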

elmigranto commented 7 years ago

and falls back to a slave if master is not responding

Yeah, but if I understand correctly, we want to reconnect to the new master selected by the sentinel failover, so that writes succeed as well. But if we, say, listen for a new master via sentinel pub/sub, query INFO REPLICATION, or switch to ioredis (which has the same API for issuing commands but handles sentinels automatically), we might receive a different host and/or port, which I'm not sure is reachable from our docker container; I assume we only see haproxy from there.
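For reference, the ioredis sentinel setup mentioned above looks roughly like this (host names and the master group name are placeholders; as noted, this only helps if the sentinel-announced addresses are reachable from the container):

```javascript
// Connection options for ioredis sentinel mode: ioredis asks the sentinels
// for the current master of the named group and follows failovers on its own.
const sentinelOptions = {
  sentinels: [
    { host: 'sentinel-1', port: 26379 },
    { host: 'sentinel-2', port: 26379 },
  ],
  name: 'mymaster', // master group name configured in sentinel
};

// Requires the ioredis package:
// const Redis = require('ioredis');
// const redis = new Redis(sentinelOptions);
```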

elmigranto commented 7 years ago

Anyways, what I started doing is wrapping all redis commands in our own class and monitoring errors. If something like in your original post occurs, we'll discard the current client and create a new one.

And later, we can add a check to ping API as well.
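A rough sketch of that wrapper idea (illustrative names only, not the actual code in the attached zip); the client factory is injected so the sketch stays independent of the redis package:

```javascript
// Sketch: proxy each redis command, and recreate the underlying client when
// an error indicates we are talking to a read-only slave.
class RecoveringClient {
  constructor(createClient, shouldReconnect) {
    this.createClient = createClient;
    // default predicate matches the READONLY error from the original report
    this.shouldReconnect = shouldReconnect ||
      ((err) => /^READONLY/.test((err && err.message) || ''));
    this.client = createClient();
  }

  send(command, args, callback) {
    this.client[command](...args, (err, reply) => {
      if (err && this.shouldReconnect(err)) {
        if (this.client.quit) this.client.quit();  // drop the stale connection
        this.client = this.createClient();         // fresh connection, same config
      }
      callback(err, reply);
    });
  }
}
```

In real use, `createClient` would be `() => redis.createClient(options)`; the caller still sees the error, but the next command goes out on a fresh connection.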

elmigranto commented 7 years ago

Okay, here's a first pass; it should work almost as a drop-in replacement for redis (at least for what we use, and with the config object not being a root-level constructor option). It wraps the redis client and, for each command executed, checks with a passed-in function (which tests for READONLY by default) whether we should reconnect. Differences:

Please check it out, let me know what you think. I'm not quite comfortable judging it only by myself (unless you solved this one some other way already). Overall, this sounds okay to me, but if it's possible to switch to ioredis given Docker situation, I'd rather we do that :)

redis-fix.zip

j3k0 commented 7 years ago

I'll check it out in a few days (I'm on the move these days).

Please, for the next task, make sure you check waffle: there are other, more urgent things to take care of. This one is an issue that occurred just once in the last couple of years; nothing I'd classify as high priority.