taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0
1.15k stars 130 forks source link

Connection Recovery #201

Closed allen-servedio closed 1 year ago

allen-servedio commented 6 years ago

I am not sure if (or should) anything can be done here, but I am seeing this scenario and wonder if there is a way to have the system recover from this more quickly:

  1. Service has started normally and is able to connect to redis and make requests.
  2. Either redis goes down and/or there is something else that breaks the connection between the service and redis (so a network partition, etc..)
  3. Redis comes back up (or network partition is resolved).
  4. Next attempt by service to contact redis will fail with an EOFException (see below). Any subsequent requests will work because this failure appears to trigger EITHER a reconnect or throwing away the connection (I did not dig into the code to see which).

Is it possible (and is it a good idea?) for Carmine to try to refresh the connection and retry the call once, when this happens, before giving up and failing?

I have not tried this in a highly concurrent environment yet where I would have several connections in my pool so I don't know if this would also happen for each connection in the pool...

Here is the error that I see (truncated to knock out my code):

#error {
 :cause nil
 :via
 [{:type java.io.EOFException
   :message nil
   :at [java.io.DataInputStream readByte DataInputStream.java 267]}]
 :trace
 [[java.io.DataInputStream readByte DataInputStream.java 267]
  [taoensso.carmine.protocol$get_unparsed_reply invokeStatic protocol.clj 122]
  [taoensso.carmine.protocol$get_unparsed_reply invoke protocol.clj 106]
  [taoensso.carmine.protocol$eval20396$get_parsed_reply__20397 invoke protocol.clj 198]
  [taoensso.carmine.protocol$execute_requests invokeStatic protocol.clj 320]
  [taoensso.carmine.protocol$execute_requests invoke protocol.clj 284]
  [taoensso.carmine.protocol$_with_replies invokeStatic protocol.clj 342]
  [taoensso.carmine.protocol$_with_replies invoke protocol.clj 328]
aravindbaskaran commented 6 years ago

@allen-servedio Just to get more context - Is this within a with-new-pubsub-listener or with-new-listener?

If yes, I have the same questions, have been trying to address it here - #207. Or if that is an overkill, have added some extra macros that address your questions here - https://github.com/aravindbaskaran/redis-pubsub

Do let me know your thoughts on that?

kendagriff commented 5 years ago

I'll pile on—we're also getting frequent EOFExceptions, including SocketTimeoutExceptions and "Carmine connection error"s. Our application is NOT using pub/sub, only blocking stuff, and we have a super basic setup.

(def conn {:pool {} :spec {:uri (env :redis-url)}})

We are, however, on Heroku which has a default connection timeout of 300 (seconds). Is that possibly impacting the situation? We're trying to decide whether we need to simply wrap our wcar macro in with-retries type of stuff.

ptaoussanis commented 5 years ago

Hi all-

Haven't had an opportunity to look at this properly yet, but just checking in the meantime: has anyone tried tuning their connection pool config? Carmine doesn't have a built-in retry mechanism, but the pool's behaviour is fairly configurable (for example- you could setup test-before-use).

kendagriff commented 5 years ago

Thanks for the speedy response. I'm seeing test-on-borrow?, test-on-return?, and test-while-idle?. Is that what you're talking about?

kendagriff commented 5 years ago

And I'm assuming they have a performance trade-off?

kendagriff commented 5 years ago

For what it's worth, Heroku responded to my questions about dropped connections:

There are many factors that can cause a Redis connection to die, including networking blips on either the Redis or Dyno side, or network blips in the underlying platform (AWS). But your database connection pool should protect you against this because if a connection fails it should be replaced. You might want to check your client to see if it has connection testing, or periodic pinging configuration options. This can sometimes prevent connections from dying and detect failures early.

@ptaoussanis: Just making sure I understand correctly: When a connection in the pool goes bad, that's only discoverable by a failed query, right?

I implemented a with-retries-type strategy (using again), and it's made a huge improvement. Having said that, however, is there not a case for implementing a retry strategy in carmine? test-on-borrow?, from the little research I've done, appears to introduce a big performance trade-off, which you wouldn't want turned on to address occasional network blips.

I can totally appreciate a case against a retry strategy too (given that retry strategy implies, well, a strategy, and apps might differ in how they want that set up); just curious what your thoughts are...

zcaudate commented 4 years ago

I'm getting the java.io.EOFException as well in testing using 2.20.0-beta2 when a redis instance is restarted. I'm not using a pool, just a standard connection.

The error only goes away when the repl has restarted.

Is there a way to reset this?

ptaoussanis commented 1 year ago

Apologies for the long-delayed reply!

@kendagriff @allen-servedio

Thanks for the speedy response. I'm seeing test-on-borrow?, test-on-return?, and test-while-idle?. Is that what you're talking about?

Correct, that's exactly what I'm talking about 👍 I'd generally keep at least test-on-borrow? and test-while-idle? enabled, especially if you're seeing connection issues.

"test" in this context just means to ping the server and confirm that it responds with a pong. This'll rule out lots of potential issues like an interrupted connection, the server being down, etc.

test-on-borrow? is the most important. If enabled, it'll ensure that the pool will never hand you a faulty connection. If it draws a connection that fails testing, it'll invalidate the connection and retry with another until it finds one that passes.

test-on-borrow?, from the little research I've done, appears to introduce a big performance trade-off, which you wouldn't want turned on to address occasional network blips.

What data were you looking at specifically? test-on-borrow? just introduces a ping on connection acquisition. The impact really should be pretty trivial for (I'd guess) 99+% of use cases. I'd encourage you to bench the performance impact yourself and see if it's relevant in your scenario.

If you're doing something with really exceptional latency requirements, you could consider test-on-return? and test-on-idle? instead - since these won't affect connection acquisition latency.

Edit to add: Carmine v4 will have test-on-borrow? and test-on-idle? enabled by default.

@zcaudate

I'm getting the java.io.EOFException as well in testing using 2.20.0-beta2 when a redis instance is restarted. I'm not using a pool, just a standard connection. The error only goes away when the repl has restarted. Is there a way to reset this?

I'm not sure what you mean by "reset" here, sorry. I'd recommend trying with a connection pool, and enabling connection testing if you're seeing frequent restarts.

Hope that helps! Again apologies everyone for the slow response.