taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0
1.15k stars 131 forks

Auto-reconnect in Subscribe connections? #149

Closed: mpenet closed this issue 8 years ago

mpenet commented 8 years ago

Hi Peter,

Is there any auto-reconnect in SUBSCRIBE? From a quick look at the code it seems that's not the case; disconnections might even be undetectable:

https://github.com/ptaoussanis/carmine/blob/master/src/taoensso/carmine.clj#L354-L361

There seems to be no way for this `while true` loop to ever exit, and exceptions are swallowed, if I am not mistaken?

For some context in that scenario we're using pub/sub and don't really care about dropping messages but reconnection/failover is mandatory. It's also a single redis instance, so using "sentinel" seems not applicable.

Cheers

Max

ptaoussanis commented 8 years ago

Hey Max, long time - how've you been? :-)

Just to make sure we're on the same page, could you define what you mean by failover in this case?

Do you maybe have something like https://github.com/ptaoussanis/carmine/issues/15 in mind?

mpenet commented 8 years ago

All good Peter :}

Yeah, I just noticed you laid the foundations for this: https://github.com/ptaoussanis/carmine/blob/master/src/taoensso/carmine/connections.clj#L32

That's exactly what I had in mind. I kind of need something like this ASAP though, even if it's imperfect (e.g. a heartbeat in another thread/connection to at least detect *some* failures, such as network partitions, and be able to react to them). Or maybe I'll just bake this into our protocol and have the heartbeat in a separate channel within the same connection (this is probably safer).

mpenet commented 8 years ago

Yeah, failover wasn't really the correct term here, my bad; what I really meant is auto-reconnect and/or connection-failure detection.

ptaoussanis commented 8 years ago

What version of Redis are you on? 3.2, or earlier?

mpenet commented 8 years ago

2.8.23 if I recall, but 3.2 is still in "unstable", so I am not sure devops would like that. My guess is 3.0.5 would be the max version allowed there.

ptaoussanis commented 8 years ago

Okay, sounds like you'll need something adhoc then. So for example:

  1. Publisher broadcasts a PING on a loop every x ms
  2. Subscribers timeout+reconnect on a loop every >x ms. On PING receipt, they extend the timeout.

Does that help / make sense?
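[Editor's note] A rough, untested sketch of this scheme against carmine's pub/sub API. The connection map, channel names, and intervals are illustrative assumptions:

```clojure
(ns example.heartbeat
  (:require [taoensso.carmine :as car]))

;; Hypothetical connection options; adjust for your setup.
(def conn {:pool {} :spec {:host "127.0.0.1" :port 6379}})

;; 1. Publisher: broadcast a heartbeat every `interval-ms`.
(defn start-heartbeat! [interval-ms]
  (future
    (while true
      (car/wcar conn (car/publish "heartbeat" "ping"))
      (Thread/sleep interval-ms))))

;; 2. Subscriber: if no heartbeat arrives within `timeout-ms`,
;;    drop the connection and resubscribe from scratch.
(defn start-subscriber! [timeout-ms handler]
  (future
    (while true
      (let [last-ping (atom (System/currentTimeMillis))
            listener
            (car/with-new-pubsub-listener (:spec conn)
              {"heartbeat" (fn [_] (reset! last-ping (System/currentTimeMillis)))
               "my-chan"   handler}
              (car/subscribe "heartbeat" "my-chan"))]
        ;; Poll until the heartbeat goes stale, then reconnect.
        (while (< (- (System/currentTimeMillis) @last-ping) timeout-ms)
          (Thread/sleep 100))
        (car/close-listener listener)))))
```

Messages published on the heartbeat channel double as liveness probes, so subscribers detect a dead connection within roughly one timeout window.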

mpenet commented 8 years ago

That's exactly what I have in mind at the moment, yes. On a channel with an auto-generated internal id, so at least this is local to the Connection instance, and from there do exponential backoff and all.

ptaoussanis commented 8 years ago

Great, good luck! Please feel free to close if you don't run into any issues. Will keep #15 open until this is built in.

mpenet commented 8 years ago

Great. Looking forward to #15.

mpenet commented 8 years ago

Actually we have a problem with the current version: there's no way to shut down the future and its associated thread in https://github.com/ptaoussanis/carmine/blob/master/src/taoensso/carmine.clj#L354-L361 . In the long run this could starve the default thread pool used by `future-call` (which sadly is shared by tons of stuff in Clojure itself, one of my pet peeves). Creating a separate pool via `set-agent-send-off-executor!` to mitigate the issue doesn't solve anything either, since it will (again) be shared by everything else (recurring theme: `clojure.core/future` sucks).

Simply exposing the future used in Listener would solve the issue, then close-listener can perform proper cleanup as well. Would you consider such a change?
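[Editor's note] The proposed change might look something like the following sketch. The record fields and the demo polling loop are hypothetical stand-ins, not carmine's actual implementation:

```clojure
;; Hypothetical shape of the proposal: the listener record keeps a
;; reference to its polling future so cleanup can cancel it.
(defrecord Listener [connection handler state fut])

(defn close-listener!
  "Cancels the listener's polling thread. A real implementation would
  also release the underlying connection (omitted here)."
  [listener]
  (when-let [f (:fut listener)]
    (future-cancel f)))

;; Demo with a stand-in polling loop in place of the real one:
(def l (->Listener nil identity (atom {})
                   (future (while true (Thread/sleep 100)))))
(close-listener! l) ; thread is interrupted instead of leaking
```

Cancelling the future returns the thread to the pool instead of leaving it blocked on the socket forever, which addresses the starvation concern above.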

ptaoussanis commented 8 years ago

Simply exposing the future used in Listener would solve the issue, then close-listener can perform proper cleanup as well. Would you consider such a change?

Roger!

mpenet commented 8 years ago

I am going to close this, but I just wanted to expand the discussion with an idea:
It could be nice to provide utilities (maybe even as separate libs) for command retries and other related tools, for cases of network partitions or other bad situations (reboots, etc.). It could be easy to implement on top of durable-queue (https://github.com/Factual/durable-queue) with its retry/complete features. For instance, a macro/fn that would mark calls as retriable: on failure it would send the failed commands to a (disk-backed) queue and retry them in order when Redis is available again, with the possibility to do this asynchronously or not.

That could work nicely in some scenarios, but it wouldn't be a silver bullet for sure.
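[Editor's note] A minimal sketch of this idea, assuming durable-queue's `put!`/`take!`/`complete!`/`retry!` API; the queue path, command encoding, and supported commands are illustrative:

```clojure
(ns example.retry
  (:require [taoensso.carmine :as car]
            [durable-queue :refer [queues put! take! complete! retry!]]))

(def conn {:pool {} :spec {:host "127.0.0.1" :port 6379}})
(def q (queues "/tmp/carmine-retry" {})) ; disk-backed, survives reboots

;; Commands are queued as plain data, e.g. [:set "k" "v"].
(defn replay! [[op & args]]
  (car/wcar conn
    (case op
      :set   (apply car/set   args)
      :rpush (apply car/rpush args))))

;; Run a command now; on failure park it on the durable queue.
(defn run-or-enqueue! [cmd]
  (try (replay! cmd)
       (catch Exception _ (put! q :retries cmd))))

;; Background loop: replay queued commands once Redis is reachable again.
(defn start-replayer! []
  (future
    (while true
      (let [task (take! q :retries)]
        (try (replay! @task)
             (complete! task)
             (catch Exception _ (retry! task) (Thread/sleep 1000)))))))
```

Encoding commands as data (rather than closures) is what makes them serializable to disk, which is the property the durable queue buys here.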

mpenet commented 8 years ago

Hystrix can be used for these kind of things too, but maybe we don't need something as complete.

ptaoussanis commented 8 years ago

Hi Max,

Not sure what you have in mind exactly; are you suggesting that Carmine add general-use network utils?

It could be easy to implement on top of durable-queue

Not sure I follow, sorry. Is there a reason you'd want to use an external tool instead of just using Redis as a durable queue?

mpenet commented 8 years ago

What I had in mind is something similar to http://mpenet.github.io/alia/qbits.alia.policy.retry.html in the cassandra driver (which works per command, or per "session").

Basically, right now Carmine is very low level and does nothing to handle failures (network or otherwise). If a VM with a Redis server is being rebooted, for instance, one has to handle that in application code at the moment: catching the exceptions and then retrying the commands later, dropping them, or doing something else. What I meant is that Carmine could handle this the same way the Cassandra driver does. This is also something you would find in other Redis clients such as https://github.com/mrniko/redisson.

Then the mention of durable-queue is just an implementation detail; you could use an in-memory queue just fine, durable-queue simply makes it more resilient (it would survive a reboot of the VM where the client sits). Such an implementation could be one example of a retry policy with a durable queue, but it shouldn't be the norm I guess.
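[Editor's note] A per-command retry policy in the spirit of alia's `qbits.alia.policy.retry` could be as small as a wrapper fn; this is a plain-Clojure sketch with hypothetical policy keys, not an API from either library:

```clojure
;; Hypothetical per-command retry policy with exponential backoff.
(defn with-retry
  "Calls (f); on exception retries up to :max-retries times, doubling
  the delay each attempt, then rethrows the last exception."
  [{:keys [max-retries base-delay-ms]
    :or   {max-retries 3 base-delay-ms 100}} f]
  (loop [attempt 0]
    (let [result (try {:ok (f)}
                      (catch Exception e {:error e}))]
      (if-let [e (:error result)]
        (if (< attempt max-retries)
          (do (Thread/sleep (* base-delay-ms (bit-shift-left 1 attempt)))
              (recur (inc attempt)))
          (throw e))
        (:ok result)))))

;; Per-command usage: each call site can carry its own policy, e.g.
;; (with-retry {:max-retries 5} #(car/wcar conn (car/get "k")))
```

Because the policy is just an argument, different commands (idempotent reads vs. non-idempotent writes) can opt into different retry behaviour.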

ptaoussanis commented 8 years ago

What I had in mind is something similar to [...]

Hmm, not sure I'd agree with that making sense in the context of Redis. Seems like a lot of abstraction/machinery for something that'd be easier + clearer to handle with the current set of tools.

If you're in an environment where you're seeing so many timeouts / unavailable nodes that that kind of abstraction starts looking attractive then you should probably just be using Redis sentinel, no? Is there a reason you wouldn't be?

I might be misunderstanding.

Basically right now carmine is very low level, does nothing to handle failures (network or others)

Not sure I'd characterise it that way? What kind of failures do you have in mind besides network failures? And network failures are generally either something you'll want to handle through sentinel (which was designed for the purpose), or in a case-specific way that makes sense for the domain. You'll want to handle these differently based on the data structure + case.

One of the strengths of Redis (and reasons you'd use it) is because you want to be operating at a level where you have control over specific data structures, etc. Trying to generalize or automate policy over that is likely going to be more effort than it's worth.

Might make sense for something like Cassandra though, not sure.

This is also something you would find in other redis clients such

That doesn't seem to be a Redis client so much as a distributed Java-style data structure server implemented over Redis? You're suggesting that you'd like something like that to be in Carmine core?

Such an implementation could be one example of a retry-policy with a durable queue, but shouldn't be the norm I guess.

Yeah, sorry, as I'm understanding this (possibly misunderstanding?), it's definitely not something that I'd want to see in Carmine. If you're looking for a durable message queue, I'd use one of those (e.g. the one provided in Carmine, Disque, etc.). If you're looking for network failover, I'd use sentinel. If you're looking for retry semantics, I'd suggest that you really wouldn't need or want abstractions there; just implement what you need on a case-by-case basis that makes sense for your particular needs + data structures.

I've seen some really horrible Redis deployments (usually in Java applications) where layers of unnecessary abstractions have been installed that make doing even the simplest things hard. In most cases just ripping everything out and working directly against Redis results in less code, less complexity, and better performance.

Feel free to file my response under crotchety/opinionated or just plain misunderstanding ;-)

Just to be clear: if you have any specific ideas, always happy to take a look at PRs! Might be easier to understand each other that way vs discussing in the abstract :-)

mpenet commented 8 years ago

No worries, I don't want carmine to implement all that by default, hence my suggestion to do it as a separate library. I am all for simplicity :}

But handling network failures at the client level is a different thing from what sentinel provides (from what I understand of it). E.g. if your client machine(s) are cut off from the Redis server(s) (e.g. a network partition), sentinel wouldn't help: your application cannot communicate with the servers and must do something (drop commands, retry them later, or whatnot).

It could also be more granular since you could have different retry policies per command.

It's arguable whether that is something you should re-implement for all your applications separately. I mean, in theory it's a possibility, but you end up rewriting the same (boring) stuff over and over; that's why very often now DB drivers provide this feature out of the box.

About durable-queue: I don't want to use it as an alternative to a Redis-backed queue; I was suggesting it could be used as a "retry command" queue internally, in the code that would handle failures, for very high resilience. But forget it; I mean Redis shouldn't be used for critical data IMHO, so maybe it's overkill anyway to have this sort of "durability" here.

mpenet commented 8 years ago

btw, feel free to hit me up on Slack/IRC if you want to talk more about it. It might be easier and less noisy for watchers.

ptaoussanis commented 8 years ago

I mean in theory it's a possibility but you end up rewriting the same (boring) stuff over and over, that's why very often now db drivers provide this feature out of the box.

In fairness, some people say the same thing about Active Record. Not trying to be inflammatory, but think the comparison here actually happens to be useful.

Think this might be the crux of it: my own experience is that there's not all that much to rewrite, and that trying to avoid the rewriting leads to all sorts of painful, leaky abstractions. This is in the specific case of Redis, where the API is super pleasant and the data structures varied; I'm not making a comment about this kind of abstraction for data stores in general.

btw, feel free to hit me up on slack/irc if you want to talk more about it

Thanks, will need to decline unfortunately. Actually pretty pinched for time atm and this isn't something I personally have a need for. Was just providing my 2c in case it turned out to have any use :-)

Re: adding stuff to Carmine - would suggest that the best next step would be some concrete code to help clarify implicit assumptions + context, etc. that I may be missing. Would definitely be open to looking at any specific code/API proposals :-)

mpenet commented 8 years ago

IMHO it's separate from how Redis works and what it provides; it's a valid concern for any networked DB client, and a convenience. There are already libraries dedicated to these kinds of issues and how to solve them, but sometimes, as you say, they are too leaky, and I believe being a bit specialized doesn't hurt. But I personally do not have the need for this at the moment; I was merely trying to spark a discussion about it. I am also in a rush for an end-of-January deadline, but I might share some examples with you after that date.

cheers