jlouis / fuse

A Circuit Breaker for Erlang
MIT License

Control the service restart after blown #9

Open posilva opened 8 years ago

posilva commented 8 years ago

Hi,

I am using fuse to control access to a backend service (DAL API). It would be nice to have a good way of passing from blown to ok gradually. Otherwise, if the backend is under load (502/503), for example, all the requests will hit it again at the same time after the "heal" interval, which can cause the same problems again.

Thank you,

(I will be able to implement a solution if you think that may be useful)

Pedro

jlouis commented 8 years ago

Yes, this is a good idea, which I've also considered implementing.

The problem with its implementation is "how are you going to build a quickcheck model for it?". You need to come up with a good way of describing what "gradually becoming ok" means, and hopefully in a "deterministic" way. One way of doing so is to control the RNG from the model so you can decide what the outcome of RNG lookups is.

The other problem is how you are going to let a few through. The fuse is an ETS table lookup, so if you flip that to 'ok' then the system will almost surely let a few through. So you would need some kind of "{gradual, Pct}" for some percentage, with the RNG controlled by the model.
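A minimal sketch of the "{gradual, Pct}" idea with the RNG passed in, so that a model can decide the outcome of each lookup. The module name, the ask/2 function and the shape of the state are assumptions made for illustration; none of this is the fuse API.

```erlang
-module(gradual_sketch).
-export([ask/2]).

%% Hypothetical helper, not part of fuse.
%% State is ok | blown | {gradual, Pct}, with Pct between 0.0 and 1.0.
%% Rng is a zero-arity fun returning a float in [0.0, 1.0). Production
%% code could pass fun rand:uniform/0; a test model passes a fixed fun.
ask(ok, _Rng) ->
    ok;
ask(blown, _Rng) ->
    blown;
ask({gradual, Pct}, Rng) when is_function(Rng, 0) ->
    %% Let roughly a fraction Pct of the callers through.
    case Rng() < Pct of
        true -> ok;
        false -> blown
    end.
```

With a fixed draw the result is fully determined, e.g. gradual_sketch:ask({gradual, 0.25}, fun() -> 0.1 end) lets the caller through, which is the property a model needs.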

This, and also its cousin of manually being able to disable/reenable fuses, are probably two of the most needed features.

If you come up with a better scheme, I can try to figure out if I can build a QC model for that.

jlouis commented 8 years ago

Ok, this is doable if we just control the RNG in the test cases, which is fairly easy.

What do you think the configuration should look like? I think there are a number of things here:

I'm pretty sure I could build a quickcheck model for this kind of system, since I can mock the RNG and control its outcome, so I can say what the system should do in the different cases. I could also improve the timing mocking for this.

jlouis commented 8 years ago

More thoughts:

jlouis commented 8 years ago

Some implementation plan for a QC model:

A first implementation should probably support a new type of fuse {fault_rate, Rate, Intensity, Period} which fault-injects every 1/Rate requests on average. This can verify the above model is in place and works.
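As a rough sketch of the fault-injecting behaviour, assuming Rate is a probability so that a fault lands every 1/Rate requests on average. The draws are supplied by the caller, the way a model would script the RNG. The module and function names are hypothetical, and the Intensity and Period parts of the fuse are left out.

```erlang
-module(fault_rate_sketch).
-export([ask/2, run/2]).

%% Hypothetical helper, not part of fuse.
%% Draw is a float in [0.0, 1.0) taken from the (possibly mocked) RNG.
%% A fault is injected when the draw falls below Rate.
ask(Rate, Draw) when Draw < Rate -> blown;
ask(_Rate, _Draw) -> ok.

%% Replay a scripted list of draws, one per request, as a test model might.
run(Rate, Draws) ->
    [ask(Rate, Draw) || Draw <- Draws].
```

Because the draws are scripted, the model knows exactly which requests should see an injected fault.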

Once you have support for this, it should be easy to add gradual ramping to the system.

The price to pay for this change is the parallel invocation models, as they cannot be handled by such a system. So we would have to keep a parallel model around separately for this.

posilva commented 8 years ago

Hi,

I am not familiar with QuickCheck, but now I have a good chance to learn about it. As soon as I have a model designed and something to show, I will let you know.

jlouis commented 8 years ago

We already have most of the model in test/fuse_eqc.erl, so that is a starting point. It needs to use component based models however, to handle what I'm suggesting above.

zeeshanlakhani commented 8 years ago

Yeah, we've discussed something similar w/ our use of Fuse to handle Solr (and other third-party systems) issues (w/ solr_cores) under load. Being able to gradually pass from blown->ok would be a better model of how we expect our fuse-wrapped operations to eventually resolve. I'd be down for reviewing and/or helping w/ QC if there are questions too when I'm back around next week.

jlouis commented 8 years ago

One important observation is that a standard fuse with a reset of 60*1000 would be a gradual fuse with [{60*1000, 1.0}] as in it goes to maximal rate in one step. This means we can handle the standard fuses as a special case of gradual fuses, which collapses a lot of the code base.
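To make the special case concrete, here is a small sketch. It assumes a gradual policy is a list of {Ms, Rate} steps where each step means "wait Ms more milliseconds, then allow Rate"; that reading, and all the names, are assumptions for illustration.

```erlang
-module(gradual_policy_sketch).
-export([to_gradual/1, rate_at/2]).

%% Hypothetical helpers, not part of fuse.
%% A standard {reset, Ms} policy becomes a one-step gradual policy.
to_gradual({reset, Ms}) when is_integer(Ms), Ms >= 0 ->
    [{Ms, 1.0}];
to_gradual(Steps) when is_list(Steps) ->
    Steps.

%% Pass rate Elapsed milliseconds after the fuse blew.
%% Nothing is let through until the first step has been reached.
rate_at(Steps, Elapsed) ->
    rate_at(Steps, Elapsed, 0.0).

rate_at([{Ms, Rate} | Rest], Elapsed, _Current) when Elapsed >= Ms ->
    rate_at(Rest, Elapsed - Ms, Rate);
rate_at(_Steps, _Elapsed, Current) ->
    Current.
```

Under this reading, a standard fuse and a multi-step ramp go through the same rate_at/2 code path, which is where the code base would collapse.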

zeeshanlakhani commented 8 years ago

@jlouis yep... that observation makes 100% sense to me :).

jlouis commented 8 years ago

Ok, #10 has a new fuse_eqc based on an eqc_component model. This model can handle a fault_injection type fuse and will verify the RNG components needed to support this issue as well. Things needed:

jlouis commented 8 years ago

The model has been taught about installing and handling fuses of fault_injection type. This completes the model. We just need to handle the code itself.

Looking forward:

posilva commented 8 years ago

We can also take another approach: instead of adding delay to the "ok" state, we can fail even faster if we are in a "gradual interval":

  1. The fuse enters the blown state.
  2. After the reset interval it passes to ok.
  3. If it fails within the short gradual interval, we go back to blown; otherwise we stay ok.
  4. After a gradual interval without a melt, we are 100% operational.

With this approach, if the backend service recovers well, we do not lose requests. But if the service starts to fail again, we have the chance to fail fast/sooner and back off for a short period of time (depending on the fail rate). If the period between failures in the gradual interval is small, we have the chance to back off for longer.
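The four steps can be read as a small transition function. This is only a sketch: the state name probation (the "ok, but still inside the gradual interval" phase) and the event names are made up for illustration, and the normal intensity/period rule for a fully ok fuse is not modelled.

```erlang
-module(fail_fast_sketch).
-export([next/2]).

%% Hypothetical transition function, not part of fuse.
%% States: blown | probation | ok.
%% Events: reset_timeout | melt | gradual_timeout.
next(blown, reset_timeout) -> probation;   % step 2: let requests through again
next(probation, melt) -> blown;            % step 3: a single failure blows at once
next(probation, gradual_timeout) -> ok;    % step 4: fully operational
next(State, _Event) -> State.              % everything else leaves the state alone
```

Folding a list of events over next/2 shows the two outcomes: a melt during probation returns to blown, while a quiet gradual interval ends in ok.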

This fuse could be a fail_fast type.

I hope this idea is clear enough :)

jlouis commented 8 years ago

I think it would make sense that in a "gradual" setting, we immediately fall back to error if it fails. I also think we can implement this with an update_counter/3 to the ETS table without accidentally ending up letting too many through. Of course, given the async context, we can't necessarily be totally void of races, but that is okay.
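One way the update_counter/3 idea could look, as a sketch: keep a counter of remaining "let through" slots per fuse and decrement it atomically on every ask. The table layout and the function names are assumptions; only ets:update_counter/3 with a {Pos, Incr, Threshold, SetValue} operation is standard OTP.

```erlang
-module(ets_tokens_sketch).
-export([new/2, ask/2]).

%% Hypothetical helpers, not part of fuse.
%% Create a table holding {Name, RemainingTokens} for one fuse.
new(Name, Tokens) when is_integer(Tokens), Tokens >= 0 ->
    Tab = ets:new(fuse_tokens_sketch, [set, public]),
    true = ets:insert(Tab, {Name, Tokens}),
    Tab.

%% Atomically take one token. The counter in position 2 is decremented
%% by one; if the result would drop below -1 it is reset to -1, so the
%% value never drifts no matter how many callers ask.
ask(Tab, Name) ->
    case ets:update_counter(Tab, Name, {2, -1, -1, -1}) of
        N when N >= 0 -> ok;      % a token was available
        _ -> blown                % no tokens left
    end.
```

Since the decrement is atomic, at most Tokens callers see ok, even when many processes ask concurrently. Races around refilling or resetting the counter are outside this sketch.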

Perhaps with a bit more thinking, it is possible to figure out how this fuse type can be added to the system.

jlouis commented 8 years ago

The reset policy is a command language. You give commands [C1, C2, C3, ...]. The possible commands are:

The standard {reset, Ms} is encodable as [{delay, Ms}, heal] in this scheme. Gradual ramping is supported, and @posilva's ideas are supported as well. You can get any mix possible: e.g., [{delay, 60*1000}, {test, 3}, {gradual, 0.25}, {delay, 10*1000}, {gradual, 0.5}, {delay, 10*1000}, heal] would:

posilva commented 8 years ago

Hey,

@jlouis with the concept of having a reset command sequence, this circuit breaker can deal with any type of backend recovery policy. Under pressure we can reinstall/reconfigure and adapt to a specific recovery sequence. And of course this is also a good tool to test the behaviour of "frontend" systems when subject to "backend" failures.

Nice suggestion!!!

This reset sequence could be implemented with a gen_fsm?

jlouis commented 8 years ago

The way to implement this is to first support a simpler variant, namely a reset policy {test, K, Ms} which will later expand into the notion of [{delay, Ms}, {test, K}] internally. It is a nice stepping stone toward the final solution we want, and we can then test the fuse behavior without having to implement all of the language in the first place.
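The two encodings mentioned in this thread, {reset, Ms} as [{delay, Ms}, heal] and {test, K, Ms} as [{delay, Ms}, {test, K}], could be written as a single expansion step. The module and function names are hypothetical.

```erlang
-module(reset_policy_sketch).
-export([expand/1]).

%% Hypothetical helper, not part of fuse.
%% Turn the short-hand reset policies into command lists.
expand({reset, Ms}) ->
    [{delay, Ms}, heal];
expand({test, K, Ms}) ->
    [{delay, Ms}, {test, K}];
expand(Commands) when is_list(Commands) ->
    %% Already a command list: pass it through untouched.
    Commands.
```

Keeping the expansion at installation time would mean the rest of the system only ever sees command lists.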

jlouis commented 8 years ago

First, we need to update the model. We need to stop tracking the blown state and directly calculate it from the melt history. This is more functional and has fewer moving parts. This allows us to add another way to track that the fuse is in a testing state, which becomes a special state of its own.

But by removing the blown tracking first and using the melt_history we can avoid having to specify a lot of the interaction between the two states. This hopefully simplifies the model and makes it easier to get correct.
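A sketch of what "calculate it from the melt history" could mean, assuming the history is a list of melt timestamps in milliseconds and that a fuse tolerates Intensity melts within Period before it blows. Both the representation and the names are assumptions, and reset handling is ignored entirely.

```erlang
-module(melt_history_sketch).
-export([is_blown/4]).

%% Hypothetical helper, not the actual model code.
%% History is a list of melt timestamps (ms). The fuse counts as blown
%% when more than Intensity melts fall inside the last Period ms.
is_blown(History, Now, Intensity, Period) ->
    Recent = [T || T <- History, T > Now - Period, T =< Now],
    length(Recent) > Intensity.
```

The function keeps no state of its own, which is the "fewer moving parts" property: the same history and clock always give the same answer.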

jlouis commented 8 years ago

The model update is #11 and it vastly simplifies the model.

The next step is to add tracking in the model of a fuse being in the test state:

jlouis commented 8 years ago

It turns out we cannot use #11, so it is back to the drawing board, probably by accepting the complexity of the model and then adding the {test, K, Ms} setting.

jlouis commented 8 years ago

The test command is fundamentally hard to add:

It is possible to model all of these considerations, but it gets quite nasty, since we will need eqc_component ?BLOCK style in order to handle this, as well as sessioned ask/1 commands which may be followed by a melt/1. In other words, adding a {test, ...} style check has some quite severe, deep implications for fuse. We need to think a bit more on this. The {gradual, ...} solution is not limited by these considerations at all, since it only cares about normal melt errors.

jlouis commented 8 years ago

New idea, inspired by @lehoff in a loose way:

The thing that is hard with a {test, K, Ms} style command is that it starts to encode a lot of policy about what a test is into the fuse_server. This is hard to model, since all of a sudden, the model needs to take care of not only the fuse_server, but also any calling process, as they become part of the correctness of the system as a whole.

But if we supported {barrier, Term} we could make things work out I think:

In turn, we can now support any model you can think up outside the scope of the fuse system itself. A {test, K, Ms} model is fairly easy to set up, if you just keep a process around to resolve letting K callers try and then unlock if all of them are good, or melt the fuse into the blown state if any of them are bad. It keeps the fuse model fairly simple by itself. It punts the hard policy part to a system of your own, and it can support many complex models easily. Most importantly, it is easy to model.
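The decision logic such an external process would keep for a {test, K, Ms} style policy might look like the sketch below. It only decides what to do; actually unlocking the barrier or melting the fuse is left to the caller, and every name here is hypothetical.

```erlang
-module(barrier_test_sketch).
-export([start/1, report/2]).

%% Hypothetical helper living outside fuse.
%% Start a test round that needs K good results in a row.
start(K) when is_integer(K), K > 0 ->
    {testing, K}.

%% Feed in the result of one test caller.
%% Returns the next state, or the decision unlock | melt.
report({testing, 1}, good) ->
    unlock;                          % all K callers succeeded
report({testing, K}, good) when K > 1 ->
    {testing, K - 1};                % keep testing
report({testing, _K}, bad) ->
    melt.                            % any failure sends the fuse back to blown
```

Since this lives outside fuse_server, the fuse model only has to know that a barrier exists and can be unlocked, not what a "test" means.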

posilva commented 8 years ago

With this model, the caller process has to call {barrier, Term} to control the flow at a certain point in time (after a blown, for example), and after the reset timeout the state when asked would be {barrier, Term, Token}. Or could we configure the reset to automatically become {barrier, Term} after the reset timeout?

Now we have {reset, Ms} for standard fuses, and we could have a controlled_fuse with {reset, Ms, {barrier, Term}}. After the reset timeout the fuse goes to this state, and during this state (until we call fuse:unlock_barrier(Name, Token)) any melt moves it to the blown state (otherwise we would not fail fast and would have to wait for all the melts to accumulate again).

jlouis commented 8 years ago

An update: I added timing to the EQC model in #12 which has uncovered some bugs in fuse w.r.t. timing. I think @zeeshanlakhani / @lehoff might be interested in these and fixes of them going forward, so just pinging them :)

I'll probably build a point-release, but I'm not sure it will fly on release 16 yet. Backwards compatibility should be fairly easy though since there should be a time-compat module and a rand-compat module for handling the backporting. The rest of the code should be R16 safe, I think.

The problem is somewhat benign: if a fuse is melted too much just as it blows, then more than a single timer is set on the fuse. This can lead to fun situations when the timer clears again, but I don't think it will. However, that world is somewhat undefined behavior :)

If you want me to track the state explicitly for, say, Basho, just open an issue on this repo.

jlouis commented 8 years ago

On this issue though: #12 implements the necessary timing scaffolding which eventually lets us model the proposal in this issue. It is a prerequisite step since it puts timing under the wings of the model and we now control time explicitly in an EQC component based cluster.

jlouis commented 8 years ago

With #12 implemented, we can start modeling the real code for the system. This comment describes what is needed:

First, we must introduce a notion of a command list in the model. Given a fuse, its reset policy is given as a list of commands, and we have a "next" command which explains the current state of the fuse. If we have, say, [{delay, 30}] and we get a timing event for the delay, we proceed to the state [] and then carry this out by healing the fuse.

When the fuse is blown, we start processing this list. We introduce an internal call to "process commands" which then places the model in the correct state. We can implement this in the model without altering the SUT, and we can make it "backwards compatible". Once this is in place, we have the necessary stepping stone to implement the remainder of commands.
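A minimal sketch of such a "process commands" step, covering only delay and heal. It returns the action to carry out together with the remaining commands; the action names and the module are assumptions, not the code from the model.

```erlang
-module(command_list_sketch).
-export([process/1]).

%% Hypothetical helper, not the actual model code.
%% Look at the next command and return {Action, RemainingCommands}.
process([]) ->
    %% Nothing left to do: the fuse heals.
    {heal, []};
process([heal | _Rest]) ->
    {heal, []};
process([{delay, Ms} | Rest]) ->
    %% Wait for a timing event, then continue with the rest.
    {{set_timer, Ms}, Rest}.
```

With [{delay, 30}], the first call asks for a 30 ms timer and leaves []; when the timing event arrives, processing [] heals the fuse, matching the example above.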

jlouis commented 8 years ago

We have the first part of a command processor for this issue. It is implemented in #14 and is going into the model soon.

jlouis commented 8 years ago

We have delay implemented, but we still need to figure out what the correct thing to do is for gradual and barrier in the model.

jlouis commented 8 years ago

The way you handle this is to alter lookup_fuse (an internal command in the model) such that it looks at the current fuse state and hands it back to the system. The fuse state is then encoded as part of a data type:

Command processing changes these states accordingly in the fuse, and lookup_fuse consults these states to figure out what is happening.

jlouis commented 8 years ago

New finding:

We need to simplify the model first. To do this, we must introduce a new record #fuse{} which tracks the state of a fuse. Then we need to take the current states, disabled and blown, and move them into the fuse record itself.

Once this is complete, it is far easier to handle the above scheme, without going mad trying to do so. Also, the simplification will make it easier to extend the system later on.
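Roughly, the per-fuse record could look like this sketch. The field names, and the choice that a disabled fuse answers blown, are assumptions for illustration and not the actual model code.

```erlang
-module(fuse_record_sketch).
-export([new/1, disable/1, enable/1, blow/1, status/1]).

%% Hypothetical record: one value per fuse instead of separate
%% disabled and blown collections in the model state.
-record(fuse, {name, disabled = false, blown = false}).

new(Name) -> #fuse{name = Name}.

disable(F = #fuse{}) -> F#fuse{disabled = true}.
enable(F = #fuse{}) -> F#fuse{disabled = false}.
blow(F = #fuse{}) -> F#fuse{blown = true}.

%% What a lookup would report for this fuse.
status(#fuse{disabled = true}) -> blown;
status(#fuse{blown = true}) -> blown;
status(#fuse{}) -> ok.
```

With everything about a fuse in one value, later states such as gradual or barrier would become extra fields rather than new top-level pieces of model state.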

jlouis commented 8 years ago

Managed to simplify half of the model now. Still need to simplify disabled and melt states into fuses.

jlouis commented 8 years ago

Disabled has now been folded into the fuse state. Still need to work on melt.

ahmadferdous commented 3 years ago

hi @jlouis , thank you for this very useful library. I was wondering if the half-open state of circuit breaker eventually got implemented in fuse. I see one or two references to gradual in code such as this. I don't see this documented in the API reference or tutorial though. Can you please confirm if half-open state is supported by fuse? Thanks.

jlouis commented 3 years ago

@ahmadferdous unfortunately it's not there, yet. There is some test-code scaffolding in place to make sure it will work, but the code itself doesn't really support this notion as of now. It's one of those things I've been interested in doing at some point, but I got distracted with other stuff for a couple of years, heh.

jlouis commented 3 years ago

Necromancy! Hitting this with a Necrobolt of work :)

The model has been brought up-to-date, and we are now processing "standard" fuses as a command list. In particular delay has been implemented. The next part is to implement the gradual command I think.

posilva commented 3 years ago

Great news