posilva opened this issue 8 years ago
Yes, this is a good idea, which I've also considered implementing.
The problem with its implementation is: how are you going to build a quickcheck model for it? You need to come up with a good way of describing what "gradually becoming ok" means, and hopefully in a "deterministic" way. One way of doing so is to control the RNG from the model, so you can decide what the outcome of RNG lookups is.
The other problem is how you are going to let a few through. The fuse is an ETS table lookup, so if you flip that to `ok`, then the system will almost surely let more than a few through. So you would need some kind of `{gradual, Pct}` for some percentage, with the RNG controlled by the model.
This, and also its cousin of manually being able to disable/re-enable fuses, are probably two of the most needed features.
If you come up with a better scheme, I can try to figure out if I can build a QC model for that.
Ok, this is doable if we just control the RNG in the test cases, which is fairly easy.
What do you think the configuration should look like? I think there are a number of things here:
`[{6000, 5}, {15*1000, 15}, {80*1000, 80}, {300*1000, 100}]`

would ramp up the system at 6, 15, 80 and 300 seconds, with the amount given as the percentage. I'm pretty sure I could build a quickcheck model for this kind of system, since I can mock the RNG and control its outcome, so I can say what the system should do in the different cases. I could also improve the timing mocking for this.
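As a sketch of those semantics (Python with hypothetical names, not fuse's actual API): given such a ramp schedule and a model-controlled RNG, deciding whether to let a request through might look like this:

```python
import random

# Hypothetical ramp schedule: (ms since the fuse blew, percent allowed).
RAMP = [(6_000, 5), (15_000, 15), (80_000, 80), (300_000, 100)]

def allowed_pct(elapsed_ms, ramp=RAMP):
    """Return the percentage of requests allowed after elapsed_ms."""
    pct = 0  # before the first step, the fuse is still fully blown
    for step_ms, step_pct in ramp:
        if elapsed_ms >= step_ms:
            pct = step_pct
    return pct

def ask(elapsed_ms, rng=random.random):
    # Controlling `rng` from the test model makes the outcome deterministic.
    return rng() * 100 < allowed_pct(elapsed_ms)
```

With `rng` mocked by the model, the model can predict exactly which asks succeed at any point on the ramp.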
More thoughts: the rate should probably be a floating point value between `0.0` and `1.0`. This allows one to supply fractions easily: `1/512`, and so on.

Some implementation plan for a QC model:
Add `jlouis/dht`'s timing component to this model.

A first implementation should probably support a new type of fuse, `{fault_rate, Rate, Intensity, Period}`, which fault-injects one in every `1/Rate` requests on average. This can verify that the above model is in place and works.
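To illustrate the intended semantics (a hedged sketch, names hypothetical): injecting a fault on one in every `1/Rate` asks is the same as blowing with probability `Rate` per request, which is fully deterministic once the model controls the RNG:

```python
import random

def fault_injecting_ask(rate, rng=random.random):
    # Inject a fault with probability `rate` per request, i.e. on
    # average one in every 1/rate requests.
    return "blown" if rng() < rate else "ok"
```

With the RNG mocked, the model can state exactly which asks are fault-injected and check the SUT against that.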
Once you have support for this, it should be easy to add gradual ramping to the system.
The price to pay is the parallel invocation models for this change, as they cannot be handled by such a system. So we would have to keep a parallel model around separately for this.
Hi,
I am not familiar with QuickCheck, but now I have a good chance to learn about it. As soon as I have a model designed and something to show, I will let you know.
We already have most of the model in `test/fuse_eqc.erl`, so that is a starting point. It needs to use component-based models, however, to handle what I'm suggesting above.
Yeah, we've discussed something similar w/ our use of Fuse to handle Solr (and other third-party system) issues (w/ solr_cores) under load. Being able to gradually pass from blown->ok would be a better model of how we expect our fuse-wrapped operations to eventually resolve. I'd be down for reviewing and/or helping w/ QC if there are questions, too, when I'm back around next week.
One important observation is that a standard fuse with a reset of `60*1000` would be a gradual fuse with `[{60*1000, 1.0}]`, in that it goes to the maximal rate in one step. This means we can handle the standard fuses as a special case of gradual fuses, which collapses a lot of the code base.
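In other words (a minimal sketch, with an assumed schedule representation): the standard reset policy desugars into a one-step ramp schedule:

```python
def standard_as_gradual(reset_ms):
    # A standard fuse with reset `reset_ms` is a gradual fuse whose
    # schedule restores the full rate (1.0) in a single step.
    return [(reset_ms, 1.0)]
```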
@jlouis yep... that observation makes 100% sense to me :).
Ok, #10 has a new `fuse_eqc` based on an `eqc_component` model. This model can handle a `fault_injection` type fuse and will verify the RNG components needed to support this issue as well. Things needed:

- `fuse:install/2` must be taught that `fault_injection` fuses are valid, in the model and in the code.
- A `fault_injection` fuse does not push `ok` to the ETS table, but `{gradual, Rate}`. The `fuse:ask/1` command has already been taught to handle this.
- Get `fault_injection` type fuses implemented.

The model has been taught about installing and handling fuses of the `fault_injection` type. This completes the model. We just need to handle the code itself.
Looking forward:

- Use `eqc_component` to directly run the timing component. Create a cluster with timing.
- Generate a rate between `0.0` and `1.0` and replace the default "ok" state of the fuse (`ok` or `{gradual, Rate}`) with it.

We can also have another approach: instead of adding delay to the "ok" state, we can fail even faster if we are in a "gradual interval":
- After the `blown` state's `reset` interval, the fuse will pass to `ok`.
- If a melt occurs during the `gradual` interval, we go back to `blown`; otherwise we keep the `ok` state.
- After the `gradual` interval without a `melt`, we are 100% operational.

With this approach, if the backend service recovers well, we do not lose requests; but if the service starts to fail again, we will have the chance to fail fast/sooner and back off for some short period of time (depending on the fail rate). If the period between fails is small in the `gradual` interval, we will have the chance to back off for more time.

This fuse could be a `fail_fast` type.
I hope this idea is clear enough :)
I think it would make sense that, in a "gradual" setting, we immediately fall back to error if it fails. I also think we can implement this with an `update_counter/3` on the ETS table without accidentally ending up letting too many through. Of course, given the async context, we can't necessarily be totally void of races, but that is okay.
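A rough sketch of that counter idea (Python standing in for the ETS `update_counter` pattern; class and names are hypothetical): each caller atomically takes a ticket, and only a bounded number get through:

```python
import itertools

class TestWindow:
    """Bound how many callers slip through while the fuse is probing."""
    def __init__(self, limit):
        self.limit = limit
        self._counter = itertools.count(1)  # monotone ticket counter

    def ask(self):
        # Each caller atomically takes a ticket; only the first `limit`
        # tickets are let through, everyone else still sees 'blown'.
        ticket = next(self._counter)
        return "ok" if ticket <= self.limit else "blown"
```

An atomic increment-and-compare like this is the reason races stay benign: at worst a couple of extra callers race past the boundary, never an unbounded number.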
Perhaps with a bit more thinking, it is possible to figure out how this fuse type can be added to the system.
The reset policy is a command language. You give commands `[C1, C2, C3, ...]`. The possible commands are:

- `{delay, Ms}`: delay command processing for `Ms` milliseconds. After that, proceed to the next command in the sequence.
- `{test, N}`: let `N` requests through. If they all complete without error, go on in the command sequence. Otherwise, start the command sequence over. This should be implementable with an ETS `update_counter/3,4` style message sequence.
- `{gradual, Rate}`: if `Rate = 0.05`, we are letting 5% of all requests through to the service from here on in the command sequence. If they fail, they are subject to the standard Period/Intensity calculations.
- `heal`: heal the fuse completely.

The standard `{reset, Ms}` is encodable as `[{delay, Ms}, heal]` in this scheme. Gradual ramping is supported, and @posilva's ideas are supported as well. You can get any mix possible: e.g., `[{delay, 60*1000}, {test, 3}, {gradual, 0.25}, {delay, 10*1000}, {gradual, 0.5}, {delay, 10*1000}, heal]` would:

- wait 60 seconds,
- let 3 test requests through (starting over if any of them fails),
- let 25% of requests through for 10 seconds,
- then 50% for another 10 seconds,
- and finally heal the fuse completely.
Hey,
@jlouis with the concept of a reset command sequence, this circuit breaker can deal with any type of backend recovery policy; under pressure we can reinstall/reconfigure and adapt to a specific recovery sequence. And of course this is also a good tool for testing "frontend" systems' behaviour when subjected to "backend" failures.

Nice suggestion!

Could this reset sequence be implemented with a `gen_fsm`?
The way to implement this is to first support a simpler variant, namely a reset policy `{test, K, Ms}`, which will later expand into `[{delay, Ms}, {test, K}]` internally. It is a nice stepping stone toward the final solution we want, and we can then test the fuse behavior without having to implement all of the language in the first place.
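A sketch of that desugaring step (hypothetical Python representation, with tuples for the Erlang terms):

```python
def expand_reset_policy(policy):
    # Desugar the simple reset policies into the command-list form, so
    # only the command interpreter needs real support.
    tag = policy[0]
    if tag == "test":                    # ("test", K, Ms)
        _, k, ms = policy
        return [("delay", ms), ("test", k)]
    if tag == "reset":                   # the standard ("reset", Ms)
        _, ms = policy
        return [("delay", ms), ("heal",)]
    raise ValueError(f"unknown reset policy: {policy!r}")
```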
First, we need to update the model. We need to stop tracking the `blown` state and directly calculate it from the melt history. This is more functional and has fewer moving parts. This allows us to add another way to track that the fuse is in a `testing` state, which becomes a special state of its own.
But by removing the `blown` tracking first and using the `melt_history`, we can avoid having to specify a lot of the interaction between the two states. This hopefully simplifies the model and makes it easier to get correct.
The model update is #11 and it vastly simplifies the model.
The next step is to add tracking in the model of a fuse being in the `test` state:

- Add the `{test, K, Ms}` keyword.
- Let `K` requests through.

It turns out we cannot use #11, so it is back to the drawing board, probably by accepting the complexity of the model and then adding the `{test, K, Ms}` setting.
The `test` command is fundamentally hard to add:

- The tracking has to live in the `fuse_server`. Otherwise we cannot know when we have given out enough `test` requests.
- Suppose we give out `K` tests. If no `melt` is heard within `Ms` milliseconds, heal the fuse. But this requires that you know what the typical timeout is, and while you are in that window of a possible timeout, you can't heal the fuse in the meantime.

It is possible to model all of these considerations, but it gets quite nasty, since we will need `eqc_component` `?BLOCK` style modeling in order to handle this, as well as sessioned `ask/1` commands which may be followed by a `melt/1`. In other words, adding a `{test, ...}` style check has some quite severe, deep implications for fuse. We need to think a bit more on this. The `{gradual, ...}` solution is not limited by these considerations at all, since it only cares about normal melt errors.
New idea, inspired by @lehoff in a loose way:
The thing that is hard with a `{test, K, Ms}` style command is that it starts to encode a lot of policy about what a test is into the `fuse_server`. This is hard to model, since all of a sudden the model needs to take care of not only the `fuse_server`, but also any calling process, as they become part of the correctness of the system as a whole.
But if we supported `{barrier, Term}`, I think we could make things work out:

`{barrier, Term}` stops the fuse at that point in its processing. It returns `{barrier, Term, Token}` when asked. This is a cue to go to your own layered process and solve the barrier. Once you have solved the barrier tests, you either melt the fuse so it explodes again, or you call `fuse:unlock_barrier(Name, Token)`, which unlocks the barrier so the fuse continues its processing.

In turn, we can now support any model you can think up outside the scope of the fuse system itself. A `{test, K, Ms}` model is fairly easy to set up if you just keep a process around that lets `K` callers try, and then unlocks the barrier if all of them are good, or melts the fuse into the blown state if any of them are bad. It keeps the fuse model fairly simple by itself. It punts the hard policy part to a system of your own, and it can support many complex models easily. Most importantly, it is easy to model.
With this model, the caller process has to handle `{barrier, Term}` to control the flow at a certain point in time, after a `blown`, for example; after the reset timeout, the state when asked would be `{barrier, Term, Token}`. Or could we configure the reset to automatically become `{barrier, Term}` after the reset timeout?
Now we have `{reset, Ms}` for `standard` fuses, and we could have a `controlled_fuse` with `{reset, Ms, {barrier, Term}}`: after the reset timeout the fuse will go to this state, and during this state (until we call `fuse:unlock_barrier(Name, Token)`) any `melt` will pass to the `blown` state (otherwise we will not fail fast and will have to wait for all the `melts` again).
An update: I added timing to the EQC model in #12, which has uncovered some bugs in fuse w.r.t. timing. I think @zeeshanlakhani / @lehoff might be interested in these and in fixes for them going forward, so just pinging them :)
I'll probably build a point-release, but I'm not sure it will fly on R16 yet. Backwards compatibility should be fairly easy, though, since there should be a time-compat module and a rand-compat module for handling the backporting. The rest of the code should be R16 safe, I think.
The problem is somewhat benign: if a fuse is melted too much just as it blows, then more than a single timer is set on the fuse. This can lead to fun situations when the timers clear again, though I don't think it will in practice. However, that territory is somewhat undefined behavior :)
If you want me to track the state explicitly for, say, Basho, just open an issue on this repo.
On this issue, though: #12 implements the necessary timing scaffolding which eventually lets us model the proposal in this issue. It is a prerequisite step, since it puts timing under the wings of the model, and we now control time explicitly in an EQC component-based cluster.
With #12 implemented, we can start modeling the real code for the system. This comment describes what is needed:
First, we must introduce a notion in the model of a command list. Given a fuse, its reset policy is given as a list of commands, and we have a "next" command which explains the current state of the fuse. If we have, say, `[{delay, 30}]` and we get a timing event for the delay, we proceed to the state `[]` and then carry this out by healing the fuse.
When the fuse is blown, we start processing this list. We introduce an internal call to "process commands" which then places the model in the correct state. We can implement this in the model without altering the SUT, and we can make it "backwards compatible". Once this is in place, we have the necessary stepping stone to implement the remainder of commands.
- `{delay, Ms}` will set a timer event and then proceed. If the timer triggers, the head must be `{delay, Ms}`; it is chopped off the command list, and we process the remaining command list.
- `heal` is not needed. The empty list encodes heals.
- `{gradual, Level}` is implemented by altering the fuse state to a gradual fuse, then proceeding by executing the next command.
- `{barrier, Term}` puts the fuse into the barrier state. The fuse will keep being in this state until an `unlock_barrier/2` command is executed against the fuse. Once unlocked, the next state proceeds to execute.

We have the first part of a command processor for this issue. It is implemented in #14 and is going into the model soon.
We have `delay` implemented, but we still need to figure out the correct thing to do on `gradual` and `barrier` in the model.
The way you handle this is to alter `lookup_fuse` (an internal command in the model) such that it looks at the current fuse state and hands it back to the system. The fuse state is then encoded as part of a data type. Command processing changes these states accordingly in the fuse, and `lookup_fuse` consults these states to figure out what is happening.
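One illustrative shape for that data type (an assumption for illustration, not fuse's actual code): the fuse state is one of `ok`, `blown`, `{gradual, Rate}`, or `{barrier, Term, Token}`, and `lookup_fuse` dispatches on it:

```python
from dataclasses import dataclass
from typing import Union

# Variants of the fuse-state data type the model could consult.
@dataclass(frozen=True)
class Ok: pass

@dataclass(frozen=True)
class Blown: pass

@dataclass(frozen=True)
class Gradual:
    rate: float           # fraction of requests let through

@dataclass(frozen=True)
class Barrier:
    term: object
    token: str

FuseState = Union[Ok, Blown, Gradual, Barrier]

def lookup_fuse(state: FuseState):
    """What an asking caller observes in each state."""
    if isinstance(state, Ok):
        return "ok"
    if isinstance(state, Blown):
        return "blown"
    if isinstance(state, Gradual):
        return ("gradual", state.rate)
    return ("barrier", state.term, state.token)
```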
New finding:
We need to simplify the model first. To do this, we must introduce a new record, `#fuse{}`, which tracks the state of a fuse. Then we need to take the current states, `disabled` and `blown`, and move them into the fuse record itself.
Once this is complete, it is far easier to handle the above scheme, without going mad trying to do so. Also, the simplification will make it easier to extend the system later on.
Managed to simplify half of the model now. Still need to fold the `disabled` and `melt` states into fuses.
Disabled has now been folded into the fuse state. Still need to work on `melt`.
hi @jlouis, thank you for this very useful library. I was wondering if the half-open state of the circuit breaker eventually got implemented in fuse. I see one or two references to `gradual` in the code, such as this. I don't see this documented in the API reference or tutorial, though. Can you please confirm whether the half-open state is supported by fuse? Thanks.
@ahmadferdous unfortunately it's not there yet. There is some test-code scaffolding in place to make sure it will work, but the code itself doesn't really support this notion as of now. It's one of those things I've been interested in doing at some point, but I got distracted with other stuff for a couple of years, heh.
Necromancy! Hitting this with a Necrobolt of work :)
The model has been brought up to date, and we are now processing "standard" fuses as a command list. In particular, `delay` has been implemented. The next part is to implement the `gradual` command, I think.
Great news
Hi,
I am using fuse to control access to a backend service (DAL API). It would be nice to have some good way of passing from blown to ok gradually. Otherwise, if the backend is under load (502/503), for example, all the requests will come back at the same time after the "heal" interval and can cause problems again.
Thank you,
(I will be able to implement a solution if you think that may be useful)
Pedro