purpleidea / mgmt

Next generation distributed, event-driven, parallel config management!
https://purpleidea.com/tags/mgmtconfig/
GNU General Public License v3.0

Limit/Burst need improving #718

Open purpleidea opened 11 months ago

purpleidea commented 11 months ago

I think the limit and burst meta params are neat, but at a minimum they need to be added to the recently added MetaState struct ( https://github.com/purpleidea/mgmt/blob/master/engine/metaparams.go ) for persistence between graph transitions.

Once this is done, please verify they're working properly (including in the "satellite" event loops: https://github.com/purpleidea/mgmt/blob/master/engine/graph/actions.go#L452 ), and then improve the documentation in the metaparams.go file: https://github.com/purpleidea/mgmt/blob/master/engine/metaparams.go#L56
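To make the idea concrete, here is a minimal, self-contained sketch of what "persisting limit/burst across graph transitions" could look like. The struct and field names (MetaState, Limit, Burst) and the carryOver helper are illustrative assumptions, not mgmt's actual API:

```go
package main

import "fmt"

// MetaState is a stand-in for the engine's per-resource state struct.
// The Limit and Burst fields here are hypothetical additions.
type MetaState struct {
	Limit float64 // max events per second allowed
	Burst int     // burst size
}

// carryOver reuses the old state for same-named resources during a
// graph transition, so rate-limiting state survives the swap.
func carryOver(oldStates, newStates map[string]*MetaState) {
	for name, st := range oldStates {
		if _, ok := newStates[name]; ok {
			newStates[name] = st
		}
	}
}

func main() {
	oldStates := map[string]*MetaState{
		"file[/tmp/mgmt/mgmt-count0]": {Limit: 1.0, Burst: 1},
	}
	newStates := map[string]*MetaState{
		"file[/tmp/mgmt/mgmt-count0]": {}, // fresh state after a swap
	}
	carryOver(oldStates, newStates)
	fmt.Println(newStates["file[/tmp/mgmt/mgmt-count0]"].Limit) // 1
}
```

The key design point is that the new graph's resource looks up state by resource name instead of starting from zero.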

I used this code to check they work across many graph transitions. You'll note it doesn't work because of graph swap!

import "fmt"
#import "datetime"
import "test"

#$count = datetime.now()
$count = test.fastcount()

file "/tmp/mgmt/" {
    state => $const.res.file.state.exists,
}

file "/tmp/mgmt/mgmt-count0" {
    content => fmt.printf("count is: %d\n", $count),
    state => $const.res.file.state.exists,

    Meta:limit => 1.0, # 1 event per second max
    Meta:burst => 1, # 1?
}

file "/tmp/mgmt/mgmt-count1" {
    content => fmt.printf("count is: %d\n", $count),
    state => $const.res.file.state.exists,

    Meta:limit => 2.0, # 2 events per second max
    Meta:burst => 1, # 1?
}

file "/tmp/mgmt/mgmt-count2" {
    content => fmt.printf("count is: %d\n", $count),
    state => $const.res.file.state.exists,

    Meta:limit => 0.5, # 0.5 events per second max
    Meta:burst => 1, # 1?
}
1garo commented 8 months ago

I would like to help support the project; it seems really interesting and challenging. I'm currently watching the latest talk and plan to read some of the blog posts. I'd love to start helping already, but I'm still getting my head around mgmt.

purpleidea commented 8 months ago

@1garo Please ping on IRC or email, not here. Thanks!

ffrank commented 4 months ago

This is a very nice example. Working on this, I found several issues, even with rate-limited resources in static graphs that don't get replaced at a high frequency.

After resolving those, I found that now the Graph Worker is racing against the Functions Engine (?) like this:

  1. Worker runs its loop and waits for the limiter delay to expire
  2. Delay expires, Worker enters retry loop
  3. A new Graph gets emitted, and the Worker exits before running Process()

The rate limiter token is lost, and the resource must wait another cycle. With code like in this example, churning out graphs based on test.fastcount, this race is lost by the Worker surprisingly often.
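The race can be modeled deterministically with a toy token bucket. The bucket and worker here are illustrative stand-ins (mgmt's engine uses a real rate limiter), but they show why an early exit after consuming a token costs a full cycle:

```go
package main

import "fmt"

// bucket is a minimal token bucket standing in for the rate limiter.
type bucket struct{ tokens int }

// take consumes one token if available.
func (b *bucket) take() bool {
	if b.tokens > 0 {
		b.tokens--
		return true
	}
	return false
}

// worker models steps 1-3 above: it takes a token, but if a new graph
// was emitted (swapped == true) it exits before running Process(),
// and the token is lost.
func worker(b *bucket, swapped bool) string {
	if !b.take() {
		return "waiting for token"
	}
	if swapped {
		return "exited early, token lost"
	}
	return "Process() ran"
}

func main() {
	b := &bucket{tokens: 1}       // burst of 1
	fmt.Println(worker(b, true))  // graph swap wins the race
	fmt.Println(worker(b, false)) // next cycle must wait for a refill
}
```

With a high graph churn rate (as with test.fastcount), the first branch is taken repeatedly and the resource can starve.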

My conclusion for now is that we probably need to persist the tokens from the rate limiter in a way that even survives replacement of the graph.
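One way to sketch that conclusion: keep limiter state in a registry keyed by resource name, outside of any single graph, so a swap reuses the existing limiter instead of creating a fresh one. The registry and limiterFor names are hypothetical; mgmt's actual limiter lives elsewhere:

```go
package main

import "fmt"

// limiter is a plain token counter standing in for the real limiter.
type limiter struct{ tokens int }

// registry outlives any individual graph, so limiter state persists
// across graph replacement.
var registry = map[string]*limiter{}

// limiterFor returns the persistent limiter for a resource, creating
// it with the configured burst only on first sight.
func limiterFor(name string, burst int) *limiter {
	if l, ok := registry[name]; ok {
		return l // reused across the graph swap
	}
	l := &limiter{tokens: burst}
	registry[name] = l
	return l
}

func main() {
	l1 := limiterFor("file[/tmp/mgmt/mgmt-count0]", 1)
	l1.tokens-- // consume the burst token while graph #1 is running
	// ... graph gets swapped ...
	l2 := limiterFor("file[/tmp/mgmt/mgmt-count0]", 1)
	fmt.Println(l1 == l2, l2.tokens) // same limiter, token still spent
}
```

Whether the registry should be cleaned up when a resource disappears for good is a separate question this sketch ignores.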