austin-space opened 1 year ago
Brainstorm thought I had while looking at this - we would need to store on the autoscaler CRD if the scheduling had happened or not (past 3 entries?) just in case the controller goes down over the scheduling time period, so it can do the work it was supposed to do before.
I'm not sure that you would need to. This is intended to be stateless, so it just pretends that it's a normal buffer autoscaler. It just so happens that the current buffer rule would be determined by the highest priority active override.
That being said, it's probably worth updating the CRD with the last scale rule used in order to get a sense of why it might be scaling to some level (or not). In that case a name field would be needed for any overrides.
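(For illustration only, a hypothetical status field - not something that exists in the CRD today:)

```yaml
# Hypothetical FleetAutoscaler status addition (illustrative only):
status:
  # Name/id of the override or scale rule that produced the last scaling decision.
  lastAppliedPolicy: "event-override"
```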
Oh I see - so the actual result would be something like:
- default buffer: 10%
- Between 1am and 5am: scale buffer 5% (low peak time)
- Between 1pm and 5pm: scale buffer 20% (high peak time)
Then on each autoscaler loop you would just be looking at which buffer to apply. That makes a lot of sense 👍🏻
I had it in my head more of a "at 1pm exactly do this thing", but this is better. I like it!
Yeah, that seemed like the way to keep the changes to the autoscaler as minimal and resilient as possible.
Your comment did remind me that there's an important distinction here between "scheduled once" and "scheduled recurring". Ideally I want to be able to do both (which could be as easy as specifying just a time when you want daily recurrence, and a datetime when you want a single occurrence), but I suspect that there will be a desire for day of week recurrence. For now I feel like that can probably be avoided for simplicity, since that complexity can spiral out of control quickly.
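(To illustrate the distinction, here's a sketch using the schedule shape that ends up proposed later in this thread - illustrative only:)

```yaml
# One-time occurrence: a plain datetime window.
schedule:
  between:
    start: "2025-01-16T00:00:00Z"
    end: "2025-01-17T00:00:00Z"
---
# Daily recurrence: a cron start time plus a duration.
schedule:
  activePeriod:
    startCron: "0 1 * * *" # every day at 1:00 AM
    duration: "5h"
```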
You could also enable a change in min/max as well during a time period?
Yeah, I think that would be ideal. That way I can articulate the widest range of scheduled events. For example, you provided a very good example of when the % buffer might be useful, but I might actually want to set a lower floor as well at night. Likewise for a scheduled event, I may want to do something like:
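(Sketching this as a hypothetical override, with illustrative field names since the format wasn't settled at this point:)

```yaml
# Hypothetical scheduled override for a launch event (field names illustrative):
# raise the floor ahead of the event so capacity is already allocated.
start: "2024-06-01T18:00:00Z" # event window start
end: "2024-06-02T02:00:00Z"   # event window end
policy:
  type: Buffer
  buffer:
    bufferSize: 5     # keep only a small buffer once the floor is reached
    minReplicas: 500  # pre-scaled floor for the event
    maxReplicas: 2000
```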
That way we don't try to aggressively scale up as people join the event, we just fill out a fleet that we've already allocated. This puts less load on the cluster during a high stress event, and saves money on overall usage.
It would probably make sense to either fall back to the default or reject the CRD if there are any values not provided so as to avoid configuration mistakes.
So coming from the conversation in #3718 (@zmerlynn, @aRestless, @nrwiersma), I wanted to capture some thoughts here on a potential "policy chain" implementation. I think in actual examples, so this lets me flesh things out.
So first thought would be backward compatibility, but I think that's easy with the CRD constructs we already have - we just add a new `type` parameter to `FleetAutoscalerSpec`, and default it to `Policy` - so you could have:
```yaml
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Policy # this is the new bit, and "policy" would be the default.
  policy:
    type: Buffer
    buffer:
      bufferSize: 2
      minReplicas: 0
      maxReplicas: 10
```
But if you wanted to have a chain, then type would be "chain", like so:
```yaml
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Chain # so now populate the `chain` child element.
  chain:
    - type: Webhook
      webhook:
        service:
          name: autoscaler-webhook-service
          namespace: default
          path: scale
    - policy:
        type: Buffer
        buffer:
          bufferSize: 2
          minReplicas: 0
          maxReplicas: 10
```
Not 100% sure how to capture the fallthrough on a Webhook (maybe it doesn't need to be captured)? From here, we can probably add different Scheduling types (start and end dates, recurring, cron, etc.) that would allow scheduling.
100% a sacrificial draft, so feel free to play with it, but WDYT?
Let's limit the design for now just to the fallback discussion from #3718 / #3686 - we'll have someone working soon on the scheduling part.
However, lifting a bit from the conversation at https://github.com/googleforgames/agones/pull/3718#issuecomment-2032378428, @aRestless was proposing: a policy falls through to the next policy either if it fails (Webhook or whatever else we might add that has error returns), or if some conditional isn't met. That seems easy enough to reason about, and means we would basically have linear cascading of policies that were either erroring or conditional. Then either:
- we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?), or
- that's an exercise for the user -- don't put anything at the end you don't want to be the last chance at success. I don't think we can force it?
> we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?)
I think that "change nothing" is a reasonable default policy if all other steps error or their conditions weren't met. After all, it's the only action that makes sense for an empty list of policies (if that's something we want to allow).
There was another topic that was brought up, and that's if a "chain of chains" is valid.
A chain of chains would make it very easy to shoot oneself in the foot by building nested logical constructs that become hard to reason about. In my opinion there is a healthy pragmatism in saying that anything that gets a little bit more complex simply belongs in a webhook. And since this extension point exists, native support for other functionality might only be warranted for features that cannot be put into the webhook (e.g. fallback for webhook failing) or functionality that is likely to see widespread usage (e.g. the schedules proposed here).
I'm struggling to even come up with a use case for nested chains - but maybe someone else has thoughts on that.
I think that "change nothing" is a reasonable default policy if all other steps error or their conditions weren't met. After all, it's the only action that makes sense for an empty list of policies (if that's something we want to allow).
Agreed. Though should we allow a chain of zero? Maybe - it allows someone to construct an object they can manipulate later without having to insert, i.e. if you imagine having scheduling automation that inserts your schedule rules and maybe you just don't have any? Certainly seems fair for a chain of zero just to mean "do nothing".
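(Sketch of what that might look like under the draft format above, assuming `chain: []` is accepted:)

```yaml
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Chain
  chain: [] # no entries yet; scheduling automation may insert rules later
```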
> And since this extension point exists, native support for other functionality might only be warranted for features that cannot be put into the webhook (e.g. fallback for webhook failing) or functionality that is likely to see widespread usage (e.g. the schedules proposed here).
Agreed. With branching chains I'd be awfully tempted to support a unit test element as well. 😆
Sounds like we have consensus that:

- policies in a chain cascade linearly, falling through on error or when a condition isn't met,
- "change nothing" is the default if no policy in the chain applies, and
- a chain of chains is not valid - anything more complex belongs in a webhook,

and possible consensus that chains may be empty even if it's useless.
I concur on the above as well. Only thing I'd be explicit about is to put it behind a feature gate, just so we have room to experiment / change things if we need to.
> Though should we allow a chain of zero? Maybe
I think we should - I don't see any reason not to give people the option. We already let people set a fleet name that is invalid.
Off topic for the scheduled use case, but related to the policy chain: A case I've been running into recently, and it seems like the policy chain could solve (depending on what is allowed in the conditions for falling through), is that I want to set an X% buffer policy, but I also want to make sure that there are N ready game servers available. For example, if the peak number of allocated game servers in a cluster is 100 game servers, I may want to have a buffer of 10%. However, when the cluster is at its lowest usage, it might only have 10 allocated game servers, which would leave only 1 game server in the ready state. I'd love to be able to say "if allocated game servers are below 50, keep a ready buffer of 5, otherwise keep a buffer of 10%" or something to that effect.
That particular kind of feature could be a bit of a footgun in that if a user is not careful to make those boundaries somewhat smooth, they could end up doing a lot more scaling up and down in situations where they hover around the boundary (e.g. a user saying 10% for under 100 allocated instances, and 5% for more, would mean that there would be a scale down operation as soon as they cross 100 allocated instances). However, I don't think that's too terrible of an outcome, especially if the autoscaling interval isn't set too low.
@austin-space Seems like it's something that could be implemented as a conditional in the chain, though we'd have to be careful on how we define it so that it's deterministic at evaluation. We'll have a resource working on the scheduling case soon; I can see if we can work on that as a follow-on.
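(To make the idea concrete, a hypothetical shape - note that a `condition` field is not part of the current design, so this is purely illustrative:)

```yaml
# Purely hypothetical: a conditional chain entry (no such field exists today).
chain:
  - id: "low-usage"
    condition:
      allocatedReplicasBelow: 50 # hypothetical condition field
    policy:
      type: Buffer
      buffer:
        bufferSize: 5     # absolute ready buffer at low usage
  - id: "default"
    policy:
      type: Buffer
      buffer:
        bufferSize: "10%" # percentage buffer otherwise
```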
Reading about the "footgun" aspects, and the need to debug complex setups, I'm wondering if there's a need to track some aspects of the last scaling decision in the FleetAutoscaler status, e.g. inputs, results, errors that occurred, policy (in the chain) that was actually used.
Or would that be a pure logging topic to you folks?
This seems like an appropriate use of the Kubernetes event stream on the Autoscaler - which we already do.
We don't want to spam it too much though, so we should be judicious on what we add as an event - but it should track state changes - and especially if something fails (i.e. if a webhook fails, we should definitely log that as a specific event).
Random thought for today - we actually have prior art in Agones for "do the things in a list, in order, if the first one fails, do the next one" - so the concept is definitely not foreign to the project.
In `GameServerAllocationSpec` we do exactly this with `selectors`. Almost makes me wonder if `chain` should be `selectors`... but it doesn't quite fit.
@nrwiersma Someone will be working on scheduled autoscalers starting in 3-4 weeks - are you interested in re-driving https://github.com/googleforgames/agones/pull/3718 with the above discussion prior to that? If not, do you mind if we adapt it? Thanks!
@zmerlynn You are welcome to adapt it to your needs.
This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions.
Setting `awaiting-maintainer` since this is on our roadmap.
TLDR: This document outlines the design for a new feature in Agones that enables scheduled autoscaling for Fleet Autoscalers. This functionality allows users to define time windows for automatic adjustments to game server fleets based on predictable events or usage patterns.
This section defines the structure of scheduling and applying a policy within an Agones Fleet Autoscaler. It allows you to control when the autoscaler considers scaling the game server fleet based on your specified criteria.
The format includes three parameters for scheduling:

- `startCron`: a cron expression to define a schedule (e.g., daily, weekly) at which point the policy can be applied.
- `duration` (optional): the length of time for which the policy should be applied after the scheduled start time. If the `duration` field isn't specified, the policy is considered active from the `startCron` time until the schedule's end time has passed.
- `timezone` (optional): the timezone used for `startCron`, UTC by default.
```yaml
schedule:
  between:
    # Start checking to apply the policy at this time, must conform to RFC3339.
    start: "2024-02-20T16:04:00Z" # optional
    # End checking to apply the policy at this time, must conform to RFC3339.
    end: "2024-02-24T16:04:00Z" # optional
  activePeriod:
    # Timezone to use for the startCron field.
    # By default this field will be UTC if not specified.
    # Set the timezone to EST.
    timezone: "America/New_York"
    # Start applying the bufferSize every day at 1:00 AM.
    startCron: "0 1 * * *" # optional
    # Only apply the bufferSize for 5 hours.
    duration: "5h" # optional
...
```
This format defines a Fleet Autoscaler policy that utilizes a chain structure for applying scaling logic based on different conditions. It leverages the concept of "falling through the chain" to achieve flexible scheduling and scaling behavior.
Key Elements:
- Chain Policy: The `type` of the overall policy is set to `Chain`, indicating a sequence of conditions/schedules to be evaluated.
- Chain Entry: Each entry within the `chain` list contains an optional condition and a corresponding required policy to be applied if the condition/schedule is valid. If a chain entry has no schedule or condition, the corresponding policy will always be applied when the specified chain element is indexed. A chain entry contains the following:
  - ID: Each chain entry has an `id` (optional) for easier identification and a `type` of `Schedule`. By default the `id` will be the index of the chain entry within the chain (e.g. the first entry is 0, the second entry is 1).
  - Schedule: A chain entry can have a `schedule` (optional), containing an evaluation time window (`between`) and a policy application window (`activePeriod`).
  - Policy: Each chain entry has a `policy` (required) that defines the specific policy the FleetAutoscaler should execute to adjust the fleet. The only policies allowed under this field are Buffer, Counters/Lists, and Webhook.
Three Execution Flows For Chain Iteration
- Schedule/Condition Met - Policy Applied: If a chain entry's schedule/condition is currently active, the FleetAutoscaler applies that entry's policy and stops iterating through the chain.
- Schedule/Condition Not Met - Fall Through the Chain: If the entry's schedule/condition is not active, the FleetAutoscaler skips it and evaluates the next entry in the chain.
- No Schedule Defined - Default Policy Application: An entry with no schedule or condition always applies its policy when reached, so it can act as a default at the end of the chain.
Importance of Chaining: By chaining multiple elements with different schedules and policies, you can create layered scaling logic. The FleetAutoscaler keeps checking elements until it finds an active schedule and applies the corresponding policy for scaling. This approach allows for more nuanced scaling behavior based on various conditions throughout the day or week. If no schedule is applicable, the fleet autoscaler will not apply any policy unless a default policy is specified or a chain entry's schedule becomes eligible.
```yaml
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-chain-autoscaler
spec:
  policy:
    type: Chain # Chain based policy for autoscaling.
    chain:
      # Id of chain entry.
      # Optional.
      - id: "weekday"
        type: Schedule # Schedule based condition.
        schedule:
          between:
            # The policy becomes eligible for application starting on
            # Feb 20, 2024 at 4:04 PM EST.
            # Optional.
            start: "2024-02-20T16:04:04-05:00"
            # The policy becomes ineligible for application on
            # Feb 23, 2024 at 4:04 PM EST.
            # Optional.
            end: "2024-02-23T16:04:04-05:00"
          activePeriod:
            # Timezone to be used for the startCron field.
            # Optional.
            timezone: "America/New_York"
            # Start applying the bufferSize every day at 1:00 AM EST.
            # (Only eligible starting on Feb 20, 2024 at 4:04 PM.)
            # Optional.
            startCron: "0 1 * * *"
            # Only apply the bufferSize for 5 hours.
            # Optional.
            duration: "5h"
        # Policy to be applied when the condition is met.
        # Required.
        policy:
          type: Buffer
          buffer:
            bufferSize: 50
            minReplicas: 100
            maxReplicas: 2000
      # Id of chain entry.
      # Optional.
      - id: "weekend"
        type: Schedule
        schedule:
          between:
            # The policy becomes eligible for application starting on
            # Feb 24, 2024 at 4:05 PM EST.
            # Optional.
            start: "2024-02-24T16:04:05-05:00"
            # The policy becomes ineligible for application starting on
            # Feb 26, 2024 at 4:05 PM EST.
            # Optional.
            end: "2024-02-26T16:04:05-05:00"
          activePeriod:
            # Timezone to be used for the schedule.
            timezone: "America/New_York"
            # Start applying the bufferSize every day at 1:00 AM EST.
            # (Only eligible starting on Feb 24, 2024 at 4:05 PM EST.)
            # Optional.
            startCron: "0 1 * * *"
            # Only apply the bufferSize for 7 hours.
            # Optional.
            duration: "7h"
        # Policy to be applied when the condition is met.
        # Required.
        policy:
          type: Counter
          counter:
            key: rooms
            bufferSize: 10
            minCapacity: 500
            maxCapacity: 1000
      # Id of chain entry.
      - id: "default"
        # Policy will always be applied when no other policy is applicable.
        # Required.
        policy:
          type: Buffer
          buffer:
            bufferSize: 5
            minReplicas: 100
            maxReplicas: 2000
```
Design LGTM! A couple of nits:
> It is recommended to use ISO8601 time format if you would like to specify a timezone. If a timezone is specified and RFC3339 format is used, the formatted string will take precedence if the timezones differ.
I would be explicit and use the code formatting to help guide the reader here: e.g. "It is recommended to use ISO8601 time format without a time zone if you would like to specify a timezone using `.timezone`. If `.timezone` is specified and `.between.start` or `.between.end` includes a timezone as well, the formatted string will take precedence if the timezones differ." Note that ISO8601 can include a timezone, so it's one reason I'm being pedantic here.
> Schedule a chain entry can have a schedule (optional) contains a:
This section seems redundant with the definition of the schedule above in the design, maybe drop it or shorthand it more?
> Evaluation Time Window (between): This uses start and end datetimes that must conform to RFC3339 or ISO8601 to define a time range. The policy application window will only be evaluated within this window. (e.g. Start evaluating the policy application window 6 months from now and stop 12 months from now). It is recommended to use ISO8601 time format without a time zone if you would like to specify a timezone using .timezone. If .timezone is specified and .between.start or .between.end includes a timezone as well, the formatted string will take precedence if the timezones differ.
Rather than precedence - could we fail validation if a user provides both? Basically you could do one or the other, but not both?
> Policy Application Window (`activePeriod`)

I'm assuming `activePeriod` is optional if a `between` is not specified - and will default to `always` essentially?

e.g. for CUJ No. 1 "E.g. On the 16th of January, 2025 Make the buffer size 20%, instead of the default 10%" - there's no need for an `activePeriod`.
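(Illustrative sketch of that CUJ under the proposed format - a `between` window with no `activePeriod`, so the policy is eligible for the whole window:)

```yaml
chain:
  - id: "jan-16-event"
    type: Schedule
    schedule:
      between:
        start: "2025-01-16T00:00:00Z"
        end: "2025-01-17T00:00:00Z"
      # no activePeriod: the policy is eligible for the entire between window
    policy:
      type: Buffer
      buffer:
        bufferSize: "20%"
```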
> Rather than precedence - could we fail validation if a user provides both? Basically you could do one or the other, but not both?
Yeah, I like that, so if a user provides a `.timezone` and a start/end time with a timezone, validation fails. We can do the same w/ CRON_TZ/TZ: if the user decides to specify a TZ for the `.activePeriod.startCron` and it differs from the `.timezone`, validation fails.
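(A sketch of what would be rejected under that rule - illustrative values only:)

```yaml
# Would fail validation under this proposal: .timezone conflicts with the
# explicit offset inside .between.start.
schedule:
  timezone: "America/New_York"
  between:
    start: "2024-02-20T16:04:04Z" # explicit UTC offset conflicts with .timezone
# Likewise, a CRON_TZ/TZ prefix in .activePeriod.startCron that differs from
# .timezone would fail validation:
#   startCron: "CRON_TZ=UTC 0 1 * * *"
```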
> I'm assuming `activePeriod` is optional if a `between` is not specified - and will default to `always` essentially?
Yes, exactly. If the user really wanted to they could set the `.activePeriod.startCron` to `"* * * * *"` and leave the `duration` empty, which would have the same effect as well.
@indexjoseph @zmerlynn are there any outstanding items, or can we mark this as complete?
Is your feature request related to a problem? Please describe.
During scheduled in-game events or new version releases, we see pretty rapid spikes in usage of either an already high use fleet or a newer unused fleet. Both of these events we know the timing of, and our current options are either:

- aggressively prescale by adjusting the autoscaler directly ahead of time, or
- implement the logic in a webhook autoscaler.
Describe the solution you'd like
Introduce the concept of scheduled overrides that contain the following:

- a window during which the override applies (a start/end time, or a recurrence),
- the buffer value to use during that window, and
- optionally, min/max replica counts to apply during that window.
Then on autoscaling evaluation:

- determine the highest priority active override, if any, and
- apply its values in place of the defaults.
This would allow us to set special scaling windows for events or new version releases. A further extension could be to allow recurring windows to do time of day scheduling so that we could have a buffer window in the off hours and a percentage during higher usage, which could help with issues like that described in #2504
Describe alternatives you've considered
As described at the top, we can prescale aggressively, which means either adjusting the autoscaler directly or using the webhook autoscaler.