googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Feature Request: Scheduled Autoscalers #3008

Open austin-space opened 1 year ago

austin-space commented 1 year ago

[!NOTE] If you want to skip a bunch of discussion and see @indexjoseph's proposed design, see https://github.com/googleforgames/agones/issues/3008#issuecomment-2174571364

Is your feature request related to a problem? Please describe. During scheduled in-game events or new version releases, we see pretty rapid spikes in usage of either an already high-use fleet or a newer, unused fleet. We know the timing of both of these events, and our current options are either:

  1. Prescale aggressively: this works, but unless we build scheduling logic ourselves to undo the additional scale afterwards, we're paying for a lot of unused capacity.
  2. Webhook autoscaler: this is a viable solution, but requires us to build a service to do so.

Describe the solution you'd like Introduce the concept of scheduled overrides that contain the following:

  1. a start time (in UTC)
  2. an end time (in UTC)
  3. a priority int (higher is better, much like PriorityClasses)
  4. a buffer autoscaler block

Then on autoscaling evaluation:

  1. collect those overrides for which the current time is between the start and end time
     a. if there are no matching overrides, just use the default autoscaling rule
  2. of those, select the highest priority
  3. apply that buffer autoscaling rule instead of the default

This would allow us to set special scaling windows for events or new version releases. A further extension could be to allow recurring windows for time-of-day scheduling, so that we could use a fixed buffer window in the off hours and a percentage during higher usage, which could help with issues like the one described in #2504
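
To make that concrete, here is a rough sketch of what such an override block could look like (purely illustrative - none of these fields exist in Agones today, and the names are made up for this proposal):

fleetName: simple-game-server
policy:
  type: Buffer
  buffer:
    bufferSize: "10%"   # default rule
    minReplicas: 10
    maxReplicas: 100
overrides:              # hypothetical field
- name: new-version-launch            # a name is only needed for status/debugging
  start: "2024-03-01T00:00:00Z"       # window start, UTC
  end: "2024-03-02T00:00:00Z"         # window end, UTC
  priority: 100                       # higher wins, much like PriorityClasses
  buffer:                             # buffer rule applied while the window is active
    bufferSize: "20%"
    minReplicas: 50
    maxReplicas: 500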

Describe alternatives you've considered As described at the top, we can either prescale aggressively (which means adjusting the autoscaler directly) or use the webhook autoscaler.

markmandel commented 1 year ago

Brainstorm thought I had while looking at this - we would need to store on the autoscaler CRD whether the scheduling had happened or not (past 3 entries?), just in case the controller goes down over the scheduling time period, so it can do the work it was supposed to do before.

austin-space commented 1 year ago

I'm not sure that you would need to. This is intended to be stateless, so it just pretends that it's a normal buffer autoscaler. It just so happens that the current buffer rule would be determined by the highest priority active override.

That being said, it's probably worth updating the CRD with the last scale rule used, in order to get a sense of why it might be scaling to some level (or not). In that case a name field would be needed for any overrides.

markmandel commented 1 year ago

Oh I see - so the actual result would be something like

default buffer: 10%
Between 1am and 5am: scale buffer 5% (low peak time)
Between 1pm and 5pm: scale buffer 20% (high peak time)

Then on each autoscaler loop you would just be looking at which buffer to apply. That makes a lot of sense 👍🏻

I had it in my head more of a "at 1pm exactly do this thing", but this is better. I like it!

austin-space commented 1 year ago

Yeah, that seemed like the way to keep the changes to the autoscaler as minimal and resilient as possible.

Your comment did remind me that there's an important distinction here between "scheduled once" and "scheduled recurring". Ideally I want to be able to do both (which could be as easy as specifying just a time when you want daily recurrence, and a datetime when you want a single occurrence), but I suspect that there will be a desire for day of week recurrence. For now I feel like that can probably be avoided for simplicity, since that complexity can spiral out of control quickly.

markmandel commented 1 year ago

You could also enable a change in min/max as well during a time period?

austin-space commented 1 year ago

Yeah, I think that would be ideal. That way I can articulate the widest range of scheduled events. For example, you provided a very good example of when the % buffer might be useful, but I might actually want to set a lower floor as well at night. Likewise for a scheduled event, I may want to do something like:

  1. Set up a new fleet with the same image type, and a label indicating its servers are for that event.
  2. 10 minutes before the event, have a scheduled scaling rule kick in that sets the floor to the number of additional servers we think we need for the event.
  3. A while into the event, flip the minimum back off so that as people leave, the fleet naturally winds down.
  4. Scale down to 0 after the event is over.

That way we don't try to aggressively scale up as people join the event; we just fill out a fleet that we've already allocated. This puts less load on the cluster during a high-stress event, and saves money on overall usage.

austin-space commented 1 year ago

It would probably make sense to either fall back to the default or reject the CRD if there are any values not provided so as to avoid configuration mistakes.

markmandel commented 8 months ago

So coming from the conversation in #3718 (@zmerlynn, @aRestless, @nrwiersma), I wanted to capture some thoughts here on a potential "policy chain" implementation. I think in actual examples, so this lets me flesh things out.

So first thought would be backward compatibility, but I think that's easy with the CRD constructs we already have: we just add a new type parameter to FleetAutoscalerSpec and default it to Policy - so you could have:

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Policy # this is the new bit, and "policy" would be the default.
  policy:
    type: Buffer
    buffer:
      bufferSize: 2
      minReplicas: 0
      maxReplicas: 10

But if you wanted to have a chain, then type would be "chain", like so:

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Chain # so now populate the `chain` child element.
  chain:
    - type: Webhook
      webhook:
        service:
          name: autoscaler-webhook-service
          namespace: default
          path: scale
    - policy:
        type: Buffer
        buffer:
          bufferSize: 2
          minReplicas: 0
          maxReplicas: 10

Not 100% sure how to capture the fallthrough on a Webhook (maybe it doesn't need to be captured)? From here, we can probably add different types of Scheduling (start and end dates, recurring, cron, etc.?) that would allow scheduling.

100% a sacrificial draft, so feel free to play with it, but WDYT?

zmerlynn commented 8 months ago

Let's limit the design for now just to the fallback discussion from #3718 / #3686 - we'll have someone working soon on the scheduling part.

However, lifting a bit from the conversation at https://github.com/googleforgames/agones/pull/3718#issuecomment-2032378428, @arestless was proposing: a policy falls through to the next policy either if it fails (Webhook or whatever else we might add that has error returns), or if some conditional isn't met. That seems easy enough to reason about, and means we would basically have linear cascading of policies that were either:

markmandel commented 8 months ago

we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?)

Or that's an exercise for the user -- don't put anything at the end you don't want to be the last chance at success.

I don't think we can force it?

aRestless commented 8 months ago

we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?)

I think that "change nothing" is a reasonable default policy if all other steps error or their conditions weren't met. After all, it's the only action that makes sense for an empty list of policies (if that's something we want to allow).

There was another topic that was brought up, and that's if a "chain of chains" is valid.

A chain of chains would make it very easy to shoot oneself in the foot by building nested logical constructs that become hard to reason about. In my opinion there is a healthy pragmatism in saying that anything that gets a little bit more complex simply belongs in a webhook. And since this extension point exists, native support for other functionality might only be warranted for features that cannot be put into the webhook (e.g. fallback for the webhook failing) or functionality that is likely to see widespread usage (e.g. the schedules proposed here).

I'm struggling to even come up with a use case for nested chains - but maybe someone else has thoughts on that.

zmerlynn commented 8 months ago

I think that "change nothing" is a reasonable default policy if all other steps error or their conditions weren't met. After all, it's the only action that makes sense for an empty list of policies (if that's something we want to allow).

Agreed. Though should we allow a chain of zero? Maybe - it allows someone to construct an object they can manipulate later without having to insert, I.e. if you imagine having scheduling automation that inserts your schedule rules and maybe you just don't have any? Certainly seems fair for a chain of zero just to mean "do nothing".

And since this extension point exists, native support for other functionality might only be warranted for features that cannot be put into the webhook (e.g. fallback for webhook failing) or functionality that is likely to see widespread usage (e.g. the schedules proposed here).

Agreed. With branching chains I'd be awfully tempted to support a unit test element as well. 😆

Sounds like we have consensus that:

and possible consensus that chains may be empty even if it's useless.

markmandel commented 8 months ago

I concur on the above as well. Only thing I'd be explicit about is to put it behind a feature gate, just so we have room to experiment / change things if we need to.

markmandel commented 8 months ago

Though should we allow a chain of zero? Maybe

I think we should - I don't see any reason not to give people the option. We already let people set a fleet name that is invalid.

austin-space commented 8 months ago

Off topic for the scheduled use case, but related to the policy chain: a case I've been running into recently, which it seems like the policy chain could solve (depending on what is allowed in the conditions for falling through), is that I want to set an X% buffer policy, but I also want to make sure that there are N ready game servers available. For example, if the peak number of allocated game servers in a cluster is 100, I may want to have a buffer of 10%. However, when the cluster is at its lowest usage, it might only have 10 allocated game servers, which would leave only 1 game server in the ready state. I'd love to be able to say "if allocated game servers are below 50, keep a ready buffer of 5, otherwise keep a buffer of 10%" or something to that effect.

That particular kind of feature could be a bit of a footgun: if a user is not careful to make those boundaries somewhat smooth, they could end up doing a lot more scaling up and down when hovering around the boundary (e.g. 10% for under 100 allocated instances and 5% above that would mean a scale-down operation as soon as they cross 100 allocated instances). However, I don't think that's too terrible an outcome, especially if the autoscaling interval isn't set too low.

zmerlynn commented 8 months ago

@austin-space Seems like it's something that could be implemented as a conditional in the chain, though we'd have to be careful about how we define it so that it's deterministic at evaluation. We'll have a resource working on the scheduling case soon; I can see if we can work on that as a follow-on.
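
Just to illustrate the idea (the condition field and its allocatedBelow key are invented here for the sake of discussion and are not part of any proposal), a conditional chain entry might look something like:

chain:
- id: low-usage
  # hypothetical condition: only applies while allocations are low
  condition:
    allocatedBelow: 50
  policy:
    type: Buffer
    buffer:
      bufferSize: 5        # keep an absolute buffer of 5 ready servers
      minReplicas: 5
      maxReplicas: 1000
- id: default
  policy:
    type: Buffer
    buffer:
      bufferSize: "10%"    # otherwise keep a 10% buffer
      minReplicas: 5
      maxReplicas: 1000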

aRestless commented 8 months ago

Reading about the "footgun" aspects, and the need to debug complex setups, I'm wondering if there's a need to track some aspects of the last scaling decision in the FleetAutoscaler status, e.g. inputs, results, errors that occurred, policy (in the chain) that was actually used.

Or would that be a pure logging topic to you folks?

markmandel commented 7 months ago

This seems like an appropriate use of the Kubernetes event stream on the Autoscaler - which we already do.

We don't want to spam it too much though, so we should be judicious on what we add as an event - but it should track state changes - and especially if something fails (i.e. if a webhook fails, we should definitely log that as a specific event).

markmandel commented 7 months ago

Random thought for today - we actually have prior art in Agones for "do the things in a list, in order; if the first one fails, do the next one" - so the concept is definitely not foreign to the project.

In GameServerAllocationSpec we do exactly this with selectors.

Almost makes me wonder if chain should be selectors .. but it doesn't quite fit.
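
For reference, that prior art looks roughly like this (fleet labels here are just for illustration):

apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  # selectors are evaluated in order - if the first finds no match,
  # the allocation falls through to the next one
  selectors:
  - matchLabels:
      agones.dev/fleet: green-fleet
  - matchLabels:
      agones.dev/fleet: blue-fleet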

zmerlynn commented 7 months ago

@nrwiersma Someone will be working on scheduled autoscalers starting in 3-4 weeks - are you interested in re-driving https://github.com/googleforgames/agones/pull/3718 with the above discussion prior to that? If not, do you mind if we adapt it? Thanks!

nrwiersma commented 7 months ago

@zmerlynn You are welcome to adapt it to your needs.

github-actions[bot] commented 6 months ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

markmandel commented 5 months ago

Setting awaiting-maintainer since this is on our roadmap.

indexjoseph commented 5 months ago

Scheduled Fleet Autoscaling Design

TLDR: This document outlines the design for a new feature in Agones that enables scheduled autoscaling for Fleet Autoscalers. This functionality allows users to define time windows for automatic adjustments to game server fleets based on predictable events or usage patterns.

Requirements

Critical User Journeys

Proposed Solution

  1. FleetAutoscaler CRD Field Expansion: Introduce new fields within the existing Fleet Autoscaler Custom Resource Definition (CRD) to accommodate scheduled scaling configurations:
     - Implement a chain policy - defines a chain, indicating a sequence of conditions (associated with policies) to be evaluated.
     - Implement a schedule for chain entries - defines the scheduling criteria for applying scaling logic.
  2. Feature Gate Implementation: Introduce a feature gate to control access to the new scheduled autoscaling functionality.

Proposed FleetAutoscaler CRD Changes

This section defines the structure for scheduling and applying a policy within an Agones Fleet Autoscaler. It allows you to control when the autoscaler considers scaling the game server fleet based on your specified criteria.

The format includes the following parameters for scheduling:

  1. Evaluation Time Window (between): This uses start and end datetimes, which must conform to RFC3339, to define a time range. The policy application window will only be evaluated within this range (e.g. start evaluating the policy application window 6 months from now and stop 12 months from now).
  2. Policy Application Window (activePeriod): This uses a cron expression (startCron) to define a schedule (e.g., daily, weekly) at which point the policy can be applied. duration (optional) specifies the length of time for which the policy should be applied after the scheduled start time; if duration isn't specified, it is interpreted as forever (once startCron has passed, the policy is considered active until the between end time has passed). timezone (optional) specifies the timezone used for startCron, which defaults to UTC.
...
schedule:
  between:
    # Start checking to apply the policy at this time, must conform to RFC3339. 
    start: "2024-02-20T16:04:00Z" # optional
    # End checking to apply the policy at this time, must conform to RFC3339.
    end: "2024-02-24T16:04:00Z" # optional
  activePeriod:
    # Timezone to use for the startCron field.
    # By default this field will be UTC if not specified.
    # Set the timezone to US Eastern Time.
    timezone: "America/New_York"
    # Start applying the bufferSize every Sunday at 1:00 AM.
    startCron: "0 1 * * 0" # optional
    # Only apply the bufferSize for 5 hours.
    duration: "5h" # optional
...

Proposed Chain Policy Implementation

This format defines a Fleet Autoscaler policy that utilizes a chain structure for applying scaling logic based on different conditions. It leverages the concept of "falling through the chain" to achieve flexible scheduling and scaling behavior.

Key Elements:

Three Execution Flows For Chain Iteration

Schedule/Condition Met - Policy Applied: if the current time falls within a chain entry's schedule (or its condition is met), that entry's policy is applied and the rest of the chain is skipped.

Schedule/Condition Not Met - Fall Through the Chain: if an entry's schedule is not currently active (or its condition is not met), the FleetAutoscaler falls through and evaluates the next entry in the chain.

No Schedule Defined - Default Policy Application: an entry without a schedule always matches, so it acts as the default policy when it is reached.

Importance of Chaining: By chaining multiple elements with different schedules and policies, you can create layered scaling logic. The FleetAutoscaler keeps checking elements until it finds an active schedule and applies the corresponding policy for scaling. This approach allows for more nuanced scaling behavior based on various conditions throughout the day or week. If no schedule is applicable, the fleet autoscaler will not apply any policy unless a default policy is specified or a chain entry's schedule becomes eligible.

Chain Example

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
    name: simple-game-server-chain-autoscaler
spec:
    policy:
      type: Chain # Chain based policy for autoscaling.
      chain:
        # Id of chain entry.
        # Optional.
      - id: "weekday"
        type: Schedule # Schedule based condition.
        schedule:
          between:
            # The policy becomes eligible for application starting on 
            # Feb 20, 2024 at 4:04 PM EST.
            # Optional.
            start: "2024-02-20T16:04:04-05:00"
            # The policy becomes ineligible for application on 
            # Feb 23, 2024 at 4:04 PM EST.
            # Optional.
            end: "2024-02-23T16:04:04-05:00" # optional
          activePeriod:
            # Timezone to be used for the startCron field.
            # Optional.
            timezone: "America/New_York"
            # Start applying the bufferSize every Sunday at 1:00 AM EST.
            # (Only eligible starting on Feb 20, 2024 at 4:04 PM EST.)
            # Optional.
            startCron: "0 1 * * 0"
            # Only apply the bufferSize for 5 hours.
            # Optional.
            duration: "5h"
        # Policy to be applied when the condition is met.
        # Required.
        policy:
          type: Buffer
          buffer:
            bufferSize: 50
            minReplicas: 100
            maxReplicas: 2000
        # Id of chain entry.
        # Optional.
      - id: "weekend" 
        type: Schedule
        schedule:
          between:
            # The policy becomes eligible for application starting on
            # Feb 24, 2024 at 4:05 PM EST.
            # Optional.
            start: "2024-02-24T16:04:05-05:00"
            # The policy becomes ineligible for application starting on
            # Feb 26, 2024 at 4:05 PM EST.
            # Optional.
            end: "2024-02-26T16:04:05-05:00"
          activePeriod:
            # Timezone to be used for the schedule.
            timezone: "America/New_York"
            # Start applying the bufferSize every Sunday at 1:00 AM EST.
            # (Only eligible starting on Feb 24, 2024 at 4:05 PM EST.)
            # Optional.
            startCron: "0 1 * * 0"
            # Only apply the bufferSize for 7 hours.
            # Optional.
            duration: "7h"
        # Policy to be applied when the condition is met.
        # Required.
        policy:
          type: Counter
          counter:
            key: rooms
            bufferSize: 10
            minCapacity: 500
            maxCapacity: 1000
        # Id of chain entry.
      - id: "default"
        # Policy will always be applied when no other policy is applicable.
        # Required.
        policy:
          type: Buffer
          buffer:
            bufferSize: 5
            minReplicas: 100
            maxReplicas: 2000

zmerlynn commented 5 months ago

Design LGTM! A couple of nits:


It is recommended to use ISO8601 time format if you would like to specify a timezone. If a timezone is specified and RFC3339 format is used, the formatted string will take precedence if the timezones differ.

I would be explicit and use the code formatting to help guide the reader here: e.g. "It is recommended to use ISO8601 time format without a time zone if you would like to specify a timezone using .timezone. If .timezone is specified and .between.start or .between.end includes a timezone as well, the formatted string will take precedence if the timezones differ." Note that ISO8601 can include a timezone, so it's one reason I'm being pedantic here.


Schedule a chain entry can have a schedule (optional) contains a:

This section seems redundant with the definition of the schedule above in the design, maybe drop it or shorthand it more?

markmandel commented 5 months ago

Evaluation Time Window (between): This uses start and end datetimes that must conform to RFC3339 or ISO8601 to define time range. The policy application window will only be evaluated within this window. (e.g. Start evaluating the policy applications window 6 months from now and stop 12 months from now). It is recommended to use ISO8601 time format without a time zone if you would like to specify a timezone using .timezone. If .timezone is specified and .between.start or .between.end includes a timezone as well, the formatted string will take precedence if the timezones differ.

Rather than precedence - could we fail validation if a user provides both? Basically you could do one or the other, but not both?

Policy Application Window (activePeriod)

I'm assuming activePeriod is optional if a between is not specified - and will default to always essentially?

e.g. for CUJ No. 1 "E.g. On the 16th of January, 2025 Make the buffer size 20%, instead of the default 10%" - there's no need for an activePeriod.
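
e.g. something like this, using the proposed format (just a sketch to illustrate - a between window only, with no activePeriod):

policy:
  type: Chain
  chain:
  - id: "jan-16-event"
    type: Schedule
    schedule:
      between:
        start: "2025-01-16T00:00:00Z"
        end: "2025-01-17T00:00:00Z"
      # no activePeriod - active for the whole between window
    policy:
      type: Buffer
      buffer:
        bufferSize: "20%"
        minReplicas: 10
        maxReplicas: 1000
  - id: "default"
    policy:
      type: Buffer
      buffer:
        bufferSize: "10%"
        minReplicas: 10
        maxReplicas: 1000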

indexjoseph commented 5 months ago

Rather than precedence - could we fail validation if a user provides both? Basically you could do one or the other, but not both?

Yeah, I like that, so if a user provides a .timezone and a start/end time with a timezone, validation fails. We can do the same with CRON_TZ/TZ: if the user decides to specify a TZ for the .activePeriod.startCron and it differs from the .timezone, validation fails.

I'm assuming activePeriod is optional if a between is not specified - and will default to always essentially?

Yes, exactly. If the user really wanted to, they could set the .activePeriod.startCron to "* * * * *" and leave the duration empty, which would have the same effect as well.
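
As an illustration of the validation rule, a schedule like the following would be rejected, since start/end carry their own offset while .timezone is also set:

schedule:
  between:
    # explicit -05:00 offsets here conflict with .timezone below -> validation error
    start: "2025-01-16T00:00:00-05:00"
    end: "2025-01-17T00:00:00-05:00"
  activePeriod:
    timezone: "America/New_York"
    startCron: "0 1 * * *"
    duration: "5h"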

igooch commented 3 months ago

@indexjoseph @zmerlynn are there any outstanding items, or can we mark this as complete?