topfreegames / maestro

Maestro: Game Room Scheduler
MIT License
115 stars 15 forks source link

refactor(healthcontroller): generate only one add/remove operation per cycle #636

Closed hspedro closed 2 months ago

hspedro commented 3 months ago

Currently, the health controller is performing both rolling updates and autoscaling. We can roughly divide these responsibilities among these functions executed by the operation:

  1. performRollingUpdate: generates a pair of add and remove operations based on rolling update constraints (maxSurge and maxUnavailable = 0)
  2. ensureDesiredAmountOfInstances: generates either an add or remove operation to move the current state equal to the desired

In a lot of cases, one cycle will generate conflicting operations: autoscale will try to remove X amount of rooms but rolling update only wants Y amount of rooms deleted. Moreover, autoscale deletion is based on its policy (currently only occupancy) but rolling update wants to delete based on the version of the scheduler. Thus, to streamline and simplify this process the refactor aims to split the responsibility and let only one piece of logic handle the creation or removal of rooms.

The work of this PR can be summarized in:

1. Split the logic, inside healthcontroller/execute.go

So that we have one logical abstraction to compute the desired number of rooms based on autoscaling and rolling update (if ongoing), and another logical abstraction responsible for fixing the drift between current and desired by enqueuing add or remove operations

All logic that is responsible for computing the desired amount can be refactored to a different operation, outside of execution worker, later on. All those methods are not receivers of the Execute and the dependency is injected by their arguments in the functions:

2. Refactor the Remove Rooms operation to sort by version when deleting a number of game rooms

Currently, the system will sort by game room status, however, we want to take the sorted rooms and apply a sort by version on Pending, Ready, and Occupied rooms.

Given this set of rooms with their status + versions: [Error1v1, Error2v2, Pending1v1, Pending2v2, Ready1v1, Ready2v2, Occupied1v1, Occupied2v2]

Currently, the sort by priority/status will return the above list, however, we want to ensure old scheduler versions (v1) are deleted first (Kubernetes deletes pods based on creation time, so it will be similar). Thus, the new sorting version will become:

If there's only one active version of the scheduler in the system, then we skip the additional sort.

3. Simplified calculations of desired when autoscaling + rolling updating

As a baseline, the system will always use the desired from autoscaling (or roomsReplica if no autoscale is defined). However, if we have a major version of the difference between schedulers of current rooms in the system, then we are in a rolling update and we should consider the maxSurge.

Maestro Rollout Strategy Design Doc

4. Add unit tests with ALL rolling update cycles

Create a matrix of test cases with table tests, validated in spreadsheets and with the desired behavior that each health_controller operation should take during a rolling update. There are 4 base cases: stable occupancy, occupancy increased, occupancy decreased, and stable occupancy but only 75% of new rooms becoming active.

Spreadsheet reference with the desired cases: https://docs.google.com/spreadsheets/d/13grNuk3os7mOesnpROqLOVTSWiGpgvi5JUgWUB3Z428/edit?usp=sharing

codecov-commenter commented 3 months ago

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 64.15%. Comparing base (bb118c8) to head (404be51). Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
internal/adapters/runtime/kubernetes/game_room.go 60.00% 3 Missing and 1 partial :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #636 +/- ## ========================================== - Coverage 64.30% 64.15% -0.15% ========================================== Files 39 39 Lines 2905 2960 +55 ========================================== + Hits 1868 1899 +31 - Misses 909 929 +20 - Partials 128 132 +4 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.