Closed hspedro closed 2 months ago
:warning: Please install the to ensure uploads and comments are reliably processed by Codecov.
Attention: Patch coverage is 60.00000%
with 4 lines
in your changes missing coverage. Please review.
Project coverage is 64.15%. Comparing base (
bb118c8
) to head (404be51
). Report is 14 commits behind head on main.
Files with missing lines | Patch % | Lines |
---|---|---|
internal/adapters/runtime/kubernetes/game_room.go | 60.00% | 3 Missing and 1 partial :warning: |
:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Currently, the health controller is performing both rolling updates and autoscaling. We can roughly divide these responsibilities among these functions executed by the operation:
In a lot of cases, one cycle will generate conflicting operations: autoscale will try to remove X amount of rooms but rolling update only wants Y amount of rooms deleted. Moreover, autoscale deletion is based on its policy (currently only occupancy) but rolling update wants to delete based on the version of the scheduler. Thus, to streamline and simplify this process the refactor aims to split the responsibility and let only one piece of logic handle the creation or removal of rooms.
The work of this PR can be summarized in:
1. Split the logic, inside
healthcontroller/execute.go
So that we have one logical abstraction to compute the desired number of rooms based on autoscaling and rolling update (if ongoing), and another logical abstraction responsible for fixing the drift between current and desired by enqueuing add or remove operations
All logic that is responsible for computing the desired amount can be refactored to a different operation, outside of execution worker, later on. All those methods are not receivers of the
Execute
and the dependency is injected by their arguments in the functions:2. Refactor the Remove Rooms operation to sort by version when deleting a number of game rooms
Currently, the system will sort by game room status, however, we want to take the sorted rooms and apply a sort by version on Pending, Ready, and Occupied rooms.
Given this set of rooms with their status + versions: [Error1v1, Error2v2, Pending1v1, Pending2v2, Ready1v1, Ready2v2, Occupied1v1, Occupied2v2]
Currently, the sort by priority/status will return the above list, however, we want to ensure old scheduler versions (v1) are deleted first (Kubernetes deletes pods based on creation time, so it will be similar). Thus, the new sorting version will become:
If there's only one active version of the scheduler in the system, then we skip the additional sort.
3. Simplified calculations of desired when autoscaling + rolling updating
As a baseline, the system will always use the desired from autoscaling (or roomsReplica if no autoscale is defined). However, if we have a major version of the difference between schedulers of current rooms in the system, then we are in a rolling update and we should consider the
maxSurge
.4. Add unit tests with ALL rolling update cycles
Create a matrix of test cases with table tests, validated in spreadsheets and with the desired behavior that each health_controller operation should take during a rolling update. There are 4 base cases: stable occupancy, occupancy increased, occupancy decreased, and stable occupancy but only 75% of new rooms becoming active.
Spreadsheet reference with the desired cases: https://docs.google.com/spreadsheets/d/13grNuk3os7mOesnpROqLOVTSWiGpgvi5JUgWUB3Z428/edit?usp=sharing