googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Adding GameServerSet Metric #3663

Closed (Reasonably closed 5 days ago)

Reasonably commented 9 months ago

Is your feature request related to a problem? Please describe. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like A clear and concise description of what you want to happen.

Additional context Add any other context or screenshots about the feature request here.

Kalaiselvi84 commented 9 months ago

@Reasonably, we're looking into adding this feature in a way that won't make things more expensive or complex. We're also thinking about showing updates at the fleet level. Can you please check whether this requirement is similar to https://github.com/googleforgames/agones/issues/2817?

cc: @markmandel @roberthbailey

markmandel commented 9 months ago

Good questions @Kalaiselvi84

To that point, if we had metrics for each GameServerSet we would hit cardinality explosion, and that would be bad - but for what you want, some gauge metric based on Fleet scale-out would probably also work?

Reasonably commented 8 months ago

@Kalaiselvi84 Thank you for sharing that feature request; it is similar to my requirements.

However, I wanted to understand the situation more precisely.
For example, when you update the game version of fleets that are serving the game, that feature makes it possible to determine whether the update of those fleets is complete.
However, if the update strategy is set to values (a small maxSurge or maxUnavailable) that are inadequate for the incoming allocation requests, the update may take longer than scheduled. In that case, knowing only whether the update is complete is not enough. If I could see how each GameServerSet's allocated, ready, and desired counts changed over time, it would greatly help me find the right strategy.

@markmandel
The reason I opened this issue is that I expected the cardinality would not be high. A GameServerSet takes a form very similar to a Fleet, and Fleet is already exposing metrics.
When a fleet is updated, a new GameServerSet is created and the existing GameServerSet disappears, so I anticipated that the cardinality would increase only linearly. And as far as I know, the fleet gauge metrics have no GameServerSet labels. Is there anything I might be misunderstanding?

markmandel commented 8 months ago

For example, when you update the game version of fleets that are serving the game, that feature makes it possible to determine whether the update of those fleets is complete.

I would suggest reading the ticket in its entirety - what "complete" means for a rollout can be tricky, so we outlined a few use cases in that ticket.

But what I'm hearing here is - exposing rollout state specifically through metrics is not a requirement of this ticket?

@markmandel The reason I opened this issue is that I expected the cardinality would not be high. A GameServerSet takes a form very similar to a Fleet, and Fleet is already exposing metrics. When a fleet is updated, a new GameServerSet is created and the existing GameServerSet disappears, so I anticipated that the cardinality would increase only linearly. And as far as I know, the fleet gauge metrics have no GameServerSet labels. Is there anything I might be misunderstanding?

Yes, but metric labels don't necessarily go away immediately from the metric storage - and Fleet can be relatively high cardinality (depending on how you use it), so with n GameServerSets per Fleet, the count increases 2 times with each update -- so it can be a lot. It seems like a decent risk when it's likely we can resolve this in a different way.
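
As a rough illustration of the concern above, the sketch below estimates how many distinct GameServerSet label values a metrics backend could accumulate. It rests on assumptions not stated in the thread: each Fleet starts with one GameServerSet, every rollout creates exactly one new GameServerSet, and old label values are never expired from storage. The helper name `distinctGSSLabelValues` is hypothetical.

```go
package main

import "fmt"

// distinctGSSLabelValues is a hypothetical helper. It estimates how many
// distinct GameServerSet label values would have been recorded, assuming one
// initial GameServerSet per Fleet, one new GameServerSet per rollout, and no
// expiry of old series in the metrics storage.
func distinctGSSLabelValues(fleets, rolloutsPerFleet int) int {
	return fleets * (1 + rolloutsPerFleet)
}

func main() {
	// Compare a small and a large installation after 20 rollouts per Fleet.
	for _, fleets := range []int{5, 100} {
		fmt.Printf("fleets=%d fleet-level label values=%d gss-level label values=%d\n",
			fleets, fleets, distinctGSSLabelValues(fleets, 20))
	}
}
```

Under these assumptions the growth is linear in the number of rollouts, but the total still scales with the number of Fleets, and each label value is multiplied again by any other labels on the metric.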

Reasonably commented 8 months ago

@markmandel I have already read the entire ticket. The requirement of this ticket was to have the desired, current, and allocated statuses of the GameServerSets created during the rollout process reported as gauge metrics.
If feature #2817 is applied and the current state of the fleet is exposed through metrics, it would be possible to determine the completion of the rollout through these metrics. However, this approach would not enable detailed analysis after the rollout in the event of delays during the rollout process.

I thought it would not be a significant issue for a GSS metric to have a cardinality of n times the number of fleets, because some users may find 5 fleets sufficient, while others might use up to 100 fleets.
Each user may also have different strategies for version updates and allocations, so I think being able to see the update process visually after it has completed is a significant advantage. What about optionally exposing the GSS metrics?

markmandel commented 4 months ago

We were chatting about this offline, and wanted to come back to it - I'm wondering if we could have a metric that was more of a gauge, agones_fleets_gamserverset_count, with the Fleet name as a label.

Then we don't hit cardinality explosion, but if you look at the number, you know if it's 1 there's no rollout, and if it's > 1 you know it's still rolling out / new versions are in flight.

WDYT? Seems like it would be useful!
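
A minimal sketch of the counting idea above, using only the standard library. The `gameServerSet` struct and `countByFleet` function are hypothetical stand-ins for illustration; they are not the Agones types or the code in pkg/metrics. The resulting per-Fleet count is the value such a gauge would report, with a count above 1 read as "rollout in progress".

```go
package main

import "fmt"

// gameServerSet is a simplified, hypothetical stand-in that only records
// which Fleet owns a GameServerSet.
type gameServerSet struct {
	Name  string
	Fleet string
}

// countByFleet returns the number of GameServerSets owned by each Fleet.
// Only the Fleet name is used as a key, so the number of distinct keys is
// bounded by the number of Fleets.
func countByFleet(sets []gameServerSet) map[string]int {
	counts := make(map[string]int)
	for _, s := range sets {
		counts[s.Fleet]++
	}
	return counts
}

func main() {
	sets := []gameServerSet{
		{Name: "lobby-abc", Fleet: "lobby"},
		{Name: "match-old", Fleet: "match"},
		{Name: "match-new", Fleet: "match"},
	}
	for fleet, n := range countByFleet(sets) {
		// More than one GameServerSet suggests new versions are in flight.
		fmt.Printf("fleet=%s gameserversets=%d rolling_out=%t\n", fleet, n, n > 1)
	}
}
```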

markmandel commented 3 months ago

I'm also wondering if we could instead / additionally expose a percentage (agones_fleet_rollout_percent ?) of how far the rollout to the active GameServerSet has progressed across the Fleet - and do that as a Gauge with a Fleet label. I think this would be doable and probably more in line with what you are looking for.

Does that make more sense?

markmandel commented 3 months ago

@ashutosji have a look and see what you think.

Some resources to look at:

https://github.com/googleforgames/agones/tree/main/pkg/metrics (this is where we calculate our aggregate metrics)

Grab all GameServerSets for a Fleet https://github.com/googleforgames/agones/blob/96c3d26857deebd05917260501f7ac74094ba461/pkg/fleets/fleets.go#L29-L29

How we get which GameServerSet is "active":

https://github.com/googleforgames/agones/blob/96c3d26857deebd05917260501f7ac74094ba461/pkg/fleets/controller.go#L298-L304

https://github.com/googleforgames/agones/blob/96c3d26857deebd05917260501f7ac74094ba461/pkg/fleets/controller.go#L719

So the logic would be - create a percentage value (integer) based on the active GameServerSet's Current count vs the Fleet's desired count (I think that math works).
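
The percentage logic described above could be sketched roughly as follows. This is an illustration only: `rolloutPercent` is a hypothetical function, not existing Agones code, and its handling of edge cases (a desired count of zero reported as 100, values clamped to the 0-100 range) is an assumption rather than anything decided in this thread.

```go
package main

import "fmt"

// rolloutPercent returns an integer percentage of the Fleet's desired count
// that is currently provided by the active GameServerSet.
// Assumptions (not from the thread): a Fleet with no desired replicas counts
// as fully rolled out, and the result is clamped to the range 0-100.
func rolloutPercent(activeCurrent, fleetDesired int32) int32 {
	if fleetDesired <= 0 {
		return 100
	}
	if activeCurrent <= 0 {
		return 0
	}
	// Use int64 for the multiplication to avoid overflow on large counts.
	p := int64(activeCurrent) * 100 / int64(fleetDesired)
	if p > 100 {
		return 100
	}
	return int32(p)
}

func main() {
	// Active GameServerSet has 5 of the Fleet's 10 desired replicas.
	fmt.Println(rolloutPercent(5, 10))
	// Rollout finished: the active set carries the whole Fleet.
	fmt.Println(rolloutPercent(10, 10))
}
```

Integer division truncates, so 1 of 3 replicas is reported as 33 rather than rounded up; a rollout only reads as 100 once the active GameServerSet reaches the Fleet's desired count.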

github-actions[bot] commented 3 weeks ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions

Reasonably commented 1 week ago

I completely forgot about this issue. Adding the agones_fleet_rollout_percent metric would generally be very helpful in detecting delays in rolling updates. Thank you!

However, I still think having something like gss_replica_count could be useful for identifying the root cause of these rollout delays. If cardinality is a concern, what about making this optional?