openkruise / kruise-game

Game Servers Management on Kubernetes
https://openkruise.io/kruisegame/introduction
Apache License 2.0
[proposal] Elegant update and offline of GameServers #161

Closed chrisliu1995 closed 1 day ago

chrisliu1995 commented 2 months ago

Background

Game servers, due to their strong stateful characteristics, have a high demand for graceful shutdown operations. A game server typically needs to wait until data is fully persisted to disk and ensured to be safe before it can be thoroughly removed. Although Kubernetes natively provides the preStop hook, which allows containers to execute specific actions before they are about to shut down, there is a limitation: once the preset time limit is exceeded, the container will have to be forcibly terminated, regardless of whether the data processing is complete or not. In some cases, this approach lacks real gracefulness. We need a more flexible mechanism to ensure that game servers can exit smoothly while protecting all critical states.

OpenKruise has introduced the Lifecycle Hook feature, which provides precise control and waiting mechanisms for game servers at critical lifecycle moments. This allows servers to execute the actual deletion or update operations only after meeting specific conditions. By providing a configurable Lifecycle field, combined with the ability to customize service quality, OKG ensures that the game server's shutdown process is both graceful and reliable. With this advanced feature, maintainers can ensure that all necessary data persistence and internal state synchronization are safely and correctly completed before the server is smoothly removed or updated.

Example

(This example cannot be run successfully yet, because the lifecycle field has not been exposed.)

apiVersion: game.kruise.io/v1alpha1
kind: GameServerSet
metadata:
  name: minecraft
  namespace: default
spec:
  replicas: 3
  lifecycle:
    preDelete:
      labelsHandler:
        gs-sync/delete-block: "true"
  gameServerTemplate:
    metadata:
      labels:
        gs-sync/delete-block: "true"
    spec:
      containers:
        - image: registry.cn-beijing.aliyuncs.com/chrisliu95/minecraft-demo:probe-v0
          name: minecraft
          volumeMounts:
            - name: gsinfo
              mountPath: /etc/gsinfo
      volumes:
        - name: gsinfo
          downwardAPI:
            items:
              - path: "state"
                fieldRef:
                  fieldPath: metadata.labels['game.kruise.io/gs-state']
  serviceQualities:
    - name: healthy
      containerName: minecraft
      permanent: false
      exec:
        command: ["bash", "./probe.sh"]
      serviceQualityAction:
        - state: true
          result: done
          labels:
            gs-sync/delete-block: "false"
        - state: true
          result: WaitToBeDeleted
          opsState: WaitToBeDeleted
        - state: false
          opsState: None

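The corresponding probe script (probe.sh in the example above) is as follows. While the server is running it reports WaitToBeDeleted once the player count reaches 0; in the PreDelete state it first triggers the data flush and then reports done once the flush marker exists: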

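#!/bin/bash

# State file projected from the game.kruise.io/gs-state label via the downward API.
file_path="/etc/gsinfo/state"
# Marker file recording that the data flush has been triggered.
data_flushed_file="/etc/gsinfo/data_flushed"

# No state file yet: nothing to report.
if [[ ! -f "$file_path" ]]; then
    exit 0
fi

state_content=$(cat "$file_path")

if [[ "$state_content" == "PreDelete" ]]; then
    if [[ -f "$data_flushed_file" ]]; then
        # Flush already recorded: report "done" so the delete-block label can be released.
        echo "done"
        exit 1
    else
        # First call in PreDelete: record the flush and keep waiting.
        touch "$data_flushed_file"
        echo "WaitToBeDeleted"
        exit 1
    fi
else
    people_count_file="/etc/gsinfo/people_count"

    people_count=$(cat "$people_count_file")

    # No players left: mark the server as ready to be deleted.
    if [[ "$people_count" -eq 0 ]]; then
        echo "WaitToBeDeleted"
        exit 1
    else
        exit 0
    fi
fi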
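
To try the branching outside a cluster, here is a minimal sketch that wraps the same logic in a function and takes the directory as an argument. The function name `probe`, the temporary directory, and the `Ready` state value are assumptions for illustration; in the proposal the paths are fixed under /etc/gsinfo and the script is ./probe.sh.

```shell
#!/bin/bash
# Hypothetical local harness: same branching as the probe script, but the
# gsinfo directory is passed as $1 so it can run against a temp directory.
probe() {
    local dir="$1"
    local state_file="$dir/state"
    local flushed_file="$dir/data_flushed"

    # No state file yet: nothing to report.
    [[ -f "$state_file" ]] || return 0

    if [[ "$(cat "$state_file")" == "PreDelete" ]]; then
        if [[ -f "$flushed_file" ]]; then
            echo "done"
        else
            touch "$flushed_file"
            echo "WaitToBeDeleted"
        fi
        return 1
    fi

    # Not in PreDelete: report WaitToBeDeleted only when no players remain.
    if [[ "$(cat "$dir/people_count")" -eq 0 ]]; then
        echo "WaitToBeDeleted"
        return 1
    fi
    return 0
}

tmp="$(mktemp -d)"
echo "Ready" > "$tmp/state"          # assumed non-PreDelete state value
echo 3 > "$tmp/people_count"
r1="$(probe "$tmp")"; c1=$?          # players online
echo 0 > "$tmp/people_count"
r2="$(probe "$tmp")"; c2=$?          # server idle
echo "PreDelete" > "$tmp/state"
r3="$(probe "$tmp")"; c3=$?          # first call in PreDelete
r4="$(probe "$tmp")"; c4=$?          # flush marker now present
printf '%s|%s|%s|%s\n' "$r1:$c1" "$r2:$c2" "$r3:$c3" "$r4:$c4"
rm -rf "$tmp"
```

Each `result:exit-code` pair in the printed line corresponds to one probe call, so the run prints `:0|WaitToBeDeleted:1|WaitToBeDeleted:1|done:1`.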

The graceful deletion process is as follows:

  1. The game server is running normally, and the number of players is not 0.
  2. When the number of players drops to 0, custom service quality sets the opsState to WaitToBeDeleted.
  3. Through the automatic scaling policy, OKG deletes the GameServer whose opsState is WaitToBeDeleted. Because the lifecycle hook is configured and the delete-block label is set to true, the GameServer is not actually deleted; it enters the PreDelete state, and custom service quality triggers the data flush.
  4. Once the data flush is complete, custom service quality sets the delete-block label to false, which releases the checkpoint.
  5. After the checkpoint is released, the PreDelete phase moves into the Delete phase, and the GameServer is actually deleted.
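
The reaction to each probe outcome in the steps above can be sketched as a small lookup that mirrors the serviceQualityAction entries in the example. This is only an illustration: the function name `apply_action`, its output strings, and the `no-op` fallback are assumptions, not OKG behavior or API.

```shell
#!/bin/bash
# Hypothetical helper mirroring the serviceQualityAction list in the example.
# $1 is the probe state (true/false), $2 is the probe result string.
apply_action() {
    local state="$1" result="$2"
    if [[ "$state" == "false" ]]; then
        echo "opsState=None"
        return
    fi
    case "$result" in
        "done")            echo "label gs-sync/delete-block=false" ;;
        "WaitToBeDeleted") echo "opsState=WaitToBeDeleted" ;;
        *)                 echo "no-op" ;;   # assumed fallback for unlisted results
    esac
}

apply_action false ""                # step 1: players online
apply_action true  WaitToBeDeleted   # step 2: player count reached 0
apply_action true  WaitToBeDeleted   # step 3: PreDelete, flush triggered
apply_action true  done              # step 4: flush complete, release checkpoint
```

Step 5 has no entry here because it follows from the label change: once delete-block is false, the lifecycle hook no longer holds the GameServer in PreDelete.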

TODO

  1. Expose the Lifecycle field in GameServerSet.Spec.
  2. Add "PreDelete" / "PreUpdate" runtime states for GameServer.

ashish111333 commented 2 weeks ago

@chrisliu1995 @ringtail Should I use a ConfigMap to store this script? Or do I have other options?

chrisliu1995 commented 2 weeks ago

The shell script is a file inside the container, and OKG calls it periodically. You can read about this feature at https://openkruise.io/kruisegame/user-manuals/service-qualities/

ashish111333 commented 2 weeks ago

@chrisliu1995 So is this proposal complete? Is this resolved? I can make a PR for this.

chrisliu1995 commented 2 weeks ago

It is completed; you can see the PR: https://github.com/openkruise/kruise-game/pull/162

I can see that you are willing to contribute to our project. If there are new features or enhancements to be done, I'll invite you to join. How about it?

ashish111333 commented 2 weeks ago

Thanks, yeah, I wanted to contribute. I have gone through the kruise-game docs and blogs and also have kruise-game cloned on my machine. I was using a ConfigMap to store the script, so thanks for clarifying; I was about to make a PR.

I would be happy to contribute to any future issues and enhancements.