openkruise / kruise-game

Game Servers Management on Kubernetes
https://openkruise.io/kruisegame/introduction
Apache License 2.0

Occasional "NotReady" Network Status on Pod Upon Rebuilding a GameServerSet #126

Closed alvin-7 closed 2 days ago

alvin-7 commented 7 months ago

We are seeing an issue when updating our GameServerSet (GSS). The update rebuilds all managed Pods, and occasionally one of the 6 running GameServer Pods fails to retrieve its network information, leaving it with a "NotReady" network status. The details and reproduction steps are below:

Environment:

Network Plugin: HostPort
Number of GameServer replicas in the GSS: 6

Steps to Reproduce:

  1. Update the GSS by changing the container image and environment variables. This action triggers a rebuild of all Pods managed by the GSS.
  2. After the old Pods are deleted and new ones are recreated, one of the six Pods encounters an error in obtaining network information.

Expected Behavior:

After the update and subsequent Pod recreation, all Pods should successfully retrieve their network information and display a "Ready" network status.

Log information:

The following log entries from kruise-game-manager appear to be related:

2024-01-26T14:59:46+08:00 I0126 06:59:46.237778       1 hostPort.go:73] Receiving pod dev/gs-dev-a4-3 ADD Operation
2024-01-26T14:59:46+08:00 I0126 06:59:46.237840       1 hostPort.go:80] There is a pod with same ns/name(dev/gs-dev-a4-3) exists in cluster, do not allocate

chrisliu1995 commented 7 months ago

When a pod is recreated, the network plugin (webhook) receives both a DELETE and an ADD operation. However, if the old pod is still in the cluster when the ADD arrives, the ADD operation fails and the webhook does not patch the ports onto the new pod.

We plan to refactor the webhook mutating mechanism for the network plugin. Here is the plan: when the webhook gets a plugin error, it will deny the request, and the creation request will be retried until no error occurs.
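The planned deny-and-retry behavior can be illustrated with a small, self-contained Go simulation. This is only a sketch under stated assumptions, not the kruise-game implementation: the `allocator` type, `admit`, `createWithRetry`, and the port numbers are all hypothetical names and values chosen for illustration. It models an ADD that fails while the old pod's entry still exists, a webhook that denies the request on that error, and a caller that retries until the DELETE has been processed.

```go
package main

import (
	"errors"
	"fmt"
)

// allocator is a hypothetical stand-in for the HostPort plugin's bookkeeping:
// it maps "namespace/name" to the host port held by that pod.
type allocator struct {
	ports map[string]int
	next  int
}

var errPodExists = errors.New("pod with same ns/name still exists")

// onAdd models the ADD operation: it refuses to allocate while an entry
// for the same ns/name (the old pod) is still present.
func (a *allocator) onAdd(key string) (int, error) {
	if _, ok := a.ports[key]; ok {
		return 0, errPodExists
	}
	a.next++
	a.ports[key] = a.next
	return a.next, nil
}

// onDelete models the DELETE operation releasing the old pod's entry.
func (a *allocator) onDelete(key string) { delete(a.ports, key) }

// admit models the planned webhook behavior (an assumption): a plugin
// error denies the request instead of admitting a pod without ports.
func admit(a *allocator, key string) (allowed bool, port int) {
	p, err := a.onAdd(key)
	if err != nil {
		return false, 0
	}
	return true, p
}

// createWithRetry models the creation request being re-issued until it is
// admitted. afterDeny runs after each denial; here it simulates the old
// pod finally being removed from the cluster.
func createWithRetry(a *allocator, key string, maxAttempts int, afterDeny func()) (port, attempts int, ok bool) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if allowed, p := admit(a, key); allowed {
			return p, attempts, true
		}
		afterDeny()
	}
	return 0, maxAttempts, false
}

func main() {
	key := "dev/gs-dev-a4-3"
	// The old pod still holds its entry when the new pod's ADD arrives.
	a := &allocator{ports: map[string]int{key: 8000}, next: 8000}
	port, attempts, ok := createWithRetry(a, key, 3, func() { a.onDelete(key) })
	fmt.Println(port, attempts, ok) // prints "8001 2 true"
}
```

In this model the first attempt is denied because the old entry is still present, and the second succeeds once the deletion has gone through, so the new pod never starts without its ports patched.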

chrisliu1995 commented 2 days ago

The newest version fixes this.