I have a strong feeling this is because if a Pod gets deleted, the backing GameServer is left in a zombie state (i.e. not deleted along with it).
We should implement functionality such that if a Pod gets removed, the owning GameServer is deleted too. This should solve the issue.
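For context, the deletion chain today is GameServer → Pod: the Pod carries an ownerReference back to its GameServer. A quick way to confirm the owner from the cluster (the Pod name is illustrative, taken from a listing later in this thread; assumes Agones sets a single controller ownerReference on game server Pods, which it does):
$ kubectl get pod fleet-example-s99fw-6vvp9-n25rv \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
GameServer/fleet-example-s99fw-6vvp9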
Actually - I'm not sure what this is - I tested deleting the backing Pod from a GameServer, and the GameServer gets deleted. More investigation required!
I think the best way to recreate it is to follow what I did, because it recreated one of the fleets but not the other two, so there might be a random factor in there. I would suggest trying it with a few different fleets; it seems to be random - sometimes it happens, sometimes it doesn't.
I will try to reproduce this issue today and provide all the relevant details along the way.
Before migrating to a new node pool:
kubectl describe fleets
I get the following:

Name:         deathmatch-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=deathmatch
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"deathmatch"},"name":"deathmatch-server","namespace":"d...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:14Z
  Generation:          1
  Resource Version:    9493539
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/deathmatch-server
  UID:                 b7fd3272-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  deathmatch
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  game-server:0.3.0.4
            Name:   deathmatch-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age  From              Message
  ----    ------                 ---  ----              -------
  Normal  CreatingGameServerSet  6m   fleet-controller  Created GameServerSet deathmatch-server-ktjv8

Name:         endless-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=endless
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"endless"},"name":"endless-server","namespace":"default...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:24:08Z
  Generation:          1
  Resource Version:    9493228
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/endless-server
  UID:                 6cf44e7c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  endless
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  game-server:0.3.0.4
            Name:   endless-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age  From              Message
  ----    ------                 ---  ----              -------
  Normal  CreatingGameServerSet  8m   fleet-controller  Created GameServerSet endless-server-vjb45

Name:         royale-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=royale
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"royale"},"name":"royale-server","namespace":"default"}...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:18Z
  Generation:          1
  Resource Version:    9493571
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/royale-server
  UID:                 baad690c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  royale
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  game-server:0.3.0.4
            Name:   royale-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age  From              Message
  ----    ------                 ---  ----              -------
  Normal  CreatingGameServerSet  6m   fleet-controller  Created GameServerSet royale-server-28ft9
kubectl describe fleets
insists I have all of them online:

Name:         deathmatch-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=deathmatch
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"deathmatch"},"name":"deathmatch-server","namespace":"d...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:14Z
  Generation:          1
  Resource Version:    9495633
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/deathmatch-server
  UID:                 b7fd3272-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  deathmatch
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  game-server:0.3.0.4
            Name:   deathmatch-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age  From              Message
  ----    ------                 ---  ----              -------
  Normal  CreatingGameServerSet  14m  fleet-controller  Created GameServerSet deathmatch-server-ktjv8

Name:         endless-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=endless
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"endless"},"name":"endless-server","namespace":"default...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:24:08Z
  Generation:          1
  Resource Version:    9493228
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/endless-server
  UID:                 6cf44e7c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  endless
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  game-server:0.3.0.4
            Name:   endless-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age  From              Message
  ----    ------                 ---  ----              -------
  Normal  CreatingGameServerSet  16m  fleet-controller  Created GameServerSet endless-server-vjb45

Name:         royale-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=royale
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"royale"},"name":"royale-server","namespace":"default"}...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:18Z
  Generation:          1
  Resource Version:    9493571
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/royale-server
  UID:                 baad690c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  royale
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  game-server:0.3.0.4
            Name:   royale-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age  From              Message
  ----    ------                 ---  ----              -------
  Normal  CreatingGameServerSet  14m  fleet-controller  Created GameServerSet royale-server-28ft9
I was able to reproduce the issue on GKE, but only on the second attempt. First I switched from a 4-node node pool to a new 3-node pool and all Pods remained the same; on the second attempt I switched to a new 3-node pool and deleted the old one, and now the output of kubectl get pods and kubectl get gs differ.
Also note that after switching to the new node pool I can allocate a server but cannot connect to the GameServer using nc -u. It seems that the IP and port contain information from the previous node pool.
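For anyone retracing this, the address and port a GameServer advertises can be pulled from its status and probed directly (name and values taken from the listing below; assumes the v1alpha1 status layout with .status.address and .status.ports):
$ kubectl get gs fleet-example-s99fw-6vvp9 -o jsonpath='{.status.address}:{.status.ports[0].port}'
35.247.112.202:7704
$ nc -u 35.247.112.202 7704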
:build$ kubectl get pods
NAME READY STATUS RESTARTS AGE
fleet-example-s99fw-6vvp9-n25rv 2/2 Running 0 10m
fleet-example-s99fw-95n9f-z57ph 2/2 Running 0 10m
simple-udp3-5ng8z-xvw87-shkkj 2/2 Running 0 1h
simple-udp32-8mps7-bx8rr-5mwtd 2/2 Running 0 1h
simple-udp32-8mps7-j8ffc-fdpzs 2/2 Running 0 1h
simple-udp322-phvk2-5t6x4-fmh7q 2/2 Running 0 1h
:build$ kubectl get gs
NAME STATE ADDRESS PORT NODE AGE
fleet-example-s99fw-6vvp9 Ready 35.247.112.202 7704 gke-test-cluster-pool-2-ab46da87-6qxw 11m
fleet-example-s99fw-95n9f Ready 35.247.112.202 7938 gke-test-cluster-pool-2-ab46da87-6qxw 10m
simple-udp3-5ng8z-4nztc Ready 35.247.88.114 7026 gke-test-cluster-default-4b096bd5-vt6d 1h
simple-udp3-5ng8z-9k2fp Ready 35.247.88.114 7769 gke-test-cluster-default-4b096bd5-vt6d 1h
simple-udp3-5ng8z-9m82d Ready 104.196.235.107 7794 gke-test-cluster-pool-1-f618dd8c-mkw0 1h
simple-udp3-5ng8z-mwwrd Ready 35.247.88.114 7283 gke-test-cluster-default-4b096bd5-vt6d 1h
simple-udp3-5ng8z-xvw87 Ready 35.247.88.114 7762 gke-test-cluster-pool-2-ab46da87-6qfz 1h
simple-udp32-8mps7-8ng9j Ready 104.196.235.107 7143 gke-test-cluster-pool-1-f618dd8c-mkw0 1h
simple-udp32-8mps7-bx8rr Ready 35.247.112.202 7832 gke-test-cluster-pool-2-ab46da87-6qxw 1h
simple-udp32-8mps7-gc22g Ready 35.247.88.114 7840 gke-test-cluster-default-4b096bd5-vt6d 1h
simple-udp32-8mps7-j8ffc Ready 35.247.88.114 7516 gke-test-cluster-pool-2-ab46da87-6qfz 1h
simple-udp32-8mps7-p49xz Ready 35.247.7.156 7010 gke-test-cluster-pool-1-f618dd8c-5d0t 1h
simple-udp322-phvk2-5t6x4 Ready 35.247.88.114 7631 gke-test-cluster-pool-2-ab46da87-6qfz 1h
There are no events on the simple-udp3 fleet, and it shows 5 current replicas:
:build$ kubectl describe fleet simple-udp3
Name:         simple-udp3
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{},"name":"simple-udp3","namespace":"default"},"spec":{"replicas":5...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Creation Timestamp:  2019-01-18T09:45:43Z
  Generation:          1
  Resource Version:    176561
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/simple-udp3
  UID:                 d28f4018-1b05-11e9-b6e3-42010a8a002f
Spec:
  Replicas:    5
  Scheduling:  Packed
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:  RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
    Spec:
      Health:
      Ports:
        Container Port:  7654
        Name:            default
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  gcr.io/agones-images/udp-server:0.5
            Name:   simple-udp
            Resources:
Status:
  Allocated Replicas:  0
  Ready Replicas:      5
  Replicas:            5
Events:  <none>
However, in the pods list there is only one record for this fleet, not five:
simple-udp3-5ng8z-xvw87-shkkj 2/2 Running 0 1h
As you can see above from the output of kubectl get gs, the NODE field shows that the Pods belong to different node pools: gke-test-cluster-default, gke-test-cluster-pool-2, gke-test-cluster-pool-1. However, only pool-2 was running at that moment.
Also I noticed that the ADDRESS and PORT for some Pods were not changed after deleting the node pool.
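Counting GameServers against their backing Pods by name prefix makes the mismatch explicit (a quick sketch; the trailing hyphen keeps simple-udp32 out of the count):
$ kubectl get gs --no-headers | grep -c '^simple-udp3-'
5
$ kubectl get pods --no-headers | grep -c '^simple-udp3-'
1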
So it seems that if you delete a node pool, the Pods still exist inside Kubernetes?
I'm starting to think this might be a GKE bug!
Found such error messages on the agones-controller:
2019-01-23 05:38:06.539 UTC-8 error creating gameserver for gameserverset fleet-example-x24zx: Internal error occurred: failed calling admission webhook "mutations.stable.agones.dev": Post https://agones-controller-service.agones-system.svc:443/mutate?timeout=30s: no endpoints available for service "agones-controller-service"
which appears on calls to gameserversets.(Controller).syncMoreGameServers() and (Controller).syncGameServerSetState().
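The "no endpoints available" part can be checked directly - while the controller Pod is gone, the webhook's backing Service has no ready endpoints (output shape illustrative):
$ kubectl get endpoints agones-controller-service -n agones-system
NAME                        ENDPOINTS   AGE
agones-controller-service   <none>      2d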
Also note that after switching node pools, the new nodes have to be added to the game-server-firewall game server firewall rule manually.
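For the record, a rough sketch of what that looks like on GKE, assuming the game-server-firewall rule targets a game-server network tag as in the Agones GKE setup guide (cluster and pool names illustrative):
$ gcloud compute firewall-rules describe game-server-firewall \
    --format='value(targetTags.list())'
game-server
$ gcloud container node-pools create pool-2 \
    --cluster=test-cluster \
    --num-nodes=3 \
    --tags=game-server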
ooh, I wonder if it's because the agones controller is being taken down -- and that means the webhook can't be fired - which may not be something we can actually fix? :confused:
2 thoughts for next steps:
1) Add some logging here: https://github.com/GoogleCloudPlatform/agones/blob/master/pkg/gameservers/controller.go#L158 and see if the Pod deletion event gets fired when you switch the node pools. I'm wondering if it doesn't, and that's what is causing the issue.
2) Set up the Agones controller to run on its own node pool, then switch out the node pool for the game servers, and see if it happens there (see the sketch after this list).
I'm wondering if the controller not being removed along with the node pool solves the issue (at least partially) - or at least provides a documentable workaround.
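A minimal sketch of step 2 on GKE, assuming the dedicated node pool convention from the Agones install docs of this era (the stable.agones.dev/agones-system=true label and taint; treat the exact keys as an assumption):
$ gcloud container node-pools create agones-system \
    --cluster=test-cluster \
    --num-nodes=1 \
    --node-taints stable.agones.dev/agones-system=true:NoExecute \
    --node-labels stable.agones.dev/agones-system=true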
I will reproduce this on the latest master with these 2 steps
In the gameservers controller code:
if oldPod.Spec.NodeName != newPod.Spec.NodeName {
this condition fires only on Pod creation, and it fires twice - once for the first pool and once for the second. Note that oldPod.Spec.NodeName is always empty.
After the node pool switch all game servers restarted, but the game-server-firewall setting is missing, so nc -u is not working against a game server after the node switch.
Another issue is that during the process the connection via kubectl was refused:
kubectl get gs
No resources found.
The connection to the server 35.197.87.248 was refused - did you specify the right host or port?
However, nc -u does work after the node pool switch if run from the Node itself:
gcloud compute ssh ...
toolbox
nc -u 127.0.0.1 7424
> After the node pool switch all gs restarted, but the game-server-firewall setting is missing, so nc -u is not working on gs after the node switch.
A new node pool will need to be told to have the firewall tag; it won't be included automatically - so I don't think that part is a bug.
So apart from that, does it work?
@markmandel With the "agones-system" and "default" node pools separated, fleets restart well. If we add a new node pool and then delete "default", all game servers get restarted on the new nodes. I think this bug does not cover restarting the "agones-system" node pool.
@aLekSer - based on your last update, it sounds like Mark's guess above is likely correct: the problem occurs when the agones controller is down.
What I don't understand is why it wouldn't fix itself once the controller came back up. With a level triggered system (see thockin's nice presentation here) it shouldn't be an issue if a single "event" is missed; the controller should look at the current state when it comes up and make it match the desired state.
@roberthbailey Not quite sure about the root cause for now.
Now that #1008 is written, I think we can close this, as we give advice on how to perform upgrades that mitigate this issue (what seems to mostly be a race condition).
Also, the advice to set up separate node pools in production seems to resolve it.
I've noticed something weird today. I needed to swap a node pool in GKE, so I created a new node pool and deleted the old one. I expected all instances in the old node pool to recover in the new one after some time. However, in my particular case I could only see 1 of the 3 servers on the workloads page in GCloud. So I checked the fleets to see if they had the minimum availability, which was 1 of each kind = 3. And kubectl describe fleets indicated that 3 servers were online and available; however, when I tried to connect to one that was listed but not in workloads, it failed to connect. I was able to connect to the one appearing in workloads, but not the others. I had to delete the fleets and recreate them for them to appear and work correctly again.