googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Moving cluster to a new node pool doesn't recreate all fleets #398

Closed KamiMay closed 4 years ago

KamiMay commented 6 years ago

I noticed something weird today. I needed to swap node pools in GKE, so I created a new node pool and deleted the old one. I expected all the instances from the old node pool to recover in the new one after some time, but in my case only 1 of the 3 servers showed up on the Workloads page in GCloud. I checked the fleets for minimum availability, which was 1 of each kind = 3, and kubectl describe fleets indicated that 3 servers were online and available. However, when I tried to connect to a server that was listed but not shown in Workloads, the connection failed; I could only connect to the one that appeared in Workloads. I had to delete the fleets and recreate them for them to appear and work correctly again.

markmandel commented 6 years ago

I have a strong feeling this is because if a Pod gets deleted, the backing GameServer is left in a zombie state (i.e. not deleted along with it).

We should implement functionality so that if a Pod gets removed, the owning GameServer is deleted too. This should solve this issue.
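
For illustration, a minimal sketch of the kind of handler this would take, using the client-go informer pattern; the GameServerDeleter interface and the function name are placeholders, not the actual Agones code:

package gameservers

import (
    "log"

    corev1 "k8s.io/api/core/v1"
    k8serrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/cache"
)

// GameServerDeleter stands in for the Agones clientset; the real type differs.
type GameServerDeleter interface {
    DeleteGameServer(namespace, name string) error
}

// onPodDelete deletes the GameServer that owns a deleted Pod, so the
// GameServer is not left behind in a zombie state.
func onPodDelete(obj interface{}, gsClient GameServerDeleter) {
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        // A delete that the watch missed is delivered as a tombstone.
        tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
        if !ok {
            return
        }
        if pod, ok = tombstone.Obj.(*corev1.Pod); !ok {
            return
        }
    }
    // Walk the owner reference back to the controlling GameServer.
    owner := metav1.GetControllerOf(pod)
    if owner == nil || owner.Kind != "GameServer" {
        return
    }
    if err := gsClient.DeleteGameServer(pod.Namespace, owner.Name); err != nil && !k8serrors.IsNotFound(err) {
        log.Printf("failed to delete GameServer %s/%s: %v", pod.Namespace, owner.Name, err)
    }
}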

markmandel commented 5 years ago

Actually, I'm not sure what this is: I tested deleting the backing Pod of a GameServer, and the GameServer gets deleted. More investigation required!

KamiMay commented 5 years ago

I think the best way to reproduce it is to follow what I did, because it recreated one of the fleets but not the other two, so there might be a random factor involved. I would suggest trying it with a few different fleets; it seems to be random, sometimes it happens and sometimes it doesn't.

KamiMay commented 5 years ago

I will try to reproduce this issue today and provide all the relevant details along the way.

KamiMay commented 5 years ago

Before migrating to a new node pool:

Name:         deathmatch-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=deathmatch
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"deathmatch"},"name":"deathmatch-server","namespace":"d...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:14Z
  Generation:          1
  Resource Version:    9493539
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/deathmatch-server
  UID:                 b7fd3272-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  deathmatch
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     deathmatch-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  6m    fleet-controller  Created GameServerSet deathmatch-server-ktjv8

Name:         endless-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=endless
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"endless"},"name":"endless-server","namespace":"default...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:24:08Z
  Generation:          1
  Resource Version:    9493228
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/endless-server
  UID:                 6cf44e7c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  endless
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     endless-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  8m    fleet-controller  Created GameServerSet endless-server-vjb45

Name:         royale-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=royale
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"royale"},"name":"royale-server","namespace":"default"}...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:18Z
  Generation:          1
  Resource Version:    9493571
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/royale-server
  UID:                 baad690c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  royale
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     royale-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  6m    fleet-controller  Created GameServerSet royale-server-28ft9

After migrating to a new node pool:

Name:         deathmatch-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=deathmatch
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"deathmatch"},"name":"deathmatch-server","namespace":"d...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:14Z
  Generation:          1
  Resource Version:    9495633
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/deathmatch-server
  UID:                 b7fd3272-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  deathmatch
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     deathmatch-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  14m   fleet-controller  Created GameServerSet deathmatch-server-ktjv8

Name:         endless-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=endless
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"endless"},"name":"endless-server","namespace":"default...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:24:08Z
  Generation:          1
  Resource Version:    9493228
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/endless-server
  UID:                 6cf44e7c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  endless
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     endless-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  16m   fleet-controller  Created GameServerSet endless-server-vjb45

Name:         royale-server
Namespace:    default
Labels:       <none>
Annotations:  gameMode=royale
              kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{"gameMode":"royale"},"name":"royale-server","namespace":"default"}...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-11-16T17:26:18Z
  Generation:          1
  Resource Version:    9493571
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/royale-server
  UID:                 baad690c-e9c4-11e8-b175-42010a8400fd
Spec:
  Replicas:  1
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
      Labels:
        Game Mode:  royale
    Spec:
      Health:
      Ports:
        Container Port:  4444
        Name:            default
        Port Policy:     dynamic
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:    game-server:0.3.0.4
            Name:     royale-server
            Resources:
              Requests:
                Cpu:  300m
Status:
  Allocated Replicas:  0
  Ready Replicas:      1
  Replicas:            1
Events:
  Type    Reason                 Age   From              Message
  ----    ------                 ----  ----              -------
  Normal  CreatingGameServerSet  14m   fleet-controller  Created GameServerSet royale-server-28ft9

aLekSer commented 5 years ago

I was able to reproduce the issue on GKE, but only on the second attempt. At first I switched from a 4-node pool to a new 3-node pool and all pods remained the same; on the second attempt I switched to a new 3-node pool and deleted the old one, and now the outputs of kubectl get pods and kubectl get gs differ. Also note that after switching to the new node pool I can allocate a server but cannot connect to the GameServer using nc -u. It seems that the IP and port still contain information from the previous node pool.

:build$ kubectl get pods
NAME                              READY     STATUS    RESTARTS   AGE
fleet-example-s99fw-6vvp9-n25rv   2/2       Running   0          10m
fleet-example-s99fw-95n9f-z57ph   2/2       Running   0          10m
simple-udp3-5ng8z-xvw87-shkkj     2/2       Running   0          1h
simple-udp32-8mps7-bx8rr-5mwtd    2/2       Running   0          1h
simple-udp32-8mps7-j8ffc-fdpzs    2/2       Running   0          1h
simple-udp322-phvk2-5t6x4-fmh7q   2/2       Running   0          1h
:build$ kubectl get gs
NAME                        STATE     ADDRESS           PORT      NODE                                     AGE
fleet-example-s99fw-6vvp9   Ready     35.247.112.202    7704      gke-test-cluster-pool-2-ab46da87-6qxw    11m
fleet-example-s99fw-95n9f   Ready     35.247.112.202    7938      gke-test-cluster-pool-2-ab46da87-6qxw    10m
simple-udp3-5ng8z-4nztc     Ready     35.247.88.114     7026      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp3-5ng8z-9k2fp     Ready     35.247.88.114     7769      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp3-5ng8z-9m82d     Ready     104.196.235.107   7794      gke-test-cluster-pool-1-f618dd8c-mkw0    1h
simple-udp3-5ng8z-mwwrd     Ready     35.247.88.114     7283      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp3-5ng8z-xvw87     Ready     35.247.88.114     7762      gke-test-cluster-pool-2-ab46da87-6qfz    1h
simple-udp32-8mps7-8ng9j    Ready     104.196.235.107   7143      gke-test-cluster-pool-1-f618dd8c-mkw0    1h
simple-udp32-8mps7-bx8rr    Ready     35.247.112.202    7832      gke-test-cluster-pool-2-ab46da87-6qxw    1h
simple-udp32-8mps7-gc22g    Ready     35.247.88.114     7840      gke-test-cluster-default-4b096bd5-vt6d   1h
simple-udp32-8mps7-j8ffc    Ready     35.247.88.114     7516      gke-test-cluster-pool-2-ab46da87-6qfz    1h
simple-udp32-8mps7-p49xz    Ready     35.247.7.156      7010      gke-test-cluster-pool-1-f618dd8c-5d0t    1h
simple-udp322-phvk2-5t6x4   Ready     35.247.88.114     7631      gke-test-cluster-pool-2-ab46da87-6qfz    1h

There are no events on the simple-udp3 fleet, and it still reports 5 ready replicas:

:build$ kubectl describe fleet simple-udp3                                                                                                                                  
Name:         simple-udp3
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"stable.agones.dev/v1alpha1","kind":"Fleet","metadata":{"annotations":{},"name":"simple-udp3","namespace":"default"},"spec":{"replicas":5...
API Version:  stable.agones.dev/v1alpha1
Kind:         Fleet
Metadata:
  Creation Timestamp:  2019-01-18T09:45:43Z
  Generation:          1
  Resource Version:    176561
  Self Link:           /apis/stable.agones.dev/v1alpha1/namespaces/default/fleets/simple-udp3
  UID:                 d28f4018-1b05-11e9-b6e3-42010a8a002f
Spec:
  Replicas:    5
  Scheduling:  Packed
  Strategy:
    Rolling Update:
      Max Surge:        25%
      Max Unavailable:  25%
    Type:               RollingUpdate
  Template:
    Metadata:
      Creation Timestamp:  <nil>
    Spec:
      Health:
      Ports:
        Container Port:  7654
        Name:            default
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Image:  gcr.io/agones-images/udp-server:0.5
            Name:   simple-udp
            Resources:
Status:
  Allocated Replicas:  0
  Ready Replicas:      5
  Replicas:            5
Events:                <none>

However, the pod list contains only one record for this fleet, not five:

simple-udp3-5ng8z-xvw87-shkkj     2/2       Running   0          1h

aLekSer commented 5 years ago

As you can see above from the output of kubectl get gs, the NODE field shows that the Pods belong to different node pools: gke-test-cluster-default, gke-test-cluster-pool-2, and gke-test-cluster-pool-1. However, only pool-2 was running at that moment. I also noticed that the ADDRESS and PORT of some Pods were not updated after the node pool was deleted.

markmandel commented 5 years ago

So it seems that if you delete a node pool, the Pods still exist inside Kubernetes?

I'm starting to think this might be a GKE bug!

aLekSer commented 5 years ago

Found error messages like this on the agones-controller:

2019-01-23 05:38:06.539 UTC-8 error creating gameserver for gameserverset fleet-example-x24zx: Internal error occurred: failed calling admission webhook "mutations.stable.agones.dev": Post https://agones-controller-service.agones-system.svc:443/mutate?timeout=30s: no endpoints available for service "agones-controller-service"

It appears on calls to gameserversets.(Controller).syncMoreGameServers() and (Controller).syncGameServerSetState().

Also note that after switching node pools, the new nodes have to be added to the game-server-firewall firewall rule manually.

markmandel commented 5 years ago

ooh, I wonder if it's because the agones controller is being taken down, which means the webhook can't be fired, and that may not be something we can actually fix? :confused:

markmandel commented 5 years ago

Two thoughts for next steps:

1) Add some logging here: https://github.com/GoogleCloudPlatform/agones/blob/master/pkg/gameservers/controller.go#L158 and see if the Pod deletion event gets fired when you switch the node pools. I'm wondering if it doesn't, and that's what is causing the issue (see the sketch after this list).

2) Set up the Agones controller to run on its own node pool, then switch out the node pool for the game servers, and see if it happens there.

I'm wondering if the controller not being removed along with the node pool solves the issue (at least partially), or at least provides a documentable workaround.
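
For step 1, a minimal sketch of that logging, assuming a client-go shared informer for Pods (the function is illustrative, not the existing controller code):

package gameservers

import (
    "log"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// registerPodDeleteLogging logs every Pod deletion the informer delivers, so
// we can tell whether deletion events arrive at all when a node pool is removed.
func registerPodDeleteLogging(podInformer cache.SharedIndexInformer) {
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        DeleteFunc: func(obj interface{}) {
            if pod, ok := obj.(*corev1.Pod); ok {
                log.Printf("pod deleted: %s/%s on node %q", pod.Namespace, pod.Name, pod.Spec.NodeName)
                return
            }
            // Deletes missed by the watch arrive wrapped in a tombstone.
            log.Printf("pod delete tombstone: %#v", obj)
        },
    })
}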

aLekSer commented 5 years ago

I will reproduce this on the latest master with these two steps.

aLekSer commented 5 years ago

In the gameservers controller code:

if oldPod.Spec.NodeName != newPod.Spec.NodeName {

This condition only fires on Pod creation; it fired twice, once for the first pool and once for the second. Note that oldPod.Spec.NodeName is always empty.
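
To unpack that (a fragment sketch of the handler shape, with the reasoning as comments): Spec.NodeName is immutable once set, so the only update where old and new differ is the scheduler's initial binding.

UpdateFunc: func(oldObj, newObj interface{}) {
    oldPod := oldObj.(*corev1.Pod)
    newPod := newObj.(*corev1.Pod)
    // Spec.NodeName only ever transitions "" -> "<node>" when the scheduler
    // binds the Pod. A Pod is never rescheduled in place (it is deleted and
    // recreated), so this branch fires once per Pod, at binding time, which
    // is why oldPod.Spec.NodeName is always empty here.
    if oldPod.Spec.NodeName != newPod.Spec.NodeName {
        // ...
    }
},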

After the node pool switch all GameServers restarted, but the game-server-firewall setting is missing, so nc -u does not work against the GameServers after the switch. Another issue is that during the process the connection from kubectl to the API server was refused:

kubectl get gs
No resources found.
The connection to the server 35.197.87.248 was refused - did you specify the right host or port?

However, nc -u does work after the node pool switch if run from the node itself:

gcloud ssh ...
toolbox
nc -u 127.0.0.1 7424

markmandel commented 5 years ago

After the node pool switch all GameServers restarted, but the game-server-firewall setting is missing, so nc -u does not work against the GameServers after the switch.

A new node pool will need to be given the firewall tag; it won't be included automatically, so I don't think that part is a bug.

So apart from that, does it work?

aLekSer commented 5 years ago

@markmandel With the "agones-system" and "default" node pools separated, fleets restart properly. If we add a new node pool and then delete "default", all GameServers get restarted on the new nodes. I think this bug does not happen unless the "agones-system" node pool itself is being restarted.

roberthbailey commented 5 years ago

@aLekSer - based on your last update, it sounds like Mark's guess above is likely correct: the problem occurs when the agones controller is down.

What I don't understand is why it wouldn't fix itself once the controller came back up. With a level triggered system (see thockin's nice presentation here) it shouldn't be an issue if a single "event" is missed; the controller should look at the current state when it comes up and make it match the desired state.
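
As a sketch of what that level-triggered repair would look like here (names and types are illustrative stand-ins, not the Agones code), the sync handler reads the current state from the lister cache instead of depending on the delete event itself:

package gameservers

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/labels"
    corelisters "k8s.io/client-go/listers/core/v1"
)

// GameServer is a stand-in for the Agones CRD type.
type GameServer struct {
    Namespace, Name string
}

// syncGameServer reconciles one GameServer against the cluster's current
// state. Level triggered: the outcome depends only on what exists now,
// not on which event woke the worker up.
func syncGameServer(gs *GameServer, pods corelisters.PodLister, deleteGS func(*GameServer) error) error {
    list, err := pods.Pods(gs.Namespace).List(labels.Everything())
    if err != nil {
        return err
    }
    for _, pod := range list {
        owner := metav1.GetControllerOf(pod)
        if owner != nil && owner.Kind == "GameServer" && owner.Name == gs.Name {
            return nil // the backing Pod still exists, nothing to repair
        }
    }
    // The backing Pod is gone: remove the GameServer too, even if the
    // controller was down when the Pod disappeared and never saw the event.
    return deleteGS(gs)
}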

aLekSer commented 5 years ago

@roberthbailey I'm not quite sure about the root cause yet.

markmandel commented 5 years ago

Now that #1008 is written, I think we can close this, as we now give advice on how to perform upgrades that mitigates this issue (which seems to mostly be a race condition).

Also, the advice to set up separate node pools in production seems to resolve it as well.