googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

agones-system gets stuck in "Terminating" #1778

Closed domgreen closed 4 years ago

domgreen commented 4 years ago

What happened:

When deleting the agones-system namespace it got stuck in the Terminating state.

What you expected to happen:

It manages to successfully terminate the namespace without manual intervention.

How to reproduce it (as minimally and precisely as possible):

Not 100% sure what, if anything, special happened in the cluster to make it get stuck in Terminating, but in general:

Anything else we need to know?: Some commands I used to get it to delete:

kubectl get ns

NAME              STATUS        AGE
agones-system     Terminating   4d

kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n agones-system

error: unable to retrieve the complete list of server APIs: allocation.agones.dev/v1: the server is currently unable to handle the request

kubectl get ns agones-system -o json | jq

{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "creationTimestamp": "...",
    "deletionTimestamp": "...",
    "name": "agones-system",
    "resourceVersion": "15278949",
    "selfLink": "/api/v1/namespaces/agones-system",
    "uid": "..."
  },
  "spec": {
    "finalizers": [
      "kubernetes"
    ]
  },
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "...",
        "message": "Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: allocation.agones.dev/v1: the server is currently unable to handle the request",
        "reason": "DiscoveryFailed",
        "status": "True",
        "type": "NamespaceDeletionDiscoveryFailure"
      },
      {
        "lastTransitionTime": "...",
        "message": "All legacy kube types successfully parsed",
        "reason": "ParsedGroupVersions",
        "status": "False",
        "type": "NamespaceDeletionGroupVersionParsingFailure"
      },
      {
        "lastTransitionTime": "...",
        "message": "All content successfully deleted",
        "reason": "ContentDeleted",
        "status": "False",
        "type": "NamespaceDeletionContentFailure"
      }
    ],
    "phase": "Terminating"
  }
}

kubectl delete apiservice -n agones-system v1.allocation.agones.dev

warning: deleting cluster-scoped resources, not scoped to the provided namespace
apiservice.apiregistration.k8s.io "v1.allocation.agones.dev" deleted

Finally, I followed this guide to remove the namespace: https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.1/troubleshoot/ns_terminating.html
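The discovery error above suggests an aggregated APIService whose backend is no longer reachable. A minimal sketch for spotting such entries follows. It assumes the default `kubectl get apiservices` table layout (NAME, SERVICE, AVAILABLE, AGE); the function name `unavailable_apiservices` and the `KUBECTL` override variable are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Agones or kubectl.
# KUBECTL can be overridden (for example in tests); it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Print the names of APIServices whose AVAILABLE column is not "True".
# Assumption: default table columns are NAME SERVICE AVAILABLE AGE.
unavailable_apiservices() {
  "$KUBECTL" get apiservices --no-headers | awk '$3 != "True" { print $1 }'
}
```

Calling `unavailable_apiservices` against the cluster lists candidates to inspect before deciding whether to delete any of them.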

Environment:

aLekSer commented 4 years ago

I am able to reproduce this. Not sure if this is related to the issue, but there are some warnings in events.

kubectl get events
LAST SEEN   TYPE      REASON                   OBJECT                                MESSAGE
4m28s       Warning   FailedToCreateEndpoint   endpoints/agones-allocator            Failed to create endpoint for service agones-system/agones-allocator: endpoints "agones-allocator" is forbidden: unable to create new content in namespace agones-system because it is being terminated
4m50s       Warning   FailedToCreateEndpoint   endpoints/agones-controller-service   Failed to create endpoint for service agones-system/agones-controller-service: endpoints "agones-controller-service" is forbidden: unable to create new content in namespace agones-system because it is being terminated
4m29s       Warning   FailedToCreateEndpoint   endpoints/agones-ping-http-service    Failed to create endpoint for service agones-system/agones-ping-http-service: endpoints "agones-ping-http-service" is forbidden: unable to create new content in namespace agones-system because it is being terminated
4m29s       Warning   FailedToCreateEndpoint   endpoints/agones-ping-udp-service     Failed to create endpoint for service agones-system/agones-ping-udp-service: endpoints "agones-ping-udp-service" is forbidden: unable to create new content in namespace agones-system because it is being terminated

This issue might help in understanding the situation better: https://github.com/kubernetes/kubernetes/issues/70916. I expect Kubernetes 1.16 would give more details in the output of kubectl get ns agones-system (I initially tested with a 1.15 GKE cluster).

aLekSer commented 4 years ago

I installed Agones with the Terraform Helm module (latest master) on GKE 1.16.13-gke.1 and got a different kubectl get ns output:

k get ns agones-system -o yaml
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2020-09-01T16:06:03Z"
  deletionTimestamp: "2020-09-01T16:11:29Z"
  labels:
    name: agones-system
  name: agones-system
  resourceVersion: "3057"
  selfLink: /api/v1/namespaces/agones-system
  uid: 4b3d77b9-8765-40f6-a472-2b74a46e84fe
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2020-09-01T16:11:41Z"
    message: 'Discovery failed for some groups, 1 failing: unable to retrieve the
      complete list of server APIs: allocation.agones.dev/v1: the server is currently
      unable to handle the request'
    reason: DiscoveryFailed
    status: "True"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2020-09-01T16:11:35Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2020-09-01T16:12:05Z"
    message: 'Failed to delete all resource types, 1 remaining: unexpected items still
      remain in namespace: agones-system for gvr: /v1, Resource=pods'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  phase: Terminating

markmandel commented 4 years ago

Couple of questions:

  1. Which namespaces are you creating Agones and the GameServer in?
  2. Do you delete the GameServer before deleting Agones?

domgreen commented 4 years ago

Couple of questions:

  1. Which namespaces are you creating Agones and the GameServer in?
  2. Do you delete the GameServer before deleting Agones?

Nope, I was basically trashing the cluster, so I wasn't being very gentle :worried:

markmandel commented 4 years ago

Hmm. Interesting.

Usually when I've run into this, it's because of a Finaliser issue - but we only set a Finaliser on the GameServer - which is not in the agones-system namespace. :thinking:

aLekSer commented 4 years ago

Well, this bug is about deleting the Agones controller in an unusual way that is not documented on agones.dev: by simply removing the agones-system namespace. You could run kubectl delete -f install.yaml before removing the namespace and it would work.

roberthbailey commented 4 years ago

I think the finalizer in the agones-system namespace is doing the right thing.

You need to uninstall agones before deleting the namespace, because there are CRDs installed with webhooks referencing the namespace where the agones controller is running.
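The ordering described above can be sketched as a small shell function. This is an illustration under assumptions, not an official uninstall procedure: `uninstall_agones` and the `KUBECTL` override variable are hypothetical names, and the manifest path is whichever install.yaml was originally applied.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Agones.
# KUBECTL can be overridden (for example in tests); it defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Delete everything the install manifest created (such as the CRDs and
# webhooks mentioned above) before deleting the namespace itself.
uninstall_agones() {
  local manifest="$1"
  local namespace="${2:-agones-system}"
  # Stop here if removing the manifest's resources fails.
  "$KUBECTL" delete -f "$manifest" --ignore-not-found || return 1
  # Harmless if the manifest already removed the namespace.
  "$KUBECTL" delete namespace "$namespace" --ignore-not-found
}
```

For example: `uninstall_agones ./install.yaml`. Both steps pass `--ignore-not-found`, so rerunning the function after a partial cleanup should not fail on already-deleted resources.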

markmandel commented 4 years ago

You need to uninstall agones before deleting the namespace, because there are CRDs installed with webhooks referencing the namespace where the agones controller is running.

Oooooh! That would make sense actually.

domgreen commented 4 years ago

Yep, makes a lot of sense. Worth adding something to the docs or FAQ?

Will see if I can find a way around it for my use case (terraform destroy).

aLekSer commented 4 years ago

We don't have a section about uninstalling Agones in the Install with YAML section, unlike the Install using Helm section. https://agones.dev/site/docs/installation/install-agones/yaml/

markmandel commented 4 years ago

We don't have a section about uninstalling Agones in the Install with YAML section, unlike the Install using Helm section. https://agones.dev/site/docs/installation/install-agones/yaml/

^ That definitely seems like a good addition!

aLekSer commented 4 years ago

Well, I will create a PR soon. Simply changing agones-system to agones-system2 (and 1.9.0-dev to 1.8.0) in install.yaml was enough to create the Agones controller in a new namespace. (The only issue is that the certificate is valid for agones-controller-service.agones-system.svc, not agones-controller-service.agones-system2.svc.) After these changes, I ran kubectl apply -f ./install.yaml, and then kubectl delete -f ./install.yaml got stuck on:

validatingwebhookconfiguration.admissionregistration.k8s.io "agones-validation-webhook" deleted

However, kubectl delete ns agones-system2 did not time out and was successful.

kubectl get ns agones-system2  -o yaml
apiVersion: v1
kind: Namespace
metadata:
  creationTimestamp: "2020-09-01T19:53:25Z"
  deletionTimestamp: "2020-09-01T19:56:25Z"
  name: agones-system2
  resourceVersion: "64933"
  selfLink: /api/v1/namespaces/agones-system2
  uid: ...
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2020-09-01T19:56:31Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2020-09-01T19:56:31Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2020-09-01T19:56:31Z"
    message: All content successfully deleted
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  phase: Terminating

kubectl get ns agones-system2  -o yaml
Error from server (NotFound): namespaces "agones-system2" not found