googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Stale GroupVersion discovery: allocation.agones.dev/v1 (Namespace deletion) #3172

Closed by scrayos 1 year ago

scrayos commented 1 year ago

What happened: Every namespace deletion gets stuck in Terminating, even though Agones is still installed and the allocation.agones.dev/v1 API is available. The namespaces stay in this state indefinitely and can only be deleted by manually removing the namespace's finalizer. The cause shows up as a NamespaceDeletionDiscoveryFailure condition in the namespace status:

status:
  phase: Terminating
  conditions:
    - type: NamespaceDeletionDiscoveryFailure
      status: 'True'
      lastTransitionTime: '2023-05-19T18:41:31Z'
      reason: DiscoveryFailed
      message: >-
        Discovery failed for some groups, 1 failing: unable to retrieve the
        complete list of server APIs: allocation.agones.dev/v1: stale
        GroupVersion discovery: allocation.agones.dev/v1
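
The condition shown above can also be detected programmatically. The following is a minimal sketch, assuming the namespace object has been exported as JSON (for example with `kubectl get namespace <name> -o json`). The helper name `discovery_failures` and the sample document are hypothetical and only illustrate the shape of the status.

```python
import json

CONDITION_TYPE = "NamespaceDeletionDiscoveryFailure"


def discovery_failures(namespace: dict) -> list[str]:
    """Return the messages of all active discovery-failure conditions."""
    conditions = (namespace.get("status") or {}).get("conditions") or []
    return [
        cond.get("message", "")
        for cond in conditions
        if cond.get("type") == CONDITION_TYPE and cond.get("status") == "True"
    ]


# Illustrative sample, abbreviated from the status reported in this issue.
sample = json.loads("""
{
  "status": {
    "phase": "Terminating",
    "conditions": [
      {
        "type": "NamespaceDeletionDiscoveryFailure",
        "status": "True",
        "reason": "DiscoveryFailed",
        "message": "stale GroupVersion discovery: allocation.agones.dev/v1"
      },
      {
        "type": "NamespaceDeletionContentFailure",
        "status": "False",
        "reason": "ContentDeleted",
        "message": "All content successfully deleted"
      }
    ]
  }
}
""")

for message in discovery_failures(sample):
    print(message)
```

An empty result means the namespace is not blocked by API discovery, so a stuck deletion would have a different cause.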

The apiserver log reports DiscoveryManager: Failed to download discovery for agones-system/agones-controller-service:443: 404 404 page not found.

I've looked at the Kubernetes implementation, and this error is thrown here. Any error returned by this method ends up in the NamespaceDeletionDiscoveryFailure condition, reported as stale GroupVersion discovery.

Querying the webhooks port of the agones-controller-service confirms this: /apis does indeed return 404.

What you expected to happen: I'd expect namespaces to delete normally, even with Agones installed.

How to reproduce it (as minimally and precisely as possible):

  1. Install the Helm Chart with the values provided below
  2. Create a new namespace
  3. Try to delete this new namespace
  4. Observe it being stuck in Terminating
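
Steps 2 to 4 above can be scripted. This is a rough sketch, assuming `kubectl` is on the PATH, points at a cluster where Agones is already installed with the values below, and that the namespace name `stale-discovery-test` (a hypothetical name) is free. The helper names are also hypothetical.

```python
import subprocess


def repro_commands(namespace: str) -> list[list[str]]:
    """Build the kubectl invocations for steps 2 to 4 of the reproduction."""
    return [
        ["kubectl", "create", "namespace", namespace],
        # --wait=false returns immediately instead of blocking on deletion.
        ["kubectl", "delete", "namespace", namespace, "--wait=false"],
        ["kubectl", "get", "namespace", namespace, "-o", "jsonpath={.status.phase}"],
    ]


def run_repro(namespace: str = "stale-discovery-test") -> str:
    """Run the commands in order and return the output of the last one.

    On an affected cluster the last command is expected to print
    "Terminating". If the namespace was already deleted, kubectl exits
    non-zero and subprocess raises CalledProcessError.
    """
    output = ""
    for command in repro_commands(namespace):
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        output = result.stdout.strip()
    return output
```

Calling `run_repro()` a few seconds apart gives a quick signal: a namespace that keeps reporting `Terminating` matches the behavior described here.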

Anything else we need to know?:

Helm-Values:

agones:
  allocator:
    install: false
  controller:
    healthCheck:
      failureThreshold: 3
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 1
    persistentLogs: false
  featureGates: PlayerTracking=true&SDKGracefulTermination=true&StateAllocationFilter=true
  metrics:
    serviceMonitor:
      enabled: true
  ping:
    install: false
gameservers:
  namespaces:
  - minecraft

Environment:

markmandel commented 1 year ago

Thanks for the bug! Looks like something has changed in recent versions of Kubernetes. I'll take a look -- I wanted to look at this anyway for another reason.

markmandel commented 1 year ago

Just noting I'm seeing this consistently within 1.27.x clusters. The test namespaces aren't deleting.

markmandel commented 1 year ago

I've also filed this as a bug with Kubernetes: https://github.com/kubernetes/kubernetes/issues/119662 - since the new feature (Aggregated Discovery) seems to break backward compatibility.

zifter commented 5 months ago

I ran into this issue after upgrading Agones from release 1.34 to 1.41. The cause was a misconfiguration: I had accidentally turned off the caBundle for extensions, which caused this behavior.

agones:
  extensions:
    allocationApiService:
      disableCaBundle: true

So, if you need to disable it, you have to manage your own certs for that.
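
A pre-upgrade check for this misconfiguration could look like the sketch below. It assumes the Helm values have already been loaded into a plain dictionary (for example with a YAML parser) and only inspects the key path shown above. The function name `ca_bundle_disabled` is hypothetical.

```python
KEY_PATH = ("agones", "extensions", "allocationApiService", "disableCaBundle")


def ca_bundle_disabled(values: dict) -> bool:
    """Return True only if disableCaBundle is explicitly set to true."""
    node = values
    for key in KEY_PATH:
        # A missing key or a non-mapping level means the option is not set.
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node is True


# Values equivalent to the snippet above.
risky = {"agones": {"extensions": {"allocationApiService": {"disableCaBundle": True}}}}

if ca_bundle_disabled(risky):
    print("disableCaBundle is set: certificates for the APIService must be managed manually")
```

Treating a missing key as "not disabled" is an assumption of this sketch, not a statement about the chart's defaults.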