Agones Fleet/Gameserver on GKE Autopilot that requests a GPU fails to create

kennycoder commented 1 year ago

What happened: Unable to start a fleet or a gameserver that has

          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4 # (or any other gpu)

in the spec on the GKE Autopilot cluster.

Digging through the agones-controller logs I can find the following:

{
   "message":"error creating Pod for GameServer sd-agones-fleet-xtcj6-j57cj: admission webhook \"warden-validating.common-webhooks.networking.gke.io\" denied the request: GKE Warden rejected the request because it violates one or more constraints.\nViolations details: {\"[denied by autogke-pod-limit-constraints]\":[\"container 'agones-gameserver-sidecar' ephemeral-storage requests '171798691' is lower than the Autopilot minimum required of '512Mi'.\",\"container 'simple-game-server' ephemeral-storage requests '171798691' is lower than the Autopilot minimum required of '512Mi'.\"]}\nRequested by user: 'system:serviceaccount:agones-system:agones-controller', groups: 'system:serviceaccounts,system:serviceaccounts:agones-system,system:authenticated'.",
   "severity":"error",
   "stack":[
      "agones.dev/agones/pkg/gameservers.(*Controller).createGameServerPod\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:613",
      "agones.dev/agones/pkg/gameservers.(*Controller).syncGameServerCreatingState\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:506",
      "agones.dev/agones/pkg/gameservers.(*Controller).syncGameServer\n\t/go/src/agones.dev/agones/pkg/gameservers/controller.go:404",
      "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).processNextWorkItem\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:182",
      "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).runWorker\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:158",
      "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).run.func1\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:217",
      "k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:157",
      "k8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:158",
      "k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:135",
      "k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:92",
      "agones.dev/agones/pkg/util/workerqueue.(*WorkerQueue).run\n\t/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:217",
      "runtime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"
   ],
   "time":"2023-10-02T18:01:54.805970304Z"
}

However, that shouldn't be needed. I even overrode the requests manually; in the output of kubectl describe gs sd-agones-fleet-xtcj6-j57cj you can see the following:

Spec:
      Containers:
        Image:  <my sidecar image>
        Name:   simple-game-server
        Resources:
          Limits:
            Cpu:     20m
            Memory:  64Mi
          Requests:
            Cpu:                  20m
            Ephemeral - Storage:  1Gi
            Memory:               64Mi
        Command:
          sleep
          1200
        Env:
          Name:             SAFETENSORS_FAST_GPU
          Value:            1
          Name:             LD_PRELOAD
          Value:            /usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
        Image:              <my image>
        Image Pull Policy:  Always
        Name:               stable-diffusion-webui
        Resources:
          Limits:
            nvidia.com/gpu:  1
          Requests:
            Ephemeral - Storage:  1Gi
        Volume Mounts:
          Mount Path:  /sd_dir
          Name:        stable-diffusion-storage
      Node Selector:
        cloud.google.com/gke-accelerator:  nvidia-tesla-t4
      Volumes:
        Name:  stable-diffusion-storage
        Persistent Volume Claim:
          Claim Name:  gcs-fuse-csi-static-pvc
Status:
  Address:    
  Addresses:  <nil>
  Eviction:
    Safe:              Never
  Immutable Replicas:  1
  Node Name:           
  Players:             <nil>
  Ports:               <nil>
  Reserved Until:      <nil>
  State:               Creating
Events:
  Type     Reason          Age                From                   Message
  ----     ------          ----               ----                   -------
  Normal   PortAllocation  14s                gameserver-controller  Port allocated
  Warning  Creating        2s (x24 over 13s)  gameserver-controller  error creating Pod for GameServer sd-agones-fleet-zrlqk-zl8jx

What you expected to happen: Gameserver would be created

How to reproduce it (as minimally and precisely as possible): Try adding nodeSelector with gke-accelerator set for the workload. Has to be on Autopilot.

Anything else we need to know?:

Environment: GKE Autopilot

Agones version: 1.35.0
Kubernetes version (use kubectl version): Client Version: v1.27.1 Kustomize Version: v5.0.1 Server Version: v1.27.3-gke.100
Cloud provider or hardware configuration: Google Cloud
Install method (yaml/helm): YAML
Troubleshooting guide log(s): See above
Others: N/A

markmandel commented 1 year ago

creating Pod for GameServer sd-agones-fleet-xtcj6-j57cj: admission webhook \"warden-validating.common-webhooks.networking.gke.io\" denied the request: GKE Warden rejected the request because it violates one or more constraints.\nViolations details: {\"[denied by autogke-pod-limit-constraints]\":[\"container 'agones-gameserver-sidecar' ephemeral-storage requests '171798691' is lower than the Autopilot minimum required of '512Mi'.\",\"container 'simple-game-server' ephemeral-storage requests '171798691' is lower than the Autopilot minimum required of '512Mi'.\"]}\nRequested by user: 'system:serviceaccount:agones-system:agones-controller', groups: 'system:serviceaccounts,system:serviceaccounts:agones-system,system:authenticated'.",

These seem like the relevant bits. @zmerlynn, this looks like something you might be interested in.

zmerlynn commented 1 year ago

@kennycoder I just tried the following fleet on Autopilot 1.28 and had no issues:

apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 1
  template:
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-accelerator: nvidia-tesla-t4
          containers:
            - name: simple-game-server
              image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.18
              resources:
                requests:
                  memory: 64Mi
                  cpu: 20m
                  nvidia.com/gpu: 1
                limits:
                  memory: 64Mi
                  cpu: 20m
                  nvidia.com/gpu: 1

I wonder if it's actually the PersistentVolumeClaim doing something weird?

I have a 1.27 cluster, I'll try there next.

zmerlynn commented 1 year ago

I have no problem on 1.27 either.

I'm trying to replicate your spec more closely, but I noticed in your describe output above that you have a sidecar mentioned in the GameServer as well. We do our best to default the primary game server container, but if you're running an additional sidecar, aside from the Agones SDK sidecar, you may need to specify more of its resources yourself to satisfy Autopilot's constraints.

To help this along, can you supply the full YAML for the GameServer? That may help me replicate it, if you still need help after the comment above.

kennycoder commented 1 year ago

Apologies for the late reply. After some debugging I found the cause: if you use gcsfuse for PersistentVolumes / PersistentVolumeClaims (gcsfuse.csi.storage.gke.io), Pod creation fails with that error. Removing the Volumes that rely on gcsfuse fixes the issue. I'm not sure whether this is related to GKE Autopilot or to Agones, so I'll test outside of the Agones ecosystem and report back. Thanks for your help.

kennycoder commented 1 year ago

OK, reporting back: taking the gke-gcsfuse/ephemeral-storage-limit: 50Gi annotation example from GCP's official documentation, I observed that this value is interpreted as a more or less arbitrary number, "193273528", which generates that error. Closing, since this doesn't look Agones-related. Thanks.
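
For readers who want to try the same test outside of Agones, here is a minimal, untested sketch of a plain Pod that combines the GPU node selector, the gcsfuse-backed PersistentVolumeClaim, and the annotation mentioned above. It is not the manifest actually used in this thread. The container name, command, mount path, volume name, and claim name are taken from the describe output earlier in the issue. The Pod name is hypothetical, the image is a placeholder, and the gke-gcsfuse/volumes annotation is an assumption based on how the GKE Cloud Storage FUSE CSI driver is normally enabled. Service account and bucket access setup are omitted.

```yaml
# Untested sketch of a standalone (non-Agones) Pod for reproducing the issue.
apiVersion: v1
kind: Pod
metadata:
  name: gcsfuse-gpu-test            # hypothetical name
  annotations:
    gke-gcsfuse/volumes: "true"     # assumption: enables gcsfuse sidecar injection
    gke-gcsfuse/ephemeral-storage-limit: 50Gi   # value quoted in the comment above
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
    - name: stable-diffusion-webui
      image: "<my image>"           # placeholder, as in the describe output
      command: ["sleep", "1200"]
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          ephemeral-storage: 1Gi
      volumeMounts:
        - name: stable-diffusion-storage
          mountPath: /sd_dir
  volumes:
    - name: stable-diffusion-storage
      persistentVolumeClaim:
        claimName: gcs-fuse-csi-static-pvc
```

If the admission webhook rejects this Pod with the same ephemeral-storage message, that would suggest the behavior is independent of Agones, which matches the conclusion above.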

zmerlynn commented 1 year ago

@kennycoder Oh, good find! FWIW, it's easier to create the pure GameServer equivalent of the Fleet and then send that through kubectl; you should see the same issue. I'm happy to help debug if you can share more detailed repro steps (i.e. a manifest, etc.).

As a possibility, does gcsfuse add a sidecar of its own? I wouldn't be surprised, given the nature of the driver. It could then be related to some interaction between the Agones, gcsfuse, and Autopilot webhooks and their ordering, since we also insert a sidecar.
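
As an illustration of the "pure GameServer equivalent" idea, the Fleet posted earlier in this thread could be reduced to a standalone GameServer roughly as follows. This is a sketch derived from that Fleet, not a manifest anyone in the thread confirmed running. The generateName value is an assumption chosen for illustration.

```yaml
# Sketch: standalone GameServer mirroring the Fleet's inner template.
apiVersion: agones.dev/v1
kind: GameServer
metadata:
  generateName: simple-game-server-   # assumed prefix; a unique suffix is appended
spec:
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
        - name: simple-game-server
          image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.18
          resources:
            requests:
              memory: 64Mi
              cpu: 20m
              nvidia.com/gpu: 1
            limits:
              memory: 64Mi
              cpu: 20m
              nvidia.com/gpu: 1
```

Because the manifest uses generateName, it would be submitted with kubectl create -f rather than kubectl apply -f, and the result inspected with kubectl describe gs as shown earlier in the issue. Adding the gcsfuse volume and annotation to this template should then show whether the webhook rejection appears without a Fleet involved.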

zmerlynn commented 1 year ago

Oh, even better! Glad you figured it out!

googleforgames / agones

Agones Fleet/Gameserver on GKE Autopilot that requests a GPU fails to create #3411