googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

GameServer stuck on state Scheduled when Pod failed with reason OutOfpods #2683

Closed katsew closed 4 months ago

katsew commented 2 years ago

What happened:

Agones did not create a new Pod when a Pod failed with reason OutOfpods, and the GameServer stayed stuck in the Scheduled state.

What you expected to happen:

Agones is expected to create a new Pod for the GameServer if its Pod fails with reason OutOfpods.

How to reproduce it (as minimally and precisely as possible):

  1. Put the following manifest in /etc/kubernetes/manifests/static-pod.manifest on the test node.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: kube-system
  labels:
    component: nginx
    tier: node
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    ports:
    - containerPort: 80
    resources:
      requests:
        cpu: 100m
  priorityClassName: system-node-critical
  priority: 2000001000
  tolerations:
  - effect: NoExecute
    operator: Exists
  - effect: NoSchedule
    operator: Exists
  2. Set the Fleet replicas to the Pod capacity of the node.
  3. Confirm that some of the GameServer Pods are stuck in the Pending state.
  4. Forcibly delete the static Pod created in step (1) from kube-system.
    • kubectl delete pod --force --grace-period=0 <static-pod-name> -n kube-system

All GameServer Pods that were stuck in the Pending state then fail with reason OutOfpods.

Anything else we need to know?:

Here is the Pod status from my reproduction.

status:
  message: 'Pod Node didn''t have enough resource: pods, requested: 1, used: 32, capacity:
    32'
  phase: Failed
  reason: OutOfpods
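
To make the failing condition concrete, here is a minimal Python sketch of how such Pods could be picked out of a Pod list. It is only an illustration of the check (phase Failed plus reason OutOfpods), not Agones code and not how the controller is implemented. The function names and the sample Pod names are hypothetical; the input is assumed to have the shape of `kubectl get pods -o json` output.

```python
from typing import Any, Dict, List


def is_out_of_pods(pod: Dict[str, Any]) -> bool:
    """Return True if a Pod object (parsed JSON) has phase Failed and reason OutOfpods."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "OutOfpods"


def out_of_pods_names(pod_list: Dict[str, Any]) -> List[str]:
    """Collect the names of all OutOfpods Pods from a Pod list object."""
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if is_out_of_pods(item)
    ]


# Sample data shaped like the status in this report; the Pod names are made up.
sample = {
    "items": [
        {
            "metadata": {"name": "gs-a"},
            "status": {"phase": "Failed", "reason": "OutOfpods"},
        },
        {
            "metadata": {"name": "gs-b"},
            "status": {"phase": "Running"},
        },
        {
            "metadata": {"name": "gs-c"},
            "status": {"phase": "Pending"},
        },
    ]
}

print(out_of_pods_names(sample))
```

With real data, the `sample` dictionary would be replaced by the result of `json.load` on the saved output of `kubectl get pods -o json`.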

I created the Fleet from the official documentation.

Environment:

unlightable commented 7 months ago

Although this no longer occurs for us, it is still an issue worth solving. (Commenting to prevent auto-closing.)

markmandel commented 7 months ago

Rather than tackling this specifically, I'm thinking we actually wait on https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/ and implement that once it's available on all supported clusters -- which I believe should solve all these edge cases.

markmandel commented 7 months ago

PTAL at:

Let me know if you have any feedback.

github-actions[bot] commented 6 months ago

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add 'awaiting-maintainer' label or add a comment. Thank you for your contributions

github-actions[bot] commented 4 months ago

This issue is marked as obsolete due to inactivity for last 60 days. To avoid issue getting closed in next 30 days, please add a comment or add 'awaiting-maintainer' label. Thank you for your contributions