googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0
6.04k stars 800 forks source link

bad flakiness in e2e tests on Autopilot 1.29 #3746

Closed markmandel closed 2 months ago

markmandel commented 5 months ago

Seems to be pretty consistent on these tests:

time="2024-04-05 17:26:01.545" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc000785730 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:03.546" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc00047cfb0 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:05.545" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc00059f890 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:07.544" level=info msg="Checking Fleet Ready replicas" expected=5 fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:0xc000785bd0 Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.155" level=info msg="error waiting for fleet condition, dumping Fleet and Gameserver data" error="context deadline exceeded" fleet=simple-fleet-1.0l4pgd fleetStatus="{Replicas:0 ReadyReplicas:0 ReservedReplicas:0 AllocatedReplicas:0 Players:<nil> Counters:map[] Lists:map[]}" test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.155" level=info msg="Dumping Events:" fleet=simple-fleet-1.0l4pgd kind= test=TestAllocatorWithSelectors
time="2024-04-05 17:26:09.292" level=info msg="Event!" fleet=simple-fleet-1.0l4pgd lastTimestamp="2024-04-05 17:16:09 +0000 UTC" message="Created GameServerSet simple-fleet-1.0l4pgd-qzg65" reason=CreatingGameServerSet test=TestAllocatorWithSelectors type=Normal
    framework.go:346: 
            Error Trace:    /go/src/agones.dev/agones/test/e2e/framework/framework.go:346
                                        /go/src/agones.dev/agones/test/e2e/allocator_test.go:172
            Error:          Received unexpected error:
                            context deadline exceeded
            Test:           TestAllocatorWithSelectors
            Messages:       error waiting for fleet condition on fleet: simple-fleet-1.0l4pgd
--- FAIL: TestAllocatorWithSelectors (602.32s)

Also

time="2024-04-05 17:06:04.671" level=info msg="ServiceAccount 1712336764/agones-sdk is created"
time="2024-04-05 17:06:04.846" level=info msg="RoleBinding 1712336764/agones-sdk-access is created"
panic: test timed out after 30m0s
running tests:
    TestRestAllocatorWithDeprecatedRequired (9m55s)

goroutine 964 [running]:
testing.(*M).startAlarm.func1()
    /usr/local/go/src/testing/testing.go:2259 +0x259
created by time.goFunc
    /usr/local/go/src/time/sleep.go:176 +0x45

goroutine 1 [chan receive, 11 minutes]:
testing.(*T).Run(0xc000105860, {0x2a4a031, 0x27}, 0x2b1a110)
    /usr/local/go/src/testing/testing.go:1649 +0x871
testing.runTests.func1(0x0?)
    /usr/local/go/src/testing/testing.go:2054 +0x85
testing.tRunner(0xc000105860, 0xc000821868)
    /usr/local/go/src/testing/testing.go:1595 +0x262
testing.runTests(0xc0001aae60?, {0x3c47920, 0x61, 0x61}, {0xc0008218e0?, 0x441c74?, 0x3c59f20?})
    /usr/local/go/src/testing/testing.go:2052 +0x8ae
testing.(*M).Run(0xc0001aae60)
    /usr/local/go/src/testing/testing.go:1925 +0xcd8
agones.dev/agones/test/e2e.TestMain(0x10c20f37eb6e55c0?)
    /go/src/agones.dev/agones/test/e2e/main_test.go:94 +0xaa5
main.main()
    _testmain.go:241 +0x308

goroutine 656 [select]:
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext({0x2db0398, 0xc00028c380}, {0x2da2678?, 0xc00012e240}, 0x1, 0x0, 0x0?)
    /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/loop.go:91 +0x3d3
k8s.io/apimachinery/pkg/util/wait.PollUntilContextTimeout({0x2db0248, 0x3c8a900}, 0x77359400, 0x1?, 0x1?, 0x5?)
    /go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:48 +0xd9
agones.dev/agones/test/e2e/framework.(*Framework).WaitForFleetCondition(0xc0003f4ab0, 0xc000524b60, 0xc0001df200, 0xc000aa1d40)
    /go/src/agones.dev/agones/test/e2e/framework/framework.go:353 +0x365
agones.dev/agones/test/e2e/framework.(*Framework).AssertFleetCondition(0x2d95540?, 0xc000524b60?, 0xc0001df200, 0x0?)
    /go/src/agones.dev/agones/test/e2e/framework/framework.go:345 +0x5a
agones.dev/agones/test/e2e.TestRestAllocatorWithDeprecatedRequired(0xc000524b60)
    /go/src/agones.dev/agones/test/e2e/allocator_test.go:252 +0x325
testing.tRunner(0xc000524b60, 0x2b1a110)
    /usr/local/go/src/testing/testing.go:1595 +0x262
created by testing.(*T).Run in goroutine 1
    /usr/local/go/src/testing/testing.go:1648 +0x846

Retry might be working in some cases, but this panic is definitely stopping retries -- and 30m is a long time for a 5 GameServer fleet not to spin up. It's very odd.

Some links to failed builds:

markmandel commented 5 months ago

Just looking in the cluster, I do see:

markmandel@cloudshell:~ (agones-images)$ kubectl get ns
NAME                       STATUS        AGE
1711579203                 Terminating   8d
1711696994                 Terminating   7d10h
1711795766                 Terminating   6d7h
1712076562                 Terminating   3d1h
1712340564                 Active        6m34s
agones-system              Active        17d
default                    Active        43d
gke-gmp-system             Active        43d
gke-managed-cim            Active        43d
gke-managed-filestorecsi   Active        43d
gke-managed-system         Active        43d
gmp-public                 Active        43d
kube-node-lease            Active        43d
kube-public                Active        43d
kube-system                Active        43d

...which is weird.

markmandel@cloudshell:~ (agones-images)$ kubectl describe ns 1711579203
Name:         1711579203
Labels:       kubernetes.io/metadata.name=1711579203
              owner=e2e-test
Annotations:  <none>
Status:       Terminating
Conditions:
  Type                                         Status  LastTransitionTime               Reason                 Message
  ----                                         ------  ------------------               ------                 -------
  NamespaceDeletionDiscoveryFailure            False   Fri, 05 Apr 2024 16:43:44 +0000  ResourcesDiscovered    All resources successfully discovered
  NamespaceDeletionGroupVersionParsingFailure  False   Wed, 27 Mar 2024 23:07:22 +0000  ParsedGroupVersions    All legacy kube types successfully parsed
  NamespaceDeletionContentFailure              True    Wed, 27 Mar 2024 23:07:22 +0000  ContentDeletionFailed  Failed to delete all resource types, 1 remaining: unexpected items still remain in namespace: 1711579203 for gvr: /v1, Resource=pods
  NamespaceContentRemaining                    True    Wed, 27 Mar 2024 23:07:22 +0000  SomeResourcesRemain    Some resources are remaining: gameservers.agones.dev has 12 resource instances, pods. has 12 resource instances
  NamespaceFinalizersRemaining                 True    Wed, 27 Mar 2024 23:07:22 +0000  SomeFinalizersRemain   Some content in the namespace has finalizers remaining: agones.dev in 12 resource instances
markmandel@cloudshell:~ (agones-images)$ kubectl get pods -n 1711579203
NAME                          READY   STATUS        RESTARTS   AGE
preferred-w4gnw-dkjjx-5s6bs   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-8h7wx   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-bkd7m   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-c4bdk   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-dj84f   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-gjhw7   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-npfvd   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-qnp8n   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-sl5qm   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-tmdqt   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-tx5wf   0/2     Terminating   0          8d
preferred-w4gnw-dkjjx-wcs2m   0/2     Terminating   0          8d
markmandel@cloudshell:~ (agones-images)$ kubectl get pod -n 1711579203 preferred-w4gnw-dkjjx-5s6bs -o yaml 
apiVersion: v1
kind: Pod
metadata:
  annotations:
    agones.dev/container: game-server
    agones.dev/ready-container-id: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
    agones.dev/sdk-version: 1.40.0-dev-5c2e964
    autopilot.gke.io/host-port-assignment: '{"min":7000,"max":8000,"portsAssigned":{"1":7240}}'
    autopilot.gke.io/primary-container: game-server
    autopilot.gke.io/resource-adjustment: '{"input":{"containers":[{"requests":{"cpu":"30m"},"name":"agones-gameserver-sidecar"},{"limits":{"cpu":"30m","memory":"32Mi"},"requests":{"cpu":"30m","memory":"32Mi"},"name":"game-server"}]},"output":{"containers":[{"limits":{"ephemeral-storage":"1Gi"},"requests":{"cpu":"30m","ephemeral-storage":"1Gi","memory":"2Gi"},"name":"agones-gameserver-sidecar"},{"limits":{"cpu":"470m","ephemeral-storage":"1Gi","memory":"62Mi"},"requests":{"cpu":"470m","ephemeral-storage":"1Gi","memory":"62Mi"},"name":"game-server"}]},"modified":true}'
    autopilot.gke.io/warden-version: 2.9.20
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  creationTimestamp: "2024-03-27T22:54:37Z"
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2024-03-27T23:06:09Z"
  labels:
    agones.dev/gameserver: preferred-w4gnw-dkjjx-5s6bs
    agones.dev/role: gameserver
    agones.dev/safe-to-evict: "false"
  name: preferred-w4gnw-dkjjx-5s6bs
  namespace: "1711579203"
  ownerReferences:
  - apiVersion: agones.dev/v1
    blockOwnerDeletion: true
    controller: true
    kind: GameServer
    name: preferred-w4gnw-dkjjx-5s6bs
    uid: ec61fed6-63c7-4c80-9a85-9e7dc70d5015
  resourceVersion: "47363486"
  uid: d608d658-758d-4a07-960b-7ada2526a6ad
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchLabels:
              agones.dev/role: gameserver
          topologyKey: kubernetes.io/hostname
        weight: 100
  containers:
  - args:
    - --grpc-port=9357
    - --http-port=9358
    env:
    - name: GAMESERVER_NAME
      value: preferred-w4gnw-dkjjx-5s6bs
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: FEATURE_GATES
      value: CountsAndLists=true&DisableResyncOnSDKServer=false&Example=true&FleetAllocationOverflow=false&GKEAutopilotExtendedDurationPods=true&PlayerAllocationFilter=true&PlayerTracking=true
    - name: LOG_LEVEL
      value: Info
    image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.40.0-dev-5c2e964
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 3
      periodSeconds: 3
      successThreshold: 1
      timeoutSeconds: 1
    name: agones-gameserver-sidecar
    resources:
      limits:
        ephemeral-storage: 1Gi
      requests:
        cpu: 30m
        ephemeral-storage: 1Gi
        memory: 2Gi
    securityContext:
      capabilities:
        drop:
        - NET_RAW
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-8jf5m
      readOnly: true
  - env:
    - name: AGONES_SDK_GRPC_PORT
      value: "9357"
    - name: AGONES_SDK_HTTP_PORT
      value: "9358"
    image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.28
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /gshealthz
        port: 8080
        scheme: HTTP
      initialDelaySeconds: 5
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: game-server
    ports:
    - containerPort: 7654
      hostPort: 7240
      protocol: UDP
    resources:
      limits:
        cpu: 470m
        ephemeral-storage: 1Gi
        memory: 62Mi
      requests:
        cpu: 470m
        ephemeral-storage: 1Gi
        memory: 62Mi
    securityContext:
      capabilities:
        drop:
        - NET_RAW
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: empty
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: preferred-w4gnw-dkjjx-5s6bs
  nodeName: gk3-gke-autopilot-e2e-te-nap-qpzyqh5z-032f3139-lcj6
  nodeSelector:
    cloud.google.com/extended-duration-pods: "0"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: gke.io/optimize-utilization-scheduler
  securityContext:
    seccompProfile:
      type: Unconfined
  serviceAccount: agones-sdk
  serviceAccountName: agones-sdk
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: kubernetes.io/arch
    operator: Equal
    value: amd64
  volumes:
  - emptyDir: {}
    name: empty
  - name: kube-api-access-8jf5m
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T23:10:10Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T22:57:02Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T23:10:10Z"
    message: 'containers with unready status: [agones-gameserver-sidecar game-server]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T23:10:10Z"
    message: 'containers with unready status: [agones-gameserver-sidecar game-server]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-27T22:57:02Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://72f4fa8e6a90b882fe3be9eac808c0beed140feb526b22790f446fac46505c7a
    image: us-docker.pkg.dev/agones-images/ci/agones-sdk:1.40.0-dev-5c2e964
    imageID: us-docker.pkg.dev/agones-images/ci/agones-sdk@sha256:9aa78a749ccdf2b9b6c05484066e2d6f58e40843e87b218ec17cade8c04e267d
    lastState: {}
    name: agones-gameserver-sidecar
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://72f4fa8e6a90b882fe3be9eac808c0beed140feb526b22790f446fac46505c7a
        exitCode: 137
        finishedAt: "2024-03-27T23:06:09Z"
        reason: Error
        startedAt: "2024-03-27T22:57:12Z"
  - containerID: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
    image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.28
    imageID: us-docker.pkg.dev/agones-images/examples/simple-game-server@sha256:81d04419aa309fdb553a13be0614540da64a7df2ff9701f38a35c68398fe92e8
    lastState: {}
    name: game-server
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://f39362db1ec461c97ae03e1ca5d20df5d946c107d425ee1dd20bdebac57ffc78
        exitCode: 0
        finishedAt: "2024-03-27T23:05:39Z"
        reason: Completed
        startedAt: "2024-03-27T22:57:17Z"
  hostIP: 10.132.15.220
  hostIPs:
  - ip: 10.132.15.220
  phase: Running
  podIP: 10.64.134.69
  podIPs:
  - ip: 10.64.134.69
  qosClass: Burstable
  startTime: "2024-03-27T22:57:02Z"

So the pods are stuck for some reason?

markmandel commented 5 months ago

kubectl delete pod --all --force --grace-period=0 -n <namespace> did work to cleanup the Pods though.

markmandel commented 5 months ago

I did recently notice:

markmandel@cloudshell:~ (agones-images)$ kubectl get crds | grep agones
fleetautoscalers.autoscaling.agones.dev                2024-04-03T07:47:43Z
fleets.agones.dev                                      2024-04-03T07:47:42Z
gameserverallocationpolicies.multicluster.agones.dev   2024-04-03T07:47:43Z
gameserversets.agones.dev                              2024-04-03T07:47:44Z

But that seems to be after cleanup happened? GameServers where definitely there before.

Can CRDs go into a Terminating state?

zmerlynn commented 5 months ago

Tracking this in an internal bug, will report back.

markmandel commented 2 months ago

Resolved. Will close.