GameServer going to Unhealthy before either container has started + Shutdown() fails when unhealthy

pgilfillan commented 2 years ago

What happened: We have a system for starting individual GameServer resources for a match if there is no server from a Fleet available. Occasionally (not every time) these servers get stuck in an Unhealthy state and don't get removed- according to https://agones.dev/site/docs/guides/health-checking/#fleet-management-of-unhealthy-gameservers not getting removed makes sense ("If a GameServer moves into an Unhealthy state when it is not part of a Fleet, the GameServer will remain in the Unhealthy state until explicitly deleted") but the servers becoming Unhealthy to begin with is a problem. It's before either the main or sidecar container is ready:

λ kubectl get gs
NAME                             STATE       ADDRESS                                                 PORT   NODE                                              AGE
main-gs-aas-v299526-test-j49rc   Scheduled   ec2-18-184-219-205.eu-central-1.compute.amazonaws.com   7001   ip-192-171-57-122.eu-central-1.compute.internal   27s

# Gameserver was in the Scheduled state 27 seconds before this and is Unhealthy
λ k get gs
NAME                             STATE       ADDRESS                                                 PORT   NODE                                              AGE
main-gs-aas-v299526-test-j49rc   Unhealthy   ec2-18-184-219-205.eu-central-1.compute.amazonaws.com   7001   ip-192-171-57-122.eu-central-1.compute.internal   54s

# These were run after the above- server is unhealthy even though neither container is ready
λ kubectl get pods
NAME                                        READY   STATUS              RESTARTS   AGE
main-gs-aas-v299526-test-j49rc              0/2     ContainerCreating   0          109s

λ kubectl logs main-gs-aas-v299526-test-j49rc agones-gameserver-sidecar -f
Error from server (BadRequest): container "agones-gameserver-sidecar" in pod "main-gs-aas-v299526-test-j49rc" is waiting to start: ContainerCreating

This is the health config of the GameServer

health:
  failureThreshold: 8
  initialDelaySeconds: 5
  periodSeconds: 5

so it should have taken 5 + 8 * 5 = 45 seconds before it goes to unhealthy but there wasn't enough time for that after it was scheduled.

Interestingly even though the sidecar is ContainerCreating, kubectl describe pod events says that it had started (this was for a different server to above but same problem).

...
Containers:
  agones-gameserver-sidecar:
...
    State:          Waiting
      Reason:       ContainerCreating
...
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m53s  default-scheduler  Successfully assigned enc-main/main-gs-aas-v299526-test-pbx6l to ip-192-171-76-43.eu-central-1.compute.internal
  Normal  Pulled     2m38s  kubelet            Container image "gcr.io/agones-images/agones-sdk:1.19.0" already present on machine
  Normal  Created    2m38s  kubelet            Created container agones-gameserver-sidecar
  Normal  Started    2m38s  kubelet            Started container agones-gameserver-sidecar
  Normal  Pulling    2m38s  kubelet            Pulling image "250943950056.dkr.ecr.eu-central-1.amazonaws.com/argo_daily_test:build.737_cl.299526_Test"

kubectl describe gs event output:

Events:
  Type     Reason          Age   From                   Message
  ----     ------          ----  ----                   -------
  Normal   PortAllocation  3m8s  gameserver-controller  Port allocated
  Normal   Creating        3m8s  gameserver-controller  Pod main-gs-aas-v299526-test-pbx6l created
  Normal   Scheduled       70s   gameserver-controller  Address and port populated
  Warning  Unhealthy       21s   gameserver-sidecar     Health check failure

I think the issue is usually seen when new nodes are spun up so maybe the sidecar container is started in the background (it just doesn't say it's ready and no logs are available) and then the health checks fail because the main server image is still pulling (no cached image on the new node)? It can take a while to pull because of the size.

Another problem is that the server container starts sending health checks properly but remains Unhealthy. It will then shut itself down after a few minutes because it hasn't received a match (logic within the server code), but this fails because it can't change state ("GameServerState already unhealthy. Skipping update").

{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Health Ping Received","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:28:24.37107567Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Health stream closed.","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:28:24.371140588Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Received Shutdown request, adding to queue","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:28:24.3761116Z"}
{"gsKey":"updateState","message":"Enqueuing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:28:24.376169247Z"}
{"gsKey":"updateState","message":"Processing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:28:24.381497061Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Updating state","severity":"debug","source":"*sdkserver.SDKServer","state":"Shutdown","time":"2022-04-21T00:28:24.381530119Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServerState already unhealthy. Skipping update.","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:28:24.381548113Z"}

What you expected to happen: Three things:

The sidecar container is ready (shows 1/2 in kubectl get pods and logs are available) as soon as it's started, not just when the server container is started (assuming that's why it's stuck on ContainerCreating).
The health check waits until after the image is pulled before starting the timer (or until the image pull fails). I'm not sure how technically viable this is but it makes sense to me, I don't think it's great to configure initialDelaySeconds to something like 5 minutes just in case the image takes a while to be pulled. Maybe split up the initialDelaySeconds config somehow to account for pre container start time?
If the sidecar receives a shutdown request when Unhealthy then it switches to Shutdown state correctly.

How to reproduce it (as minimally and precisely as possible):

Have an image that takes > health check time to pull
Start single GameServer with that image

Anything else we need to know?: Top of the logs from the sidecar, after it had started

λ kubectl logs -f main-gs-aas-v299526-test-j49rc agones-gameserver-sidecar
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":"","Delay":0,"Timeout":0,"Test":"","TestSdkName":"","GRPCPort":9357,"HTTPPort":9358},"featureGates":"CustomFasSyncInterval=false\u0026Example=true\u0026NodeExternalDNS=true\u0026PlayerAllocationFilter=false\u0026PlayerTracking=false\u0026SDKGracefulTermination=false\u0026StateAllocationFilter=false","message":"Starting sdk sidecar","severity":"info","source":"main","time":"2022-04-21T00:18:41.369895949Z","version":"1.19.0"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Created GameServer sidecar","severity":"info","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:41.370918956Z"}
{"grpcEndpoint":"localhost:9357","message":"Starting SDKServer grpc service...","severity":"info","source":"main","time":"2022-04-21T00:18:41.37149882Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","health":{"periodSeconds":5,"failureThreshold":8,"initialDelaySeconds":5},"message":"Setting health configuration","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:41.472153619Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Starting GameServer health checking","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:41.472245556Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Starting SDKServer http health check...","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:41.472279076Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Starting workers...","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"info","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:41.472298631Z","workers":1}
{"httpEndpoint":"localhost:9358","message":"Starting SDKServer grpc-gateway...","severity":"info","source":"main","time":"2022-04-21T00:18:42.372834443Z"}
{"failureCount":1,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:51.474124646Z"}
{"failureCount":2,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:18:56.475155103Z"}
{"failureCount":3,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:01.47547058Z"}
{"failureCount":4,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:06.47631443Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Sending GameServer Event to connectedStreams","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:11.387222057Z"}
{"failureCount":5,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:11.476774679Z"}
{"failureCount":6,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:16.477161414Z"}
{"failureCount":7,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:21.477555406Z"}
{"failureCount":8,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:26.477698963Z"}
{"gameServerName":"main-gs-aas-v299526-test-j49rc","gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer has failed health check","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:26.47776578Z"}
{"gsKey":"updateState","message":"Enqueuing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:26.477816108Z"}
{"gsKey":"updateState","message":"Processing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:26.483253023Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Updating state","severity":"debug","source":"*sdkserver.SDKServer","state":"Unhealthy","time":"2022-04-21T00:19:26.483273909Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Sending GameServer Event to connectedStreams","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:26.504240487Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Event(v1.ObjectReference{Kind:\"GameServer\", Namespace:\"enc-main\", Name:\"main-gs-aas-v299526-test-j49rc\", UID:\"f0388441-6a37-4e4c-848c-7e022a98c509\", APIVersion:\"agones.dev/v1\", ResourceVersion:\"46188246\", FieldPath:\"\"}): type: 'Warning' reason: 'Unhealthy' Health check failure","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:26.505544445Z"}
{"failureCount":9,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:31.478034954Z"}
{"gameServerName":"main-gs-aas-v299526-test-j49rc","gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer has failed health check","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:31.478147892Z"}
{"gsKey":"updateState","message":"Enqueuing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:31.478171955Z"}
{"gsKey":"updateState","message":"Processing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:31.483511497Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Updating state","severity":"debug","source":"*sdkserver.SDKServer","state":"Unhealthy","time":"2022-04-21T00:19:31.483569335Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServerState already unhealthy. Skipping update.","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:31.483603068Z"}
{"failureCount":10,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:36.478548831Z"}
{"gameServerName":"main-gs-aas-v299526-test-j49rc","gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer has failed health check","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:36.478594529Z"}
{"gsKey":"updateState","message":"Enqueuing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:36.478607871Z"}
{"gsKey":"updateState","message":"Processing","queue":"agones.dev.enc-main.main-gs-aas-v299526-test-j49rc","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:36.483902008Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Updating state","severity":"debug","source":"*sdkserver.SDKServer","state":"Unhealthy","time":"2022-04-21T00:19:36.483955823Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServerState already unhealthy. Skipping update.","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:36.483982523Z"}
{"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"Sending GameServer Event to connectedStreams","severity":"debug","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:41.387544651Z"}
{"failureCount":11,"gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer Health Check failed","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:41.47970424Z"}
{"gameServerName":"main-gs-aas-v299526-test-j49rc","gsKey":"enc-main/main-gs-aas-v299526-test-j49rc","message":"GameServer has failed health check","severity":"warning","source":"*sdkserver.SDKServer","time":"2022-04-21T00:19:41.479742887Z"}
...

Environment:

Agones version: 1.19
Kubernetes version (use kubectl version): 1.21
Cloud provider or hardware configuration: AWS EKS
Install method (yaml/helm): helm

markmandel commented 2 years ago

Curious - rather than create an individual GameServer, why don't you increment your fleet size by one?

pgilfillan commented 2 years ago

There is no fleet- this is mostly for the development/testing stages where a lot of different people are using a lot of different build versions, and it would be prohibitively expensive (somewhat related to https://github.com/googleforgames/agones/issues/1782) to have fleets for all of them. An individual server is only started if there's no fleet.

markmandel commented 2 years ago

There is no fleet- this is mostly for the development/testing stages where a lot of different people are using a lot of different build versions, and it would be prohibitively expensive (somewhat related to #1782) to have fleets for all of them. An individual server is only started if there's no fleet.

🤔 if you can create a GameServer and then delete it, you can create a Fleet with 1 instance in it, no?

pgilfillan commented 2 years ago

Yeah we could but then that fleet wouldn't be deleted unless we had some backend cronjob/background process that removed it after some amount of time, and if we had that we could just use it on standalone GameServer's anyway. Basically it would remove the ability for a server version to shut itself down after playing the one match it was requested for.

markmandel commented 1 year ago

Closing this, as it's duplicate of #2966

googleforgames / agones

GameServer going to Unhealthy before either container has started + Shutdown() fails when unhealthy #2549