googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Sidecar occasionally fails to start up #851

Closed TookTheRedBean closed 4 years ago

TookTheRedBean commented 5 years ago

What happened: Out of 10 gameservers in a new fleet, 3 never left the 'Scheduled' state. Further investigation revealed that the game server pod started successfully; however, the agones sidecar logged an error.

Truncated Logs:

{"message":"Starting SDKServer grpc service...","severity":"info"}
{"message":"Starting SDKServer grpc-gateway...","severity":"info"}
{"error":"listen tcp 127.0.0.1:59358: bind: address already in use","message":"Could not serve http server","severity":"fatal"}

After deleting the affected gameservers, new gameservers and sidecars do start up without any issue.

What you expected to happen: If the sidecar of a gameserver fails to initialize, either recycle the gameserver or repair the sidecar.

How to reproduce it (as minimally and precisely as possible): No consistent repro steps. It seems most reproducible when rapidly scaling up a new gameserverset, but the frequency is not consistent.

Anything else we need to know?: Fleet Yaml:

apiVersion: stable.agones.dev/v1alpha1
kind: Fleet
metadata:
  name: battleserver
spec:
  replicas: 10
  template:
    spec:
      health:
        disabled: true
        initialDelaySeconds: 60
      ports:
      - containerPort: 443
        name: default
        protocol: TCP
      template:
        spec:
          containers:
          - name: battleserver
            image: <REMOVED>

Environment:

markmandel commented 5 years ago

Can you provide an example of what your fleet.yaml looks like?

You aren't starting anything else that opens ports in your game server / any other sidecar containers?

TookTheRedBean commented 5 years ago

Updated the original post with the fleet.yaml. No other sidecar containers, but the gameserver does have 3443, 1338, 443 and 80 exposed.

markmandel commented 5 years ago

Weird. I've never seen that before, ever.

Not sure we've tested on 1.13 yet (we're still on 1.11 at this stage, moving to 1.12 probably in this release or the next, now that it has cloud penetration) - not sure if that is a factor? Does this happen on 1.11?

aLekSer commented 5 years ago

I was trying to reproduce this with the xonotic image:

...
      template:
        spec:
          containers:
          - name: battleserver
            image: gcr.io/agones-images/xonotic-example:0.5

By the way, I found a similar issue mentioned in the Kubernetes repo, but there it occurs in an out-of-memory situation: https://github.com/kubernetes/kubernetes/issues/69364

TookTheRedBean commented 5 years ago

Sorry for the delayed response here. We just rolled back to 1.11 [specifically 1.11.10-gke.5] on a new cluster and are still reproducing the sidecar startup issue.

Notably, we've found that this issue does not occur if we disable GKE node pool autoscaling. Our new running hypothesis is that the error case may occur when new GameServers are allocated to a new node while it is still initializing.

markmandel commented 5 years ago

That is fascinating. Not sure if it matters, but is this a regional or zonal cluster?

roberthbailey commented 5 years ago

I've tried to reproduce this (with a zonal cluster) to no avail using the 0.12.0 release of agones with a gke cluster running 1.12.8-gke.10.

I created the cluster using the terraform template, so I had 3 node pools (agones-metrics, agones-system, and default). I enabled autoscaling on the default pool and let it scale from 0 to 20 nodes.

First, I tried slowly scaling up the number of game servers by creating a fleet autoscaler with a buffer of 10 and manually allocating gameservers. I created a fleet of simple udp servers and set the requests and limits to "1Gi" for memory and "1000m" for CPU so that they would cause new nodes to be allocated rather quickly.

After scaling up to >10 nodes (and seeing the GKE master disappear for a few minutes) I found two pods that had a sidecar restart. In my case they restarted because they failed healthchecks (likely because they couldn't reach the k8s apiserver) but I didn't see any errors about an address in use. These gameservers also moved successfully to the Ready state. The logs (for the failed run followed by the successful run) are:

$ kubectl logs -p battleserver-swgp7-dbljk agones-gameserver-sidecar
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":"","Timeout":0,"Test":""},"grpcPort":59357,"httpPort":59358,"message":"Starting sdk sidecar","severity":"info","source":"main","time":"2019-08-26T05:45:16.703604721Z","version":"0.12.0"}
$ kubectl logs battleserver-swgp7-dbljk agones-gameserver-sidecar
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":"","Timeout":0,"Test":""},"grpcPort":59357,"httpPort":59358,"message":"Starting sdk sidecar","severity":"info","source":"main","time":"2019-08-26T05:45:26.79034264Z","version":"0.12.0"}
{"gsKey":"default/battleserver-swgp7-dbljk","message":"created GameServer sidecar","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:45:26.797783797Z"}
{"message":"Starting SDKServer grpc service...","severity":"info","source":"main","time":"2019-08-26T05:45:26.798074732Z"}
{"message":"Starting SDKServer grpc-gateway...","severity":"info","source":"main","time":"2019-08-26T05:45:26.803588519Z"}
{"gsKey":"default/battleserver-swgp7-dbljk","health":{"disabled":true,"initialDelaySeconds":60},"message":"setting health configuration","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:45:26.898374391Z"}
{"gsKey":"default/battleserver-swgp7-dbljk","message":"Starting SDKServer http health check...","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:45:26.898545042Z"}
...

and

$ kubectl logs -p battleserver-swgp7-rnzqd agones-gameserver-sidecar
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":"","Timeout":0,"Test":""},"grpcPort":59357,"httpPort":59358,"message":"Starting sdk sidecar","severity":"info","source":"main","time":"2019-08-26T05:42:13.853690312Z","version":"0.12.0"}
$ kubectl logs battleserver-swgp7-rnzqd agones-gameserver-sidecar
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":"","Timeout":0,"Test":""},"grpcPort":59357,"httpPort":59358,"message":"Starting sdk sidecar","severity":"info","source":"main","time":"2019-08-26T05:42:22.863892623Z","version":"0.12.0"}
{"gsKey":"default/battleserver-swgp7-rnzqd","message":"created GameServer sidecar","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:42:22.873601091Z"}
{"message":"Starting SDKServer grpc service...","severity":"info","source":"main","time":"2019-08-26T05:42:22.873899239Z"}
{"message":"Starting SDKServer grpc-gateway...","severity":"info","source":"main","time":"2019-08-26T05:42:22.881191991Z"}
{"gsKey":"default/battleserver-swgp7-rnzqd","health":{"disabled":true,"initialDelaySeconds":60},"message":"setting health configuration","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:42:22.97412995Z"}
{"gsKey":"default/battleserver-swgp7-rnzqd","message":"Starting SDKServer http health check...","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:42:22.974306272Z"}
{"gsKey":"default/battleserver-swgp7-rnzqd","message":"Starting workers...","queue":"agones.dev.default.battleserver-swgp7-rnzqd","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:42:22.97434013Z","workers":1}
{"gsKey":"default/battleserver-swgp7-rnzqd","message":"Health Ping Received","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T05:42:23.934156311Z"}
...

Since that didn't work I tried to rapidly scale up by just creating a large fleet (without a fleet autoscaler). I deleted the prior fleet and fleet autoscaler and let the cluster autoscaler scale my default node pool down to 4 nodes. I then created a new fleet, this time with 35 game servers to see what would happen. The cluster autoscaler kicked in and scaled from 4 to 12 nodes and all of the gameservers went into the Ready state. This time more of the pods had container restarts (some even restarted twice!). The only gameserver sidecar that didn't show the single line (as above) had:

$ kubectl logs -p battleserver-w9nrf-lq6pg agones-gameserver-sidecar
{"ctlConf":{"Address":"localhost","IsLocal":false,"LocalFile":"","Timeout":0,"Test":""},"grpcPort":59357,"httpPort":59358,"message":"Starting sdk sidecar","severity":"info","source":"main","time":"2019-08-26T06:31:56.88294294Z","version":"0.12.0"}
{"gsKey":"default/battleserver-w9nrf-lq6pg","message":"created GameServer sidecar","severity":"info","source":"*sdkserver.SDKServer","time":"2019-08-26T06:32:06.89007051Z"}
{"message":"Starting SDKServer grpc service...","severity":"info","source":"main","time":"2019-08-26T06:32:06.890264477Z"}
{"message":"Starting SDKServer grpc-gateway...","severity":"info","source":"main","time":"2019-08-26T06:32:06.895092565Z"}
{"message":"shutting down sdk server","severity":"info","source":"main","time":"2019-08-26T06:32:06.929294847Z"}

but even in that case describing the pod shows that the container was killed because it failed the liveness healthcheck.

From my experiment it appears that the sidecar is being successfully restarted when it fails liveness healthchecks. Can you verify whether in your case the sidecar exited and restarted after getting the bind error or if it hung at that point? That error is severity fatal, which should cause the process to exit (and then the kubelet should restart it) but I'm wondering if you were seeing it get stuck instead of exiting.
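
For anyone following along, here is a minimal Go sketch of what that fatal path looks like in general, assuming logrus-style structured logging (which matches the severity/message fields in the logs above); it is an illustration, not the actual Agones sidecar source. A Fatal log exits the process with a non-zero status, after which the kubelet restarts the container:

package main

import (
	"errors"
	"net/http"

	"github.com/sirupsen/logrus"
)

func main() {
	logger := logrus.WithField("source", "main")

	// Stand-in for the sidecar's grpc-gateway HTTP server.
	srv := &http.Server{Addr: "127.0.0.1:59358"}

	// If the port is already taken, ListenAndServe returns straight away
	// with "bind: address already in use".
	if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
		// Fatal logs at severity "fatal" and then calls os.Exit(1), so the
		// container exits with a non-zero status and the kubelet restarts it.
		logger.WithError(err).Fatal("Could not serve http server")
	}
}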

drichardson commented 5 years ago

The sidecar port is 59358, which is in the ephemeral port range on a lot of (most?) systems. The sidecar should be listening on a port that is not in the ephemeral port range.

For example, on one of my Google Compute instances, the ephemeral port range is from 32768-60999:

$ cat /proc/sys/net/ipv4/ip_local_port_range
32768   60999

If a random TCP socket was assigned the sidecar port 59358 before the sidecar started, then you'd get a bind: address already in use error.

The fix for this problem is to give the sidecar a non-ephemeral port; otherwise this is going to be an ongoing, difficult-to-reproduce issue.

For example, just pick something like 9358 (drop the 5) and you're probably fine on most systems. If you want to be ultra-cautious, you can run some platform-specific code to look at the ephemeral port ranges, but that's probably overkill.
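
For illustration, a small Go sketch of that "look at the ephemeral port ranges" idea (Linux-only, reading the same proc file shown above); the candidate port 9358 is just the example number suggested here, not an Agones default:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ephemeralRange reads the kernel's local (ephemeral) port range on Linux.
func ephemeralRange() (low, high int, err error) {
	data, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		return 0, 0, err
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 {
		return 0, 0, fmt.Errorf("unexpected format: %q", data)
	}
	if low, err = strconv.Atoi(fields[0]); err != nil {
		return 0, 0, err
	}
	if high, err = strconv.Atoi(fields[1]); err != nil {
		return 0, 0, err
	}
	return low, high, nil
}

func main() {
	const candidate = 9358 // example fixed port suggested above

	low, high, err := ephemeralRange()
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read ephemeral range:", err)
		os.Exit(1)
	}
	if candidate >= low && candidate <= high {
		fmt.Printf("port %d is INSIDE the ephemeral range %d-%d\n", candidate, low, high)
	} else {
		fmt.Printf("port %d is outside the ephemeral range %d-%d\n", candidate, low, high)
	}
}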

roberthbailey commented 5 years ago

If a random TCP socket was assigned the sidecar port 59358 before the sidecar started, then you'd get a bind: address already in use error.

IIUC, it would need to be a program running within the pod, opening a TCP socket for listening, getting assigned the same port as the sidecar, and binding to that port prior to the sidecar opening its connection, right? If there is something on the underlying host using port 59358 it shouldn't conflict because it isn't in the same network namespace (and it also aligns with the original bug report that restarting the affected gameservers fixes the issue).

drichardson commented 5 years ago

It doesn't matter if it's a listening socket. Every TCP socket needs a port number. Either you can bind one (using the bind function) or the OS will allocate one to you when you connect to the other end. This allocated port comes from the ephemeral port range.

If someone does, say, curl http://example.com before the sidecar runs, it's possible that the OS uses port 59358 as the TCP port for the client socket in the curl command. If that happens and the sidecar attempts to bind to 59358 while that curl is using the socket, you will get an address in use error.
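
Here is a minimal Go sketch of that collision, using a throwaway local listener in place of example.com so it runs without any network access; the port numbers will differ from the sidecar's, but the mechanism is the same:

package main

import (
	"fmt"
	"net"
)

func main() {
	// Stand-in for some server the client talks to first (e.g. the SDK
	// gRPC server on localhost:59357, or example.com in the curl case).
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// The outbound client connection is auto-assigned a local port from
	// the ephemeral range by the kernel.
	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	clientPort := conn.LocalAddr().(*net.TCPAddr).Port
	fmt.Println("client was allocated ephemeral port", clientPort)

	// Trying to start a server on that same port, as the sidecar does on
	// 127.0.0.1:59358, fails with "bind: address already in use" while the
	// client connection holds the port.
	_, err = net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", clientPort))
	fmt.Println("listen on the same port:", err)
}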

drichardson commented 5 years ago

For example, running netstat -an on a google instance, you can see a bunch of local addresses allocated in the ephemeral port range on non-listening sockets.

doug@openvpn:~$ netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.43.2.5:36468         169.254.169.254:80      ESTABLISHED
tcp        0      0 10.43.2.5:36466         169.254.169.254:80      ESTABLISHED
tcp        0    316 10.43.2.5:22            98.234.218.251:55655    ESTABLISHED
tcp        0      0 10.43.2.5:36470         169.254.169.254:80      ESTABLISHED
tcp        0      0 10.43.2.5:36464         169.254.169.254:80      CLOSE_WAIT
tcp6       0      0 :::22                   :::*                    LISTEN

roberthbailey commented 5 years ago

So we could have a degenerate case where the sdk client opens a tcp socket to connect to the server, gets assigned the port number for the server, and prevents the server from being able to bind to the predefined port. Fun.

drichardson commented 5 years ago

@roberthbailey exactly. Which is why it hardly ever reproduces. I'm actually surprised @rennekon had 3 servers out of 10 that had the problem in the initial report.

drichardson commented 5 years ago

@roberthbailey Here's a small C program to demonstrate the problem.

https://github.com/drichardson/examples/blob/master/network/ephemeral.c

Here's the result of running it several times in a row on an Ubuntu 18.04 GCE instance:

doug@instance-1:~/examples/network$ ./ephemeral
Connected to example.com...
Client was allocated port 44080
bind failed for server. 98: Address already in use
doug@instance-1:~/examples/network$ ./ephemeral
Connected to example.com...
Client was allocated port 44082
bind failed for server. 98: Address already in use
doug@instance-1:~/examples/network$ ./ephemeral
Connected to example.com...
Client was allocated port 44084
bind failed for server. 98: Address already in use

markmandel commented 5 years ago

I just came in here to say - I would never have worked this out. Thanks for coming along!

markmandel commented 5 years ago

@roberthbailey had a random thought last night - not 100% sure if it's a good idea or not.

Does it make sense to make the SDK configurable (maybe through the GameServer CRD?), with a good default? (Implementation-wise, this could be passed down to all containers in the pod as an env variable.)

Reason being - it might be an easier way to have backward compatibility / future proofing.

Maybe from the outset, the default is the original port, and those hitting this issue can configure it to a non-ephemeral port. Eventually we could move the default to a non-ephemeral port, and ideally people will have updated their SDKs (and if they haven't, they could use the config option to set it to the original port).

The other option is to go the other way, make it configurable, but the default is a non-ephemeral port, and those that don't want to update to the latest SDK have to set a configuration value to the previous port version.

Either way we would need to document things - but it would save us from opening two ports, and means we never need to worry about gameservers opening on the same port as the SDK, as it's configurable.

What do you think?

roberthbailey commented 5 years ago

I guess the question is whether we think making it configurable is the right long term API or is it just a bandaid to get us from the ephemeral port to a lower port number.

If we think it's the right API for the sidecar then it saves us from opening two ports and is less complex. The transition plan is relatively easy (as you described) -- I think we'd need to leave the port at the high number to start, update all of the SDKs to be configurable, wait a few releases, then switch the port.

If it's just a bandaid to help with the transition between the old port and the new port, then I think having the code complexity for a release or two is better because it is temporary and we don't pollute the API unnecessarily.

markmandel commented 5 years ago

That's a good point. I guess the only reason that would make it a good long-term strategy is to stop the inevitable "my game server starts up on the same port as the SDK, and I can't change it, what do I do"

I just figure via Murphy's Law that will happen at some point :man_shrugging:

drichardson commented 5 years ago

Maybe from the outset, the default is the original port, and those hitting this issue can configure it to a non-ephemeral port. Eventually we could move the default to a non ephemeral port, and ideally people will have updated their SDKs (and if they haven't they could use the config option to set it to the original port).

This is the solution I'd prefer. I'd add a warning message if the default port was used, letting people know that it's going to change in a release or two and that they need to explicitly set the port if they want to keep using the ephemeral port number.

Unless users can guarantee they will make no TCP connections before the sidecar starts up, this issue can impact anyone at any time.

markmandel commented 5 years ago

I'd add a warning message if the default port was used, letting people know that it's going to change in a release or two and that they need to explicitly set the port if they want to keep using the ephemeral port number.

Maybe the warning can be on the Event in the GameServer - that might be the easiest place to put a warning. I can't think of any other place to do it.

drichardson commented 5 years ago

I'm sure everyone on this thread understands, but just to clarify for anyone else, there are actually two ports that need to be set: 59357 (gRPC) and 59358 (HTTP).

roberthbailey commented 5 years ago

Given the Murphy's Law argument I'm convinced that we should make the sdkserver ports configurable.

Next question: how to express this in the API?

We already have a section called 'ports', but I think it would be confusing to put it in there. We are adding a section called 'logging' with a subfield for 'sdkserver' but I'm wondering if we should invert it and instead do something like:

spec:
  sdkserver:
    grpcPort: 7777
    httpPort: 7778
    logging: Error

Basically, add a configuration block for things related to the sdkserver sidecar. We previously made logging a block so that we could potentially add logging fields for other system services later, but if we have other system services, they may need other configuration anyway and it makes sense to me to keep things grouped by what you are configuring.

The logging blob was added after 1.0 and isn't released yet, so we have a short window where we can change it. If it sticks for 1.1.0 then we will need to think of a different way to represent this -- or maybe there is a better way to represent it irrespective of the logging parameters?

drichardson commented 5 years ago

FYI the default Unreal Engine server port is 7777, so we may want to pick another one ;)

drichardson commented 5 years ago

(I also think it's Unity's default port for UNetManager)

markmandel commented 5 years ago

@roberthbailey I think that configuration is perfect. I had the exact same thoughts, and I think that is a great solution.

roberthbailey commented 5 years ago

/assign

markmandel commented 5 years ago

@aLekSer I just realised something else this will enable us to do - run the SDK conformance tests in parallel on different ports :+1: (because the sdk conformance tests take a long time right now, just because they are in serial) :smile:

roberthbailey commented 5 years ago

That would also be a good way to test that each SDK can actually connect to different ports.

roberthbailey commented 5 years ago

Next up is to design how we dynamically configure the port that the SDK connects to (right now it's compiled in).

Some options:

  1. We add a special environment variable on the gameserver container with the port number
  2. We add an extra command line argument to the gameserver container with the port number
  3. We add an optional argument to the New() constructor for the SDK in each language and let gameserver developers who want to use a non-standard port determine how to configure the port
  4. A combination of one or more of the above.

We want the solution to work across all SDKs / languages (to keep the controller simple).

roberthbailey commented 5 years ago

I was thinking about it more and I think that command line flags aren't going to work, since the gameserver main needs to parse them (not the sdk).

Passing an argument to New() could work, but it requires either recompiling the gameserver container to change the port, or writing custom code to pass values through.

Using environment variables seems more straightforward, but I was concerned that we might have conflicts with user supplied values or between the env and envfromsource in the gameserver container. I did a bit of reading and testing and convinced myself that injecting environment variables should work pretty well. See https://github.com/googleforgames/agones/pull/1086.
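
To sketch what option 1 could look like on the SDK side (in Go, purely as an illustration; the actual variable name and plumbing are whatever the linked PR settles on, and AGONES_SDK_GRPC_PORT here is just a placeholder):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// sdkGRPCPort returns the port the SDK client should dial: a value injected
// into the game server container via an environment variable if present,
// otherwise the compiled-in default (59357).
func sdkGRPCPort() int {
	const defaultPort = 59357
	if v := os.Getenv("AGONES_SDK_GRPC_PORT"); v != "" { // placeholder name
		if p, err := strconv.Atoi(v); err == nil {
			return p
		}
	}
	return defaultPort
}

func main() {
	fmt.Printf("SDK client would dial localhost:%d\n", sdkGRPCPort())
}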

aLekSer commented 5 years ago

Yes, that's true. I will configure them to run in parallel on different ports. Also, I added a PR with a test for the HTTP port.

roberthbailey commented 5 years ago

Update:

The ports used by the sdkserver sidecar can be configured through the gameserver/fleet yaml (starting with the impending 1.1 release) and I've added support to all of the SDKs to connect to the port that is being used by the sidecar.

For backwards compatibility, the default ports are not going to be changed for the 1.1 release, but they will be changed in a future release to a lower-numbered, non-ephemeral port so that by default we don't encounter this issue.