GameServer Safe Manual Cluster Node Scaling

markmandel commented 5 years ago

Problem

To build cluster autoscalers, we need a way to declare the state how many nodes there should be in a cluster -- that is GameServer aware, and does not shut down nodes that have GameServers on them that are Allocated.
Current auto scalers will not work, as they are built for stateless solutions - and thus will terminate nodes arbitrarily
This solution should be designed, such that it can be used on multiple clouds - as well as can be independently extended for those that run their own Kubernetes clusters on their own infrastructure.

Design

To control the number of nodes, we can use a CRD, and controller much as we would anything else.

If the CRD is not created in the cluster, then the controller will do nothing.

There could be multiple GameServerNode instances - for example, when you have multiple nodepools in a GKE cluster, you could have one for each node pool.

This would be an example for manipulating nodes on GKE, but if there are issues you see with other infrastructure, please raise them

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerNode
metadata:
spec:
  # the target number of nodes, to be reached safely.
  replicas: 3
  # defaults to auto detection, but could be: GKE, AKS, etc, or custom.
  # For initial implementation, we will do minikube (does nothing) and GKE
  provider: auto
  # this is a map[string]string, that has provider specific configuration. Validated through `Validate`
  config:
      nodepool: gameserver
  # if custom provider is set, then configure the service details here.
  grpcHook:
    # base64 https cert public key
    caBundle: "87zyhshjds---"
    # optional external cluster address
    address: domain:port
    # in cluster service reference (instead of address above)
    service:
      name: gameServerNode
      namespace: agones-system

Question: Not sold on the Kind "GameServerNode" -- anyone got anything better?
Question: I'm assuming we'll use the same cert that we use for webhooks, but the CRD creation will happen outside of the helm install process (at least in this design), so we should documentation on how to get the public key from the secret - I assume that this will work? Should test.

Provider Implementation

The cloud/custom implementation will be implemented as a gRPC service that could be hosted in cluster, or externally. Default implementations for cloud providers will be hosted internally and come bundled with Agones, This will be in a separate binary and deployment from the controller, but in the same namespace.

The proto definition for a K8s provider service:

syntax = "proto3";

service GameServerNode {
    // validates the configuration for a given provider
    rpc Validate(Config) returns (Valid) {}
    // increase the number of nodes. Should only ever increase!
    rpc Increase (Increase) returns (Empty) {}
    // delete one of more specific nodes
    rpc Delete (Nodes) returns (Empty) {}
}

// list of nodes to delete, with the config definitions
message Nodes {
    // list of the K8s node names
    repeated string names = 1;
    Config config = 2;
}

// increase target, sent with config definitions
message Increase {
    int64 increase_size = 1;
    Config config = 2;
}

// the annotations/config definitions
message Config {
    // which provider to use
    string provider = 1;
    map<string, string> values = 2;
}

message Valid {
    // is this config valid ?
    bool is_valid = 1;
    // if not, provide details
    repeated Detail details = 2;

    message Detail {
        // Type matches https://godoc.org/k8s.io/apimachinery/pkg/util/validation/field#ErrorType
        string type = 1;
        //  validation error message
        string message = 2;
        // the field in quesitons
        string field = 3;
    }
}

}
// I am Empty
message Empty {}

Node scaling strategy

Scaling up

Diff the current number of nodes with the target.
If there are nodes that are marked as "Unschedulable", then flip the diff'd number (or as many as are available) of them back to being "Schedulable"
- Sorted the nodes by largest amount of game pods per node, so that we open up the most active of nodes, and can scale down more easily the emptier nodes.
Add the number of nodes to reach the remaining difference to the target nodes, through the grpc service Increase

Scaling Down

Diff the current number of nodes with the target.
Identify the diff'd number of nodes with the least number of Allocated game servers on them and mark them as unschedable.
Delete non Allocated GameServers on the Scheduled nodes, so they can't move to Allocated and keep the node alive.
Once the node has zero game server pods on it, delete the nodes from K8s and use the grpc service Delete to implement the VM deletion of those nodes.

Research

history

markmandel commented 5 years ago

/cc @dgkanatsios , @BrianPeek , @Annonator

Do you see any issues with the above strategy on AKS?

victor-prodan commented 5 years ago

I dont see why we should have GKE config specified differently from other vendor. After all, this config is sent as is tot the Increase method, so it can be a generic key/value meta def. Unless I missunderstood?

dgkanatsios commented 5 years ago

I think what @markmandel means is that each cloud provider has their own API to do scale out/in on their respective managed cluster.

markmandel commented 5 years ago

Exactly what @dgkanatsios says - and also knowing that how to scale each cluster can be different depending on cloud providers. I know GKE uses NodePools, but that's not a ubiquitous construct (at least not AFAIK?)

victor-prodan commented 5 years ago

Of course each provider has a different method, hence the grpc communication with the actual logic. But in the end, the config needed to Increase () can be reduced to an abstract key/value meta for any of them, including GKE.

markmandel commented 5 years ago

@victor-prodan maybe I'm missing something - isn't that exactly what is being passed? Increase gets a int64 to target, and a map<string,string> -- albeit wrapped in some gRPC messages.

Or maybe I explained something badly?

markmandel commented 5 years ago

Or is this a config question?

Are you saying this should be something more like this:

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerNode
metadata:
spec:
  # the target number of nodes, to be reached safely.
  replicas: 3
  # defaults to auto detection, but could be: GKE, AKS, etc, or custom.
  # For initial implementation, we will do minikube (does nothing) and GKE
  provider: auto
  # this is a map[string]string
  config:
      nodepool: gameserver
  # if custom provider is set, then configure the service details here.
  grpcHook:
    # base64 https cert public key
    caBundle: "87zyhshjds---"
    # optional external cluster address
    address: domain:port
    # in cluster service reference (instead of address above)
    service:
      name: gameServerNode
      namespace: agones-system

That would work nicely as well!

victor-prodan commented 5 years ago

Yes, the grpc part looks very generic, but GameServerNode example seems to hint that GKE will get preferential treatment. You mentioned GKE and custom clusters as separate entities, and I wanted to make sure If that's intended. 😊

markmandel commented 5 years ago

The only downside to above, with a generic config is that we can't provide any OpenAPI validation rules for the configuration - we'd have to do it all through the ValidationWebHook, and we'd have to manually parse text to integers.

Unless I'm missing the point here - which is possible.

markmandel commented 5 years ago

oooh @victor-prodan I just meant that I would implement GKE first (just because I know it) -- but the config should work/or be extensible to all K8s providers, and they should be on equal footing. Absolutely! That's quite the point, I wanted to make sure the above would work for all.

markmandel commented 5 years ago

Another version of the config (which I may be leaning more towards, the more I think about it):

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerNode
metadata:
  # this is how you configure
spec:
  # the target number of nodes, to be reached safely.
  replicas: 3
  # defaults to auto detection, but could be: GKE, AKS, etc, or custom.
  # For initial implementation, we will do minikube (does nothing) and GKE
  provider: auto
    # gke config,  which is part of agones and validated
    gke:
      nodepool: my-nodepool
    # aks config, which is part of agones and is validated
    aks:
      whatevermakessensehere: a value that makes sense
    # custom config - which is a map[string]map - not validated, and just passed through.
    custom:
     key: value
  # if custom provider is set, then configure the service details here.
  grpcHook:
    # base64 https cert public key
    caBundle: "87zyhshjds---"
    # optional external cluster address
    address: domain:port
    # in cluster service reference (instead of address above)
    service:
      name: gameServerNode
      namespace: agones-system

Thoughts?

victor-prodan commented 5 years ago

@markmandel? Why not unifying validation as well? How about a Validate(config) method to grpc?

markmandel commented 5 years ago

@victor-prodan that is an excellent idea. Let me work that into my design. I like that a lot!

markmandel commented 5 years ago

Updated the design with @victor-prodan 's suggestion - that's a very elegant solution :+1:

Also made a note "This will be in a separate binary and deployment from the controller, but in the same namespace." - I think this should be in it's own binary, although have all the default cloud providers in the single binary (that binary does the auto detection -- and since we now have Validate - if it can't do auto detection, the Validate will fail, which is nice!). Let me know if you have objections.

Thinking about the Kind more, GameServerNode seems incorrect - since we're not scaling a node, it's more than one node. Some other ideas:

GameServerNodes
GameServerCluster
AgonesCluster
Cluster

I probably am leaning more towards "Cluster", since we have the apiversion. I just worry about collisions with kubectl tooling.

What do you all think?

victor-prodan commented 5 years ago

I would go for GameServerCluster as it's the most self-explanatory imo.

alexandrem commented 5 years ago

Just wanted to share a neat trick that you can do with CRDs and Kubernetes API machinery. Perhaps you could use that for those provider configs.

It's possible to have nested objects that are fully typed and expose them as such in the manifest definitions.

For instance, I used to have something like this:

apiVersion: stable.polykube.com/v1alpha1
kind: NodePool
metadata:
  name: test
spec:
  name: test
  provider: openstack
  providerConfig:
    apiVersion: "openstack.providers.polykube.com/v1alpha1"
    kind: "OpenstackConfig"
    flavorId: "2004"
    imageName: "coreos-stable"
    auth:
      authURL: http://localhost:5000/v3/
      domainName: Default
      region: RegionOne

What is under providerConfig is actually a valid registered schema type in Kubernetes. So it is strongly typed and can be decoded into a runtime object, with proper validation.

Personally, I had it like this:

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
type OpenstackConfig struct {
  metav1.TypeMeta `json:",inline"`

  FlavorID   string `json:"flavorId,omitempty"`
  FlavorName string `json:"flavorName,omitempty"`
  ImageID    string `json:"imageId,omitempty"`
  ImageName  string `json:"imageName,omitempty"`

  // Network contains network config for the node pool nodes
  // +optional
  Network NetworkConfig `json:"network,omitempty"`

  // Auth is the authentication config for the Openstack API
  Auth AuthConfig `json:"auth,omitempty"`
}
...

The object cannot be instantiated alone, it must belong to another runtime object. The trick is with the runtime schema registration, then you also have to define a new kind of serializer and codec factory for decoding the serialized object.

This is a technique that I used in one former project to have more flexibility about the cloud provider node pools representation (and something I've borrowed from the Cluster API SIG).

I can give more implementation details if that's something interesting to you :)

EricFortin commented 5 years ago

What would happen in the GKE use case where we are using(possibly) multiple node pools. We would have multiple GameServerClusters in a single k8s cluster? I am also leaning toward GameServerNodes for this reason.

alexandrem commented 5 years ago

@EricFortin That's a good point, GameServerNodes makes more sense in that regard.

I might have some doubt about the direction of this work. What is the motivation behind developing this new gRPC interface and scaling logic separately?

Basically, why can't we leverage the work being done here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler/cloudprovider

Can we decouple the intelligence of managing the stateful nature of dedicated game servers with the specific implementation of cloud provider infrastructure orchestration?

The way I saw the cluster autoscaling originally for Agones was to add some additional controllers with scaling strategies and generic configs, but then reuse as much possible the work being done in cluster-autoscaler by adding annotations on the nodes where the games and fleets run.

For instance, there's the cluster-autoscaler.kubernetes.io/scale-down-disabled annotation available already and few more. If that doesn't suffice, we could contribute to the project and do the integration.

markmandel commented 5 years ago

@EricFortin agreed -- that is an excellent point. I think we may all have reached quorum on GameServerNodes. 👍

@alexandrem you also raise a good point 👍 - honestly I'd pretty much dismissed the cluster auto scalers, but it looks like they have improved (I wasn't aware of the annotation!). I still have some concerns about scale down, but they may be unfounded -- I'll take another look! If we can take advantage of existing work, I'm all for it!!! If we can get cluster autoscaling just by adding some annotations to our GameServer pods, that's amazing.

The only other initial thought is me wondering if this work can be extended for custom clusters? But i will dig in, and likely defer this to other people who actually run their own custom clusters.

victor-prodan commented 5 years ago

It looks to me that node autoscaling might already be working with GKE. Their autoscaling doc says

A node's deletion could be prevented if it contains a Pod with any of these condtions:

The Pod's affinity or anti-affinity rules prevent rescheduling

This might mean we could just add a pod affinity rule when allocating its gs that prevents rescheduling and it will work.

On top of this, we should add pod affinity rules to schedule new pods on nodes with already allocated gs, and also to alocate gs with priority from nodes which have other allocated gs

Also, as GKE supports a pretty long grace period on deallocation (10m), some games with short gameplay rounds might work with it, leading to a more efficient waste reduction.

markmandel commented 5 years ago

I think we can do it with a pod annotation rather than affinity rules: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node

We can add the annotation to the GameServer pod "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" -- which will prevent the node it is on from being deleted.

In theory, if we set the annotation at the same time we mark a pod as Allocated, that should work pretty well. I still have some concerns regarding race conditions, since the autoscaling code runs in a totally separate process, and what could happen if the master goes down midway between setting the pod and the gameserver allocated state -- but one problem at a time? :man_shrugging:

also to alocate gs with priority from nodes which have other allocated gs

Yes! This! :+1: I had the same thought this morning! We should likely do this anyway - as it means we don't have to create a custom scheduler, as well, which removes another chunk of work.

I've got the initial thoughts in my head for a design on how to do this, but I need to write them down. Once I've done so, I'll add them here/create a new ticket (Not sure?) and we can pick it apart.

I'm definitely leaning towards using the autoscaler as the way to go -- it saves us a lot of work.

markmandel commented 5 years ago

So I wrote up a really rough design using the autoscaler, I ran into a bunch of issues around scaling down, that I'm trying to work out how to resolve:

https://github.com/GoogleCloudPlatform/agones/issues/368

Have a look, would love your feedback. Not quite sure what the right way to go here is.

markmandel commented 5 years ago

Unless anyone has any objections, I think I'm going to close this issue.

I'm working through some of the issues that I'm discovering with #368 (cluster autoscaler), and I think it is the way to go. I think some of the concerns I have will need to be worked out through experimentation and seeing what breaks with the autoscaler - primarily around scaling down.

Worst case - we may need to handle more of scaling down ourselves, than the autoscaler will provide, but we can do that if need be.

markmandel commented 5 years ago

I see no objections :smile: closing in favour of #368

googleforgames / agones