hammerlab / secotrec

Setup Coclobas/Ketrew Clusters

The default cluster settings might not be ideal to carry the workload that we do #72

Open armish opened 7 years ago

armish commented 7 years ago

Since people seem really concerned about how often we hit kubelet and the other core services, and want them to live long and prosper, some folks in black suits and ties have written lots of articles about how they tamed their clusters. I don't fully believe their happily-ever-after endings, but they might have a point. See the default config for cluster setup:

SYNOPSIS
    gcloud container clusters create NAME [--additional-zones=ZONE,[ZONE,...]]
        [--async] [--cluster-ipv4-cidr=CLUSTER_IPV4_CIDR]
        [--cluster-version=CLUSTER_VERSION]
        [--disable-addons=[DISABLE_ADDONS,...]] [--disk-size=DISK_SIZE]
        [--no-enable-cloud-endpoints] [--no-enable-cloud-logging]
        [--no-enable-cloud-monitoring] [--image-type=IMAGE_TYPE]
        [--machine-type=MACHINE_TYPE, -m MACHINE_TYPE]
        [--max-nodes-per-pool=MAX_NODES_PER_POOL] [--network=NETWORK]
        [--node-labels=[NODE_LABEL,...]] [--num-nodes=NUM_NODES; default="3"]
        [--password=PASSWORD] [--scopes=SCOPE,[SCOPE,...]]
        [--subnetwork=SUBNETWORK] [--tags=TAG,[TAG,...]]
        [--username=USERNAME, -u USERNAME; default="admin"]
        [--zone=ZONE, -z ZONE] [GCLOUD_WIDE_FLAG ...]

I wasn't able to spend too much time on it, but my understanding is that almost all the components of the cluster are optional and there are interesting alternatives for each of the services. For example, if we are not planning to make heavy use of StackDriver, why not ditch the endpoints API, the monitoring, and the cloud logging options? (Correct me if our setup heavily relies on them; I am still trying to figure out the magic behind the coclo+ketrew+seco triplet.) A leaner invocation might look like the sketch below.
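To make this concrete, here's a rough sketch of what a slimmed-down cluster creation could look like. The cluster name, zone, node count, and machine type are placeholders, not our actual values; the three --no-enable-* flags are taken directly from the synopsis above:

    # Hypothetical leaner setup: skip the StackDriver-backed addons.
    # Name, zone, and sizes are placeholders for illustration only.
    gcloud container clusters create example-coclo-cluster \
        --zone=us-east1-c \
        --num-nodes=3 \
        --machine-type=n1-standard-2 \
        --no-enable-cloud-endpoints \
        --no-enable-cloud-logging \
        --no-enable-cloud-monitoring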

Also, I don't know why the default GKE settings add new network rules for each of the nodes, since we can always proxy into the cluster and access nodes/pods (I thought they would sit on a virtual/overlay network and not be public-facing). This is normally not an issue (it is good to be able to just ssh into them), but it significantly caps our capacity to scale fast and get things done fast, due to quotas on the number of IPs/routes. I have to look into this further (we can split this discussion into multiple parts if needed :)). One quick way to check might be the commands below.
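If we want to quantify the quota pressure, listing the routes and firewall rules that GKE created for the cluster could be a start. This is just a sketch; the "gke" name filter is an assumption about GKE's naming scheme for the resources it creates:

    # Sketch: list per-node routes and firewall rules to see how
    # quickly they eat into the project quotas. The "gke" name
    # filter is an assumed prefix, adjust to match our cluster.
    gcloud compute routes list --filter="name~gke"
    gcloud compute firewall-rules list --filter="name~gke"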

But if you agree, it might be worth giving it a shot?

