Cluster size blog post

weibeld commented 4 years ago

https://deploy-preview-280--learnk8s.netlify.com/how-many-clusters

danielepolencic commented 4 years ago

Preface. When I started looking into this, I thought it could be an interesting blog post for a reasonable amount of effort. However, as soon as I started digging, I realised that the topic is complex and vast. Getting it right is not trivial, and it might need a series of blog posts instead of just one.

General notes:

"Big cluster" — define "big". I think here the term is overloaded and we should try to avoid misunderstandings by either providing a definition or avoiding the term completely. Big as in Pods, Nodes, teams, resources, environments?
The dimensions covered are many or few apps per cluster. However, there's a third dimension that is only partially considered: teams. You could have one cluster per team. This is popular too.
It's worth clarifying that "single app" is probably a collection of apps that work together to solve a problem. I will never deploy SSO into a cluster and a user service into another. Instead, SSO and user service stay together (same cluster).
It'd be nice to have a recap table.
I think some scenarios are more interesting (or more likely) than others. "One cluster per environment" is very popular and worth discussing. "One cluster per application" is not very popular. Most of the research suggests that prod is always a separate cluster. I feel some of the points are repeated four times, which makes it hard to follow along. I wonder if, instead of structuring the content as cluster A, cluster B, cluster C, we should list the features (isolation, costs, etc.) and score the three or four scenarios against them (see example below).
Some of the points you make could be a bit stronger. When you talk about resource sharing, a good one to mention is DNS. Even if you separate apps with namespaces, any Pod can use the cluster DNS to look up Services in every other namespace.
It's worth mentioning updates. With smaller clusters, we can do blue-green deployments of whole clusters easily. With a bigger cluster (in number of Pods), it's much harder to roll out coordinated updates.
When it comes to costs, masters are not the only issue. You need more ingress controllers, more logging daemons and services, more alerting, etc. The majority of the costs might not come from the masters, but from the N copies of fluentd (in a "large" cluster I might have only one).
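
To make the cost point concrete, here is a rough Python sketch of how per-cluster add-ons multiply with the number of clusters. The component list follows the note above; the CPU and memory figures are made-up placeholders, not measurements, and the names `AddOn` and `addon_overhead` are only for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AddOn:
    """A component that every cluster needs at least one copy of."""
    name: str
    cpu_millicores: int  # placeholder figure, not a measurement
    memory_mib: int      # placeholder figure, not a measurement


# Hypothetical per-cluster footprint; replace with your own numbers.
PER_CLUSTER_ADDONS = [
    AddOn("ingress-controller", 200, 256),
    AddOn("logging-aggregator", 300, 512),
    AddOn("alerting", 100, 128),
]


def addon_overhead(num_clusters: int) -> Dict[str, int]:
    """Total add-on footprint when every cluster runs its own copy."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    cpu = sum(a.cpu_millicores for a in PER_CLUSTER_ADDONS)
    memory = sum(a.memory_mib for a in PER_CLUSTER_ADDONS)
    return {
        "copies": num_clusters * len(PER_CLUSTER_ADDONS),
        "cpu_millicores": num_clusters * cpu,
        "memory_mib": num_clusters * memory,
    }


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(n, addon_overhead(n))
```

The overhead grows linearly with the number of clusters, which is the part that a masters-only comparison would miss.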

My suggestion would be to refactor the content to be a list of points:

Authentication
Updates
Costs
Resilience

And for each of them, we score a particular scenario.

Example for Authentication:

Authentication in a single cluster is straightforward: you set up authn once. With N clusters, you need a way to sync users. You could set up OIDC or use something like Guard. Overall, it gets more complicated the more users and clusters you have.
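
As a sketch of the OIDC option, the Python snippet below generates the same set of `kube-apiserver` OIDC flags for several clusters, so that users live in one identity provider instead of being synced. The issuer URL, client ID, cluster names, and the `email`/`groups` claims are all assumptions for illustration; managed Kubernetes services may not let you set these flags directly.

```python
from typing import Dict, List

# Hypothetical values: the issuer URL, client ID, and cluster names are
# placeholders for illustration, not taken from the article.
ISSUER_URL = "https://idp.example.com"
CLIENT_ID = "kubernetes"


def oidc_apiserver_flags(issuer_url: str, client_id: str) -> List[str]:
    """Return the kube-apiserver OIDC flags for one cluster."""
    return [
        f"--oidc-issuer-url={issuer_url}",
        f"--oidc-client-id={client_id}",
        # Assumes the identity provider emits these two claims.
        "--oidc-username-claim=email",
        "--oidc-groups-claim=groups",
    ]


def flags_per_cluster(cluster_names: List[str]) -> Dict[str, List[str]]:
    """Point every cluster at the same issuer so users are defined once."""
    return {
        name: oidc_apiserver_flags(ISSUER_URL, CLIENT_ID)
        for name in cluster_names
    }


if __name__ == "__main__":
    for cluster, flags in flags_per_cluster(["dev", "staging", "prod"]).items():
        print(cluster, " ".join(flags))
```

Every cluster still needs its own RBAC bindings for those users and groups, so this only removes the user-sync part of the problem.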

The format should give us the opportunity to:

list the challenges
list the mitigations
suggest tools

weibeld commented 4 years ago

I just edited the article again, especially shortening much of the content.

"Big cluster" — define "big". I think here the term is overloaded and we should try to avoid misunderstandings by either providing a definition or avoiding the term completely. Big as in Pods, Nodes, teams, resources, environments?

I added a definition: "big" is meant in terms of nodes and Pods.

The dimensions covered are many or few apps per cluster. However, there's a third dimension that is only partially considered: teams. You could have one cluster per team. This is popular too.

This can be seen as a special case of a cluster per app, which was also mentioned in the article.

It's worth clarifying that "single app" is probably a collection of apps that work together to solve a problem. I will never deploy SSO into a cluster and a user service into another. Instead, SSO and user service stay together (same cluster).

This should be obvious from the illustrations, but I made it more explicit in the text too.

It'd be nice to have a recap table.

I added a table at the end.

I think some scenarios are more interesting (or more likely) than others. "One cluster per environment" is very popular and worth discussing. "One cluster per application" is not very popular. Most of the research suggests that prod is always a separate cluster. I feel some of the points are repeated four times, which makes it hard to follow along. I wonder if, instead of structuring the content as cluster A, cluster B, cluster C, we should list the features (isolation, costs, etc.) and score the three or four scenarios against them (see example below).

The idea was to use the most obvious general dimensions to cover as much as possible of the space. Special cases can then be derived from there.

Some of the points you make could be a bit stronger. When you talk about resource sharing, a good one to mention is DNS. Even if you separate apps with namespaces, any Pod can use the cluster DNS to look up Services in every other namespace.

I also added the example with DNS.
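
For reference, a minimal Python sketch of what the DNS example relies on: Services get a predictable name of the form `<service>.<namespace>.svc.<cluster-domain>`, so a Pod in one namespace can look up a Service in another. The service and namespace names are hypothetical, and `cluster.local` is only the common default cluster domain.

```python
import socket
from typing import List


def service_fqdn(service: str, namespace: str,
                 cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve_service(service: str, namespace: str, port: int) -> List[str]:
    """Resolve a Service to its IP addresses.

    Only meaningful when run inside a Pod that uses the cluster DNS.
    """
    host = service_fqdn(service, namespace)
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr).
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hypothetical names: an "sso" Service in an "auth" namespace.
    print(service_fqdn("sso", "auth"))
```

Nothing in the name itself restricts who may resolve it, which is why DNS is a good example of a resource that namespaces share.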

It's worth mentioning updates. With smaller clusters, we can do blue-green deployments of whole clusters easily. With a bigger cluster (in number of Pods), it's much harder to roll out coordinated updates.

I don't understand this.

When it comes to costs, masters are not the only issue. You need more ingress controllers, more logging daemons and services, more alerting, etc. The majority of the costs might not come from the masters, but from the N copies of fluentd (in a "large" cluster I might have only one).

fluentd doesn't really cost money. At most, it uses resources, which is covered in the "Inefficient resource usage" point. There may be paid services that are billed per cluster, but I'm not sure whether, for example, Datadog charges for each monitored cluster or lets you use the same monitor for multiple clusters. The same goes for other paid services.

My suggestion would be to refactor the content to be a list of points:

With this other approach, the focus would probably be more on tools and less on the difference between large and small clusters. Also, it's hard to justify this specific set of features, as choosing these and not others is somewhat arbitrary.

danielepolencic commented 4 years ago

A few more items that we could improve:

[x] add a TL;DR with the recap table at the end.
[ ] Your diagrams are getting better and more frequent. I like that a lot. I think the next step is using a softer palette of colours. As an example, have a look at these

Also, I'm investigating converting all fonts into shapes to address your concerns. It won't happen soon, but I'm on it.

weibeld commented 4 years ago

I changed the title and added a TL;DR with the table at the beginning.

I wanted to use "...how many clusters should you have?" for the title, but it's 69 characters and only 65 are allowed.

A colour palette is what I've been looking for for a long time. Couldn't we define a common colour palette that we use for all diagrams?

Otherwise, the article should be ready to be published.

danielepolencic commented 4 years ago

I think the title could be: Architecting Kubernetes clusters — how many should you have? as the word cluster is repeated.
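
A quick Python check of the lengths, assuming the 65-character limit quoted earlier in the thread. `SHORT_TITLE` is the title proposed above. `LONG_TITLE` is an assumed reconstruction of the earlier candidate (the same title with "clusters" repeated), since the thread only quotes its ending.

```python
TITLE_LIMIT = 65  # limit quoted earlier in the thread

# "\u2014" is the dash used in the proposed title.
SHORT_TITLE = "Architecting Kubernetes clusters \u2014 how many should you have?"

# Assumption: the earlier candidate was the same title with "clusters" repeated.
LONG_TITLE = "Architecting Kubernetes clusters \u2014 how many clusters should you have?"


def fits(title: str, limit: int = TITLE_LIMIT) -> bool:
    """Return True if the title is within the character limit."""
    return len(title) <= limit


if __name__ == "__main__":
    for title in (LONG_TITLE, SHORT_TITLE):
        print(len(title), fits(title), title)
```

Under that assumption the longer variant comes out at 69 characters, matching the figure mentioned above, and the shorter one at 60.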

learnk8s / learnk8s.io

Cluster size blog post #280