adambabik closed this issue 6 years ago
Case study for the dev cluster:
After that, we should be able to:
@adambabik May I ask why you want to use Ansible? I know that it works pretty well, especially for people without an ops background. But what is the reason to use such a tool at all?
The majority of applications can be automated just by using k8s/docker swarm. For example, with k8s it will be super easy to integrate new applications (monitoring/IPFS nodes or eth swarm), add rolling updates for them, and downgrade/upgrade by changing a single field in a yaml file. And the process would be the same on your local machine and on a remote cluster. We can even reuse the same set of machines for 3/n different clusters by deploying them in isolated namespaces. At first glance, the downgrade might be a problem, but only if the chain schema changes, and I believe that doesn't happen very often.
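As a sketch of the "single field" upgrade path described above (the image tag, names, and namespace are illustrative assumptions, not an actual Status manifest):

```yaml
# Hypothetical k8s Deployment for an Ethereum node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eth-node
  namespace: dev          # the same manifest can target dev/prod via namespaces
spec:
  replicas: 3
  selector:
    matchLabels:
      app: eth-node
  strategy:
    type: RollingUpdate   # k8s rolls pods over one by one on change
  template:
    metadata:
      labels:
        app: eth-node
    spec:
      containers:
        - name: geth
          image: ethereum/client-go:v1.8.2   # upgrade/downgrade by editing this one field
```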
The biggest downside is the cost of deploying and managing a k8s cluster, but if using GKE is an option, I would consider evaluating it as a Status cluster management solution. Also, maybe there is a similar product from AWS as well.
+1 for k8s, it is pretty much becoming the industry standard for cluster management at the moment.
UPD: AWS supports managed K8S
Probably the term "cluster" is not the most fortunate choice in this case. Maybe we should call it a swarm?
I don't see how any of the k8s features that are not supported by Docker can help us. k8s would introduce huge complexity, and it was built with totally different objectives than a decentralized Ethereum network has. GKE simplifies managing a k8s cluster greatly, but then we make ourselves dependent on a single provider and basically make all our nodes centralized.
I think we can definitely utilize Docker and docker-compose to run Ethereum nodes and other services like monitoring. However, we wouldn't use 95% of k8s features in this work, so it does not make sense to use such a complicated system. In the future, Status should work just fine even if the majority of our nodes are down, or even all of them. By making our nodes centralized, we can accidentally start building stuff that does not work, and is hard to test, in a decentralized setting.
I see our nodes as totally independent servers that do not make any assumptions about and do not know about each other.
I would give a thumbs up for Docker and docker-compose, but k8s seems like overkill.
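For illustration, a minimal docker-compose sketch of the kind of setup described above — one Ethereum node plus basic host monitoring. Service names, images, and flags are assumptions, not the actual Status configuration:

```yaml
# Hypothetical docker-compose.yml: one geth node plus a metrics exporter.
version: "3"
services:
  geth:
    image: ethereum/client-go:stable
    command: --syncmode fast --metrics
    ports:
      - "30303:30303"       # devp2p
    volumes:
      - chaindata:/root/.ethereum
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"         # host metrics, e.g. for Prometheus to scrape
volumes:
  chaindata:
```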
@v2nek, would you like to comment on k8s?
I don't see how any of the k8s features that are not supported by Docker can help us. k8s would introduce huge complexity, and it was built with totally different objectives than a decentralized Ethereum network has. GKE simplifies managing a k8s cluster greatly, but then we make ourselves dependent on a single provider and basically make all our nodes centralized.
100%.
The default question should be: why k8s, not why Ansible? Since the former is a lot more complex and has way more moving parts. Having worked in production environments where this type of tech was used, it definitely has its uses but it is also non-trivial in terms of setup and maintenance.
The starting point should be what problems we are actually having, which I believe @adambabik has the most context for in terms of status-cluster.
And one more thing about Ansible vs k8s/Docker. Ansible is just a tool to automate stuff. It still can be very useful when one sets up a k8s cluster on-premises. In our setup, Ansible would be useful to just provision servers but how Ethereum nodes are run is a different question and Docker or docker-compose can definitely be used.
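A rough sketch of what "Ansible just for provisioning" could look like — the hosts group and package names are hypothetical:

```yaml
# Hypothetical playbook: provision Docker on each server.
# How containers are actually run is decided elsewhere (compose, swarm, ...).
- hosts: eth_nodes
  become: true
  tasks:
    - name: Install Docker
      apt:
        name: docker.io
        state: present
        update_cache: true
    - name: Ensure Docker is running and enabled
      service:
        name: docker
        state: started
        enabled: true
```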
GKE simplifies managing a k8s cluster greatly, but then we make ourselves dependent on a single provider and basically make all our nodes centralized.
GKE provides the standard k8s API. If we want to switch to our own k8s cluster or one from AWS, there won't be many issues. Potentially we may have to change the driver behind persistent volumes, but that should be a very easy change. So it won't be vendor lock-in.
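To illustrate the claim: the provider-specific part is typically confined to the storage class, while the workload's claim stays the same across providers (names and sizes below are illustrative):

```yaml
# Hypothetical PVC: only storageClassName ties this to a provider.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chaindata
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard   # e.g. backed by GCE PD on GKE, EBS on AWS
  resources:
    requests:
      storage: 200Gi
```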
Also, I don't understand how using k8s/docker-swarm will make our nodes more or less centralized. To me, it looks exactly the same as using our own set of servers, but in a more efficient way.
The default question should be: why k8s, not why Ansible?
It simplifies deployment and upgrades and provides high availability out of the box for latency-insensitive apps. From the issue description, this looks like the main problem of the status cluster atm.
As for complexity, what would happen if the server where the bootnode runs fails?
I see our nodes as totally independent servers that do not make any assumptions about and do not know about each other.
How does using docker-swarm/k8s change that?
Having worked in production environments where this type of tech was used, it definitely has its uses but it is also non-trivial in terms of setup and maintenance.
I agree. I have experience provisioning k8s clusters on-premises, and I would prefer to use one from GKE or AWS if possible.
Potentially we may have to change the driver behind persistent volumes, but that should be a very easy change. So it won't be vendor lock-in.
Persistent volumes are actually not trivial to change. There are a lot of drivers, and you really need to know what you're doing.
Also, I don't understand how using k8s/docker-swarm will make our nodes more or less centralized. To me, it looks exactly the same as using our own set of servers, but in a more efficient way.
In the sense that we would need to put all our servers under the governance of k8s. If we need to run servers somewhere else, in a different provider's datacenter, we either need to put them into the k8s cluster or we need yet another way of managing them. Thus, it would be simpler to have a more general way of doing that.
It simplifies deployment and upgrades and provides high availability out of the box for latency-insensitive apps.
Deployment and upgrades can be easily achieved with Docker without using such a complex thing as k8s. I don't see much difference between executing a remote command via Ansible or kubectl.
High availability in the Ethereum network is achieved by having multiple peers in different zones. It does not rely on the HA of particular nodes.
I would suggest reversing the discussion and asking: which features does k8s offer that are not available in Docker/docker-compose and are necessary for us?
As for complexity, what would happen if the server where the bootnode runs fails?
If all bootnode servers used by a given node are unavailable, it won't connect to the Ethereum network. Thus, usually, there should be at least two or three bootnodes configured.
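For example, a node's configuration would list several bootnodes so that a single failure isn't fatal. The enode public keys and hostnames below are placeholders, not real node identities:

```yaml
# Hypothetical compose fragment: geth pointed at several bootnodes.
# geth's --bootnodes flag takes a comma-separated list of enode URLs;
# <pubkey-N> stand in for real 64-byte node IDs.
services:
  geth:
    image: ethereum/client-go:stable
    command: >
      --bootnodes
      enode://<pubkey-1>@bootnode-1.example.org:30303,enode://<pubkey-2>@bootnode-2.example.org:30303,enode://<pubkey-3>@bootnode-3.example.org:30303
```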
I agree, I have experience providing k8s clusters on-premise, and I would prefer to use one from GKE or AWS if possible.
I think it's not an option unless something changed. We talked about moving to AWS and it was ruled out.
I would suggest reversing the discussion and asking: which features does k8s offer that are not available in Docker/docker-compose and are necessary for us?
When you say docker-compose, do you mean with docker-swarm or without?
Deployment and upgrades can be easily achieved with Docker without using such a complex thing as k8s. I don't see much difference between executing a remote command via Ansible or kubectl. High availability in the Ethereum network is achieved by having multiple peers in different zones. It does not rely on the HA of particular nodes. I would suggest reversing the discussion and asking: which features does k8s offer that are not available in Docker/docker-compose and are necessary for us?
I think as the team grows and more services are created on the infrastructure, it makes sense to separate concerns. As a developer, I don't usually care where my nodes are and how to access them; the only thing that matters is the amount of resources for running apps. Both docker-swarm and k8s can provide this abstraction. Also, any developer will be able to test the deployment of an app locally, without creating VMs for an Ansible deployment.
Another thing is dynamic scaling of the cluster and better utilization of resources. With both docker-swarm and kubernetes, it is easy to rescale the cluster according to current needs and to place more than one replica of an app on the same server.
Also, kubectl can be used not only to deploy an app but also for troubleshooting, log collection, and health tracking. And all of that from a single place; no need to ssh into every server to understand what went wrong.
If all bootnode servers used by a given node are unavailable, it won't connect to the Ethereum network. Thus, usually, there should be at least two or three bootnodes configured.
K8s can help here, for example: if a bootnode becomes unreachable, it will reschedule the container onto another node, preserving its IP (the service IP won't change).
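The stable-IP claim can be sketched like this: a Service's ClusterIP stays fixed while the pods behind it come and go. Names and ports are assumptions for illustration:

```yaml
# Hypothetical Service: its ClusterIP stays fixed even if the pod behind
# it is rescheduled onto another node; traffic follows the label selector.
apiVersion: v1
kind: Service
metadata:
  name: bootnode
spec:
  selector:
    app: bootnode
  ports:
    - port: 30303
      protocol: UDP     # devp2p discovery
```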
I think it's not an option unless something changed. We talked about moving to AWS and it was ruled out.
If so, maybe docker-swarm makes more sense, as it is easier to set up and maintain.
When you say docker-compose, do you mean with docker-swarm or without?
Without, just plain docker-compose.
I agree with all the advantages. If we were building regular microservices, I would definitely go for k8s. However, in our case, we may hit very unexpected obstacles and introduce complexity that is simply unnecessary. Lots of these things can be done with docker-compose as well.
docker-swarm may actually be a good tradeoff between k8s and docker-compose, as it's simpler but can still help with managing a cluster.
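To illustrate the tradeoff: in swarm mode, the same compose file gains a deploy section for replicas and rolling updates, while everything else stays plain docker-compose (the values below are illustrative):

```yaml
# Hypothetical stack file for `docker stack deploy`; the deploy section is
# what swarm mode adds on top of plain docker-compose (it is ignored by
# `docker-compose up`).
version: "3"
services:
  geth:
    image: ethereum/client-go:stable
    deploy:
      replicas: 2
      update_config:
        parallelism: 1          # rolling update, one container at a time
      restart_policy:
        condition: on-failure
```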
I guess we can pause this discussion for now :) It was very enlightening and we collected a lot of pros and cons!
Closing this issue as part of spring cleaning. If this idea is still relevant, please submit a PR per https://github.com/status-im/ideas/#contributing. If you feel closing this issue is a mistake, feel free to re-open.
Preamble
Summary
Our current cluster is in bad shape and needs to be rebuilt in order to support other activities like Continuous Delivery (#64).
Swarm Participants
Product Overview
The goal of this swarm is to fix the state of our cluster. After changes, we expect to have two production clusters and one dev cluster. Also, each cluster will have regular bootnodes and will be easy to extend with new nodes. Finally, each server will have monitoring set up.
Product Description
We need two production clusters to support the current and planned releases. Often, when we upgrade the go-ethereum dependency in status-go, the geth version used by the cluster also needs to be upgraded. Sometimes, these upgrades are not backward compatible. Two clusters allow us to keep both the current and the next release working.
The dev cluster is required for testing new go-ethereum versions.
Also, we need to create real bootnodes in order to gain the flexibility of adding and removing nodes. In our configuration files, we should list only bootnodes, not all existing nodes in the cluster.
Finally, we need basic node monitoring. Currently, we have no idea how much CPU is used or what the disk usage is. It may turn out that we don't need such powerful servers, or that we can get rid of some nodes because they are not necessary.
Requirements & Dependencies
TBD
Minimum Viable Product
Goal Date: TBD
Description: Dev cluster with bootnode and monitoring
Dates
Goal Date: TBD
Description: Prod cluster for the current release
Testing Days required: 3
Goal Date: TBD
Description: Prod cluster for the next release
Testing Days required: 3
Success Metrics
Copyright
Copyright and related rights waived via CC0.