status-im / swarms

Swarm Home. New, completed and in-progress features for Status

Clean up and rebuild our cluster #69

Closed adambabik closed 6 years ago

adambabik commented 6 years ago

Preamble

Idea: #69-cluster-rebuild
Title: Clean up and rebuild our cluster
Status: Draft
Created: 2018-01-15

Summary

Our current cluster is in bad shape and needs to be rebuilt in order to support other activities like Continuous Delivery (#64).

Swarm Participants

Product Overview

The goal of this swarm is to fix the state of our cluster. After changes, we expect to have two production clusters and one dev cluster. Also, each cluster will have regular bootnodes and will be easy to extend with new nodes. Finally, each server will have monitoring set up.

Product Description

We need two production clusters to support the current and planned releases. Often, when we upgrade the go-ethereum dependency in status-go, the geth version used by the cluster also needs to be upgraded. Sometimes, these upgrades are not backward compatible. Two clusters allow us to keep both the current and the next release working.

A dev cluster is required for testing new go-ethereum versions.

Also, we need to create real bootnodes in order to gain the flexibility of adding and removing nodes. In our configuration files, we should only put bootnodes, not all existing nodes in the cluster.

Finally, we need to have some basic node monitoring. Currently, we have no idea how much CPU is used or what the disk usage is. It may turn out that we don't need such powerful servers, or that we can get rid of some nodes because they are not necessary.
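One lightweight way to get such server metrics (a sketch, not a decided design: Prometheus' node exporter is just one common option) would be to run a metrics exporter on each server and scrape it centrally:

```shell
# Hypothetical monitoring sketch: expose CPU, memory and disk metrics on each
# server via Prometheus' node exporter (one common choice, not a commitment).
docker run -d --name node-exporter \
  -p 9100:9100 \
  prom/node-exporter

# Verify that metrics are being exposed on the host.
curl -s http://localhost:9100/metrics | head
```

A central Prometheus instance (or any scraper) could then collect these endpoints from all servers, answering the "how much CPU/disk do we actually use" question.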

Requirements & Dependencies

TBD

Minimum Viable Product

Goal Date: TBD

Description: Dev cluster with bootnode and monitoring

  1. There is a single bootnode to which all nodes in the dev cluster connect,
  2. Connecting to the bootnode populates the node's peers successfully using the discovery protocol,
  3. There is a single Ansible command to upgrade/downgrade the geth version of nodes in the cluster,
  4. There are server metrics (CPU, memory and disk usage).
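Points 1 and 2 can be sketched with go-ethereum's stock tooling (key file names, ports and the placeholders are illustrative, not decided):

```shell
# Generate a persistent key for the bootnode so its enode URL stays stable.
bootnode -genkey boot.key

# Print the node ID derived from the key (used to build the enode URL).
bootnode -nodekey boot.key -writeaddress

# Run the bootnode on UDP port 30301 (discovery only, no chain data).
bootnode -nodekey boot.key -addr :30301

# Every other node in the dev cluster then points only at the bootnode;
# peers are found via the discovery protocol, not listed in configuration.
geth --bootnodes "enode://<node-id>@<bootnode-ip>:30301"
```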

Dates

Goal Date: TBD

Description: Prod cluster for the current release

  1. A separate cluster dedicated to production use,
  2. It has the same features as described in the MVP.

Testing Days required: 3

Goal Date: TBD

Description: Prod cluster for the next release

  1. A separate cluster dedicated to production use,
  2. It has the same features as described in the MVP.

Testing Days required: 3

Success Metrics

  1. We are able to manage clusters using Ansible,
  2. Easier configuration of bootnodes in the Status app,
  3. We can dynamically change the number of nodes in clusters without changing the configuration in the app.

Copyright

Copyright and related rights waived via CC0.

adambabik commented 6 years ago

Case study for the dev cluster:

  1. We identify an issue like this: https://github.com/status-im/status-react/issues/3049,
  2. We run the tests on an isolated dev cluster (there are only peers from within the cluster),
  3. We collect peer and message information.

After that, we should be able to:

  1. Draw the state of the cluster,
  2. Inspect each message flow.

dshulyak commented 6 years ago

@adambabik May I ask why you want to use Ansible? I know that it works pretty well, especially for people without an ops background. But what is the reason to use such a tool at all?

The majority of applications can be automated just by using k8s/docker swarm. For example, with k8s it would be super easy to integrate new applications (monitoring/IPFS nodes or eth swarm), add rolling updates for them, and downgrade/upgrade by changing a single field in a YAML file. And the process would be the same on your local machine and on a remote cluster. We can even reuse the same set of machines for 3/n different clusters by deploying them in isolated namespaces. At first glance, downgrades might be a problem, but only if the chain schema is changed, and I believe that doesn't happen very often.
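The "single field" upgrade path would look roughly like this (a hypothetical sketch; the deployment, container and version names are made up):

```shell
# Upgrade geth by changing the image field of a Deployment
# (names and versions here are illustrative only).
kubectl set image deployment/geth-node geth=ethereum/client-go:v1.8.2

# Watch the rolling update progress across the cluster.
kubectl rollout status deployment/geth-node

# A downgrade is just as simple: roll back to the previous revision.
kubectl rollout undo deployment/geth-node
```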

The biggest downside is the cost of deploying and managing a k8s cluster, but if using GKE is an option, I would consider evaluating it as the Status cluster management solution. Also, maybe there is a similar product from AWS as well.

mandrigin commented 6 years ago

+1 for k8s, it is pretty much becoming the industry standard for cluster management at the moment.

UPD: AWS supports managed K8S

adambabik commented 6 years ago

Probably the term "cluster" in this case is not the most fortunate. Maybe we should call it a swarm?

I don't see how any of the k8s features that are not supported by Docker can help us. k8s would introduce huge complexity, and it was built with totally different objectives than a decentralized Ethereum network has. GKE simplifies managing a k8s cluster greatly, but then we make ourselves dependent on a single provider and basically make all our nodes centralized.

I think we can definitely utilize Docker and docker-compose to run Ethereum nodes and other services like monitoring. However, we won't use 95% of k8s features in this work, so it does not make sense to use such a complicated system. In the future, Status should work just fine even if the majority of our nodes are down, or even all of them. By making our nodes centralized, we can accidentally start building stuff that does not work, and is hard to test, in a decentralized setting.

I see our nodes as totally independent servers that do not make any assumptions and do not know about each other.

I would give a thumbs up for Docker and docker-compose, but k8s seems like overkill.

@v2nek, would you like to comment on k8s?

oskarth commented 6 years ago

I don't see how any of the k8s features that are not supported by Docker can help us. k8s would introduce huge complexity, and it was built with totally different objectives than a decentralized Ethereum network has. GKE simplifies managing a k8s cluster greatly, but then we make ourselves dependent on a single provider and basically make all our nodes centralized.

100%.

The default question should be: why k8s, not why Ansible? The former is a lot more complex and has way more moving parts. Having worked in production environments where this type of tech was used, it definitely has its uses, but it is also non-trivial in terms of setup and maintenance.

The starting point should be what problems we are actually having, which I believe @adambabik has the most context for in terms of status-cluster.

adambabik commented 6 years ago

And one more thing about Ansible vs k8s/Docker. Ansible is just a tool to automate stuff. It can still be very useful when one sets up a k8s cluster on-premises. In our setup, Ansible would be useful just to provision servers, but how Ethereum nodes are run is a different question, and Docker or docker-compose can definitely be used.
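A sketch of that split, assuming hypothetical inventory, playbook and file paths: Ansible provisions the servers, while the nodes themselves are defined in a compose file on each host:

```shell
# Provision servers (install Docker, users, firewall rules, ...) with Ansible.
ansible-playbook -i inventory/dev provision.yml

# Upgrade/downgrade geth on all nodes with a single ad-hoc command,
# leaving the "how nodes run" question to docker-compose on each host.
ansible geth_nodes -i inventory/dev -m shell \
  -a "docker-compose -f /opt/status/docker-compose.yml pull && \
      docker-compose -f /opt/status/docker-compose.yml up -d"
```

This also satisfies the MVP's "single Ansible command to upgrade/downgrade geth" requirement without Ansible knowing anything about the node software itself.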

dshulyak commented 6 years ago

GKE simplifies managing k8s cluster greatly but then we make ourself dependent on a single provider and basically make all our nodes centralized.

GKE provides the standard k8s API. If we want to switch to our own k8s cluster or to one from AWS, there won't be many issues. Potentially we may have to change the driver behind persistent volumes, but that should be a very easy change. So, it won't be vendor lock-in.

Also, I don't understand how using k8s/docker-swarm would make our nodes more or less centralized. To me it looks exactly the same as using our own set of servers, but in a more efficient way.

The default question should be: why k8s, not why Ansible?

It simplifies deployment and upgrades, and provides high availability out of the box for latency-insensitive apps. From the issue description, this looks like the main problem of the Status cluster at the moment.

As for complexity, what would happen if the server where the bootnode runs fails?

I see our nodes as totally independent servers that do not make any assumptions and do not know about each other.

How does using docker-swarm/k8s change that?

Having worked in production environments where this type of tech was used, it definitely has its uses but it is also non-trivial in terms of setup and maintenance.

I agree. I have experience providing k8s clusters on-premises, and I would prefer to use one from GKE or AWS if possible.

adambabik commented 6 years ago

Potentially we may have to change the driver behind persistent volumes, but that should be a very easy change. So, it won't be vendor lock-in.

Persistent volumes are actually not trivial to change. There are a lot of drivers, and you really need to know what you're doing.

Also, I don't understand how using k8s/docker-swarm would make our nodes more or less centralized. For me, it looks exactly the same as using our own set of servers but in a more efficient way.

In the sense that we would need to put all our servers under the governance of k8s. If we need to run servers somewhere else, in a different provider's datacenter, we either need to put them into the k8s cluster or we need yet another way of managing them. Thus, it would be simpler to have a more general way of doing that.

It simplifies deployment, upgrades and provides high availability out of the box for latency insensitive apps.

Deployment and upgrades can be easily achieved with Docker without using such a complex thing as k8s. I don't see much difference between executing a remote command via Ansible or via kubectl.

High availability in the Ethereum network is achieved by having multiple peers in different zones. It does not rely on HA of particular nodes.

I would suggest reversing the discussion and asking: which features does k8s offer that are not available in Docker/docker-compose and are necessary for us?

As for complexity, what would happen if the server where the bootnode runs fails?

If all bootnode servers used by a given node are unavailable, it won't connect to the Ethereum network. Thus, there should usually be at least two or three bootnodes configured.

I agree. I have experience providing k8s clusters on-premises, and I would prefer to use one from GKE or AWS if possible.

I think it's not an option unless something has changed. We talked about moving to AWS and it was ruled out.

dshulyak commented 6 years ago

I would suggest reversing the discussion and asking: which features does k8s offer that are not available in Docker/docker-compose and are necessary for us?

When you say docker-compose, do you mean with docker-swarm or without?

dshulyak commented 6 years ago

Deployment and upgrades can be easily achieved with Docker without using such a complex thing as k8s. I don't see much difference between executing a remote command via Ansible or via kubectl. High availability in the Ethereum network is achieved by having multiple peers in different zones. It does not rely on HA of particular nodes. I would suggest reversing the discussion and asking: which features does k8s offer that are not available in Docker/docker-compose and are necessary for us?

I think as the team grows and more services are created on the infrastructure, it makes sense to separate concerns. As a developer, I don't usually care where my nodes are or how to access them; the only thing that matters is the amount of resources for running apps. Both docker swarm and k8s can provide this abstraction. Also, any developer will be able to test the deployment of an app locally, without creating VMs for an Ansible deployment.

Another thing is dynamic scaling of the cluster and better utilization of resources. With both docker-swarm and Kubernetes, it is easy to rescale the cluster according to current needs and to place more than one replica of an app on the same server.

Also, kubectl can be used not only to deploy an app but also for troubleshooting, log collection, and health tracking. And all of that from a single place; no need to ssh into every server to understand what goes wrong.
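That single-place workflow looks roughly like this (a sketch; the pod name is illustrative):

```shell
# Cluster-wide health at a glance, no ssh required.
kubectl get pods -o wide

# Inspect one misbehaving node's events and status.
kubectl describe pod geth-node-0

# Stream its logs from anywhere with cluster access.
kubectl logs -f geth-node-0
```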

If all bootnode servers used by a given node are unavailable, it won't connect to the Ethereum network. Thus, usually, there should be at two or three bootnodes configured.

K8s can help here, for example. If a bootnode becomes unreachable, it will reschedule the pod onto another node, and the service IP won't change.

I think it's not an option unless something changed. We talked about moving to AWS and it was ruled out.

If so, maybe docker-swarm makes more sense, as it is easier to set up and maintain.

adambabik commented 6 years ago

When you say docker-compose, do you mean with docker-swarm or without?

Without, just plain docker-compose.

I agree with all the advantages. If we were building regular microservices, I would definitely go for k8s. However, in our case, we may hit very unexpected obstacles and introduce complexity that is simply unnecessary. Lots of these things can be done with docker-compose as well.

docker-swarm may actually be a good tradeoff between k8s and docker-compose, as it's simpler but can still help with managing a cluster.
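The tradeoff is visible in the workflow itself: docker-swarm reuses the compose file format but adds scheduling across machines (a sketch; the stack and service names are illustrative):

```shell
# Plain docker-compose: containers run on the one host you're logged into.
docker-compose up -d

# docker swarm: the same compose file, scheduled across a cluster.
docker swarm init                            # on the first manager node
docker stack deploy -c docker-compose.yml status-cluster
docker service scale status-cluster_geth=5   # rescale without config edits
```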

I guess we can pause this discussion for now :) It was very enlightening and we collected a lot of pros and cons!

oskarth commented 6 years ago

Closing this issue as part of spring cleaning. If this idea is still relevant, please submit a PR per https://github.com/status-im/ideas/#contributing. If you feel closing this issue is a mistake, feel free to re-open.