srobo / infrastructure

Obsolete. Provisions the base infrastructure in DigitalOcean
MIT License

Replace the Kubernetes cluster #14

Closed RealOrangeOne closed 2 years ago

RealOrangeOne commented 3 years ago

Background: The website, docs, and everything else under studentrobotics.org is proxied through nginx, which runs on a single-node Kubernetes cluster with a load balancer in front of it (to handle TLS).

Kubernetes is massively overkill for our hosting needs at the moment, especially for nginx. It's also more expensive than it needs to be. Since its deployment, the cluster has rarely been touched or maintained.

Instead, I think we should replace this with a single droplet which runs nginx. This is much simpler, easier to manage, and cheaper to run. DO Apps is also a possibility, but makes things harder if we want to run anything else alongside.

I've started working on an Ansible playbook which could be used to deploy this; it's just missing the actual nginx configuration. My intention is that nginx runs on the host, with TLS handled by certbot. The nginx config doesn't change often, so is unlikely to need automatic deployment.
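
To make that concrete, here's a minimal sketch of what the host-level tasks might look like (the package list and certbot invocation are my assumptions about a typical Debian/Ubuntu setup, not the actual playbook; the email is a placeholder):

- name: Install nginx and certbot
  apt:
    name:
      - nginx
      - certbot
      - python3-certbot-nginx
    state: present

- name: Deploy the nginx site config
  copy:
    src: studentrobotics.conf
    dest: /etc/nginx/sites-enabled/studentrobotics.conf

- name: Obtain the initial certificate (certbot installs its own renewal timer)
  command: >
    certbot --nginx -d studentrobotics.org
    --non-interactive --agree-tos -m <admin email here>
  args:
    # skip if a certificate already exists
    creates: /etc/letsencrypt/live/studentrobotics.org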

Interested to hear what others think. Very happy to hear some other ideas!

Tyler-Ward commented 3 years ago

It might be worth considering setting up the nginx redirection service as a docker container (we can just run it with docker-compose rather than needing any more complicated orchestration). That way the server can also host other containerised services without becoming another monolith, and it allows for easier testing of the nginx redirector without needing to build entire copies of the server.

In terms of initial setup we can just expose the nginx container's web ports directly rather than needing any other containers in front of it.
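
For illustration, the compose file for that could be as small as the following (the image tag and volume layout are my assumptions):

version: "3"
services:
  nginx:
    image: nginx:stable
    restart: unless-stopped
    ports:
      # expose the container's web ports directly on the host
      - "80:80"
      - "443:443"
    volumes:
      # redirection/proxy config kept alongside the compose file
      - ./conf.d:/etc/nginx/conf.d:ro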

RealOrangeOne commented 3 years ago

I was also considering this idea, as it would make running future bits of software on the same box significantly simpler. Fortunately because the current setup runs in k8s, it's already dockerized, and we did some work a while ago to separate it from the website repo into https://github.com/srobo/reverse-proxy (although not currently deployed).

Would this container be the single entrypoint to any other web services on the box (and thus handle TLS termination), or should we put something in front of it (e.g. Traefik) to handle TLS termination and perhaps any other future applications? If we're going full docker box, I'd be tempted to say the latter.

Tyler-Ward commented 3 years ago

If we are likely to end up with services that don't need to go through nginx, e.g. on their own subdomain or a simple URL pattern such as "/ide", then adding Traefik now would make sense. If SSL termination isn't currently handled by nginx (I think it was handled by the load balancer), then adding Traefik even just to deal with that might also make sense, if it makes the nginx config easier to understand.

In addition, the newer Traefik versions can also fairly easily route to external services via the file config system if needed, so we could move some simple routing cases out of nginx later if we end up with more in Traefik and want to leave nginx to just handle the complicated cases.
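
For example, routing a path to an external backend is just a small dynamic-config snippet with the file provider (a sketch assuming Traefik v2; the router name and backend URL are invented):

# dynamic configuration loaded via Traefik's file provider
http:
  routers:
    docs:
      rule: "Host(`studentrobotics.org`) && PathPrefix(`/docs`)"
      service: docs-backend
  services:
    docs-backend:
      loadBalancer:
        servers:
          # an external (non-container) service
          - url: "https://docs.example.org/"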

Tyler-Ward commented 3 years ago

If we go down this route, I have put some notes on my implementation below in case they are of any use (most of my servers run dockerised services behind Traefik).

I put all the container configs in /docker (although any directory could be used) and store the configuration for all the containers in git (e.g. their docker-compose files and any config files). I then have Ansible check out that git repo and use docker-compose to start all of the containers. A minimal Ansible config is included below.

- name: Install git
  apt:
    name: git
    state: present

- name: Get container configuration
  git:
    repo: <repo name here>
    dest: /docker

- name: Setup containers
  docker_compose:
    project_src: "{{ item }}"
  with_items:
    # setup utility containers first
    - "/docker/traefik"
    # setup remaining containers
    - "/docker/wiki"

PeterJCLaw commented 3 years ago

The nginx config doesn't change often, so is unlikely to need automatic deployment.

I think there's a false premise here -- that frequent changes are the main reason to automate deployment. I think there's a strong argument for automating deployment to provide reproducibility, to ensure deployment doesn't feel like a chore, and to reduce bus factor.

I would argue that we have already seen that requiring manual deployment of the nginx config has been a blocker to various things and a step backwards from our legacy all-puppet automatic deployment.

PeterJCLaw commented 3 years ago

Regarding the other topics being discussed here, I don't have strong opinions over exactly which technology we use, though I do think it needs to be as boring as possible. Having something which is well understood (in general), which enables easy testing (of all the relevant parts together as well as of sub-units), and which is easy to contribute to is more important than picking the newest, shiniest tool.

In that regard, I would bias strongly towards something like nginx (mature, well known and understood) over Traefik (so small/new it doesn't have a Wikipedia page). This doesn't preclude using less-boring choices, however we should be mindful of how we spend our innovation tokens.

I also think it's important to start with requirements rather than solutions. Are we actually going to need to run several services per VM? Does doing so actually simplify things for us? How do you test all the separated services together? Is it going to be easier for other teams to just be given a VM or for them to provide containers?

If we are going to want to run several containers, what do we need to be able to do to orchestrate them? (What are the reasons that using Kubernetes is the wrong tool?)

Finally, I think we need to consider how much time we have and how that compares to the apparent cost savings of moving away from the current solution (both in terms of the time to do the migration and the ongoing time-cost of the new solution vs the existing one). If the time is limited (it is) and the time/cost saving is small (it seems like it is), then an investment in changing it may not be an effective use of our time. In general, rewrites take several multiples longer than expected and fix only a fraction of the issues; we should be mindful of this.

Tyler-Ward commented 3 years ago

Are we actually going to need to run several services per VM? Does doing so actually simplify things for us? How do you test all the separated services together? Is it going to be easier for other teams to just be given a VM or for them to provide containers?

If we are expecting all the other teams to fully sysop their own environments then just giving them a VM will be easier for us, and the proposed systems above will still allow this. However, that relies on each of the other teams having a decent number of computer science or similar members to maintain those going forwards, and partially defeats the purpose of centralising the server and system management experience into the infrastructure team. For areas like the competition VM, keeping the current separate VM makes sense in the short term. However, if a team or SR in general just needs a single small service (e.g. hosting a wiki, or the new inventory system), it could easily be hosted on the same box without the effort and cost of maintaining another VM.

If we are going to want to run several containers, what do we need to be able to do to orchestrate them? (What are the reasons that using Kubernetes is the wrong tool?)

A fair question. Kubernetes is designed for handling large deployments over lots of hosts with lots of clusters of services, so it has lots of power but is complicated and therefore takes more effort to set up, deploy and manage. docker-compose, which is what I use, is much simpler (at the cost of much reduced functionality) and therefore fairly easy for new people looking at the setup to understand in a short amount of time.

Re cost savings, if I remember correctly a DigitalOcean managed Kubernetes cluster requires 3 hosts and a load balancer, each at £10 per month (assuming the smallest droplet size). This is not an insignificant saving.

PeterJCLaw commented 3 years ago

If we are expecting all the other teams to fully sysop their own environments then just giving them a VM will be easier for us

Yeah, this is roughly the direction I think we should be heading.

that relies on each of the other teams having a decent number of [sysops people] and partially defeats the purpose of centralising the server and system management experience into the infrastructure team.

Ah, I think possibly there's a misunderstanding about the role of the infra team. My understanding is not that the team exists to do all the sysops stuff for all teams, but rather that it exists to ensure there is someone to look after the common bits of SR infrastructure (rather than that falling to the trustees). Mostly this is picking up things which no-one owned before, rather than centralising stuff which was already owned.

The team is likely to have a concentration of expertise and is expected to support other teams where that's reasonable (and to ensure suitable security/access control/continuance of access measures are in place), but does not exist primarily to provide hosting for other teams' stuff. Apologies if we gave the wrong impression in our meeting last week.

If it becomes the case that the infra team can take on a small amount of work to centralise something that other teams are all having to do themselves then it may make sense for it to do that, however I'd rather wait until that was needed and the team knew it had capacity than invest in pre-empting the need.

If we are going to want to run several containers, what do we need to be able to do to orchestrate them? (What are the reasons that using Kubernetes is the wrong tool?)

A fair question. Kubernetes is designed for handling large deployments over lots of hosts with lots of clusters of services, so it has lots of power but is complicated and therefore takes more effort to set up, deploy and manage. docker-compose, which is what I use, is much simpler (at the cost of much reduced functionality) and therefore fairly easy for new people looking at the setup to understand in a short amount of time.

I'm assuming that docker-compose can't configure the host? We'd presumably therefore need something (ansible/puppet/whatever) to configure the host and to trigger docker compose, at which point we have two things. This may still be less complexity than Kubernetes, I don't know, but is worth considering as part of the comparison.

Re cost savings, if I remember correctly a DigitalOcean managed Kubernetes cluster requires 3 hosts and a load balancer, each at £10 per month (assuming the smallest droplet size). This is not an insignificant saving.

From what I can tell we're running quite a lot less than that, so the saving isn't as much as it may appear. Certainly the cost saving alone doesn't feel to me like it's worth a lot of effort. (I'll put the details in Slack so they're not quite as public as GitHub is.)

PeterJCLaw commented 3 years ago

Re cost savings, if I remember correctly a DigitalOcean managed Kubernetes cluster requires 3 hosts and a load balancer, each at £10 per month (assuming the smallest droplet size). This is not an insignificant saving.

From what I can tell we're running quite a lot less than that, so the saving isn't as much as it may appear. Certainly the cost saving alone doesn't feel to me like it's worth a lot of effort. (I'll put the details in Slack so they're not quite as public as GitHub is.)

In adding this to Slack I've re-checked the numbers for the current list prices of what we're using and there is more here than I'd thought, so I am now finding this aspect more convincing. (For reference though, we do only have one droplet here; it's the managed services which bump the cost quite a bit.)

RealOrangeOne commented 3 years ago

There is definitely an argument for "If we're just running 1 droplet with some docker containers, why not just run Kubernetes", but I think for the short term, swapping to a VM is significantly simpler (read: "boring"). If we wanted to actually use Kubernetes, I think it'd warrant a ground-up redesign vs what we have now anyway.

I'm assuming that docker-compose can't configure the host

Correct. But whatever tool we use on top of that (e.g. Ansible) can provision the compose files and deploy them as necessary (imagine them as something like systemd service files).

If we are expecting all the other teams to fully sysop their own environments then just giving them a VM will be easier for us

This is my understanding of our role, yes. In theory, if a team just needs a single application running, then having a shared server to throw it on would save us money. It's also worth noting that given we're operating in a "hosting provider" setting as opposed to a "service provider" one, it's significantly more likely that a volunteer can configure a system than configure a k8s namespace.

That the applications we run are dockerized may be an assumption we can make in future, but it definitely isn't the case now. Conversely, we can assume that certain requests will require a dedicated VM, at least in the short term.

My assumption is that we provide 3 things in this department:

  1. A layer between volunteers and the provisioning of infrastructure. This layer provides permissions, security, and advice.
  2. A consultation service for teams to get input on infrastructure setups, assisting with provisioning, development and maintenance when a team may not have the experience or knowledge.
  3. A verification service: making sure that what SR runs is secure, updated, maintained and follows standards (both industry and SR's own).

Finally, I think we need to consider how much time we have and how that compares to the apparent cost savings

Given the fragility of the current cluster, I think the time investment is definitely worth it. Having a platform that we (the collective volunteers) can maintain more easily and fearlessly, and one that we (the infra team) understand, are two very useful outcomes.

Regarding time, yes there is a cost, but if it's something people find fun (which I for one do), then it can often cost more "time" trying to work with the existing setup.

PeterJCLaw commented 2 years ago

The more I think about our current use case, the more I'm tempted to just push towards putting everything into Ansible as the base layer. From experience, Ansible feels better than puppet and is much simpler than terraform/k8s.

While there may eventually be a point where we want to support a more containerised approach, our current setup is pretty minimal, and running directly in VMs is a setup we already have working and know works.

I think the challenge with something like Ansible for the main NGINX config is whether we want to require a member of the infra team to be involved in merging NGINX routing changes (in terms of approval/visibility, not manual steps). I think we would want to require an infra-team member's approval on PRs which change actual machine configuration, however it's less clear that that needs to be the case for configuring NGINX.

I suppose one approach is to require it to start with and see how much of a blocker it is in practice, then respond accordingly. A setup where NGINX is in a container run directly under docker on an Ansible-managed host could achieve this (hopefully without too much faff), however that does still leave us with multiple layers of things.
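
As a sketch of that layered setup (the image name and host path are hypothetical, and this assumes Ansible's docker_container module):

- name: Run the reverse proxy container
  docker_container:
    name: reverse-proxy
    image: <reverse-proxy image here>
    restart_policy: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      # the NGINX routing config comes from its own checkout, so routing
      # changes can be reviewed and deployed separately from machine config
      - /opt/reverse-proxy/conf:/etc/nginx/conf.d:ro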

PeterJCLaw commented 2 years ago

This happened. monty.studentrobotics.org is the new host, configured via https://github.com/srobo/ansible/.