IDEA: Atlantis Kubernetes Operator with HA support

runatlantis / atlantis

Terraform Pull Request Automation

https://www.runatlantis.io

IDEA: Atlantis Kubernetes Operator with HA support #1428

Open ghostsquad opened 3 years ago

ghostsquad commented 3 years ago

There are a couple of "HA" things I'd love to have for Atlantis. First, I want webhooks to queue, so that I can restart pods and not miss events from GitHub.

Second would be leader election.

YesYouKenSpace commented 3 years ago

I like this idea of not missing anything during upgrades or rolling deployment of Atlantis too. However, I wonder if a queue is overkill. I wonder if a well-defined shutdown hook coupled with the rolling deployment will be good enough. Purely discussing. Because I really want to have this feature too 😆
Do we need leader election? Can't we just have one of the pods deal with it and update the stored Terraform plan? Also, if I am not wrong, we can only achieve HA within a single cluster as of now because Atlantis uses local disk for storage. We need to use external storage if we want to go beyond one cluster. My team deploys on EKS and uses EBS, so HA is impossible beyond one availability zone.

nishkrishnan commented 3 years ago

I think a combination of queues for the webhooks and shutdown hooks would be great for Atlantis. Shutdown hooks alone wouldn't work because webhooks would get dropped during a pod restart.

As for leader election, I'm curious why you think this is necessary. Deployments should take no more than a couple of minutes, and Atlantis is an infrastructure orchestration service, so I can't think of a case where it needs to be online 100% of the time. Not to mention this is a big endeavour, so there needs to be a strong need for something like this before attempting to add it.

ghostsquad commented 3 years ago

Leader election is actually not too terribly hard in Kubernetes.

https://www.github.com/codecentric/kubebuilder-starwars-example/tree/master/vendor%2Fsigs.k8s.io%2Fcontroller-runtime%2Fpkg%2Fleaderelection%2Fleader_election.go

The basic reason I brought up leader election is that I want a PodDisruptionBudget for Atlantis. We run our clusters on spot instances and use SpotInst as a cluster autoscaler. It does a good job of bin packing too, which means we have nodes come and go often.

For the queue, you want webhooks to hit the queue so they are not dropped, even during the short time when two pods are exchanging leadership responsibilities.