runatlantis / atlantis

Terraform Pull Request Automation
https://www.runatlantis.io

Support executing terraform commands via k8s jobs #3791

Open george-zubrienko opened 1 year ago

george-zubrienko commented 1 year ago

Describe the user story

I have seen #260 open for a while, with multiple attempts at resolving the issue of project locality in Atlantis. Our organization also has trouble scaling Atlantis, as the single-node machine hosting it cannot cope with all the plans/applies coming in. There are also issues with provider cache sharing, version locking, plan storage and resource usage when both the webserver and Terraform itself run in the same container/pod.

This can be partially worked around by installing several Atlantis charts, avoiding a monorepo and configuring a new webhook for each installation, but that approach has limited scalability: the number of projects and the resources in them will still exceed the number of repos, unless you go repo-per-project.

I want to propose a solution to this, as an opt-in feature, that should allow Atlantis to scale horizontally for larger organizations and teams than the current setup can realistically serve without installing several charts/webhooks.

Describe the solution you'd like

I propose an option to return a specialised CommandRunner - looking at this code piece, I think it should be feasible to run that code in 2 modes:

In values.yaml this could be something like:

remote-execution:
  enabled: true
  stateVolumeName: my-volume
  workerAffinity: {}
  workerTolerations: {}
  workerResources: {}
  workerLimits: {}

If enabled, this flag should also install a CRD (cluster-scoped, with namespaced custom resources) in the cluster Atlantis is installed in:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: worker-jobs.atlantis.runatlantis.io
spec:
  group: atlantis.runatlantis.io
  scope: Namespaced
  names:
    plural: worker-jobs
    singular: worker-job
    kind: AtlantisWorker
    shortNames:
      - aw
  versions:
    - name: v1beta1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                metadata:
                  type: object
                  properties:
                    labels:
                      type: object
                      additionalProperties:
                        type: string
                      nullable: true
                    annotations:
                      type: object
                      additionalProperties:
                        type: string
                      nullable: true
                  description: Job metadata
                template:
                  type: object
                  x-kubernetes-embedded-resource: true
                  x-kubernetes-preserve-unknown-fields: true

This will allow storing the whole k8s Job template on the cluster. The event handling flow will then require an adjustment; I tried to capture it in a Mermaid diagram:

graph TD;  
    A[event] --> B{remote execution enabled?};  
    B -->|No| C[local runner];  
    B -->|Yes| X;  
    X --> D[Prepare EventContext];  
    X --> E[resolve CommandType];  
    X --> F[read job template in RemoteJob object];  
    D --> G[set `cmd` and `args` in container spec];  
    E --> G;  
    F --> G;  
    G --> H[Send generated RemoteJob to the cluster];  
    H --> I[Wait for RemoteJob to complete];  
    I -->|Receive HTTP POST from job| J[Job Completed];  
    I -->|Check status on cluster| J;  
    J --> K[Send a Webhook Event data];  

Note that in the proposed mode (adding a new CommandRunner) the worker will be responsible for VCS communication, since that class has a VcsClient, so maybe checking the job result on the server side is not needed at all.
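
For illustration, here is a minimal sketch of what such a remote runner could look like, assuming the Job template is read from the proposed custom resource and client-go is used to submit it. All type, function and label names are hypothetical; this is not existing Atlantis code.

// Hypothetical sketch of a k8s-backed runner: it takes a pre-loaded Job
// template, injects the resolved Atlantis command into the first container,
// and submits the Job to the cluster.
package remote

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type RemoteJobRunner struct {
	clientset *kubernetes.Clientset
	namespace string
	template  *batchv1.Job // parsed from the worker-job custom resource
}

// Run clones the stored template, sets cmd/args on the worker container and
// creates the Job; VCS feedback is assumed to be posted by the worker itself.
func (r *RemoteJobRunner) Run(ctx context.Context, repo, pull, command string, args []string) error {
	job := r.template.DeepCopy() // assumes the template has at least one container
	job.ObjectMeta = metav1.ObjectMeta{
		GenerateName: fmt.Sprintf("atlantis-%s-", command),
		// Label values must be sanitized to valid label syntax in real code.
		Labels: map[string]string{
			"atlantis.runatlantis.io/repo":    repo,
			"atlantis.runatlantis.io/pull":    pull,
			"atlantis.runatlantis.io/command": command,
		},
	}
	container := &job.Spec.Template.Spec.Containers[0]
	container.Command = []string{"atlantis", command}
	container.Args = args

	_, err := r.clientset.BatchV1().Jobs(r.namespace).Create(ctx, job, metav1.CreateOptions{})
	return err
}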

Describe the drawbacks of your solution

I do not see a lot of challenges in maintaining the k8s integration itself, as the Batch API has been very stable and has plenty of features Atlantis could make use of in the future; for example, suspend could be used for delayed applies. The only caveat is that every minor k8s release brings some exciting new features, so if Atlantis uses them, we'll be forced to maintain a k8s feature compatibility matrix and decide how we support different k8s versions and what is available depending on the version. Also, the current architecture is not very friendly towards remote execution, and it would require some effort to add this feature so it works alongside existing functionality, or works selectively based on the k8s version of the cluster Atlantis is deployed to.
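
As a hedged aside on the suspend feature mentioned above: a delayed apply could in principle be modelled by creating the apply Job with spec.suspend set to true and flipping the flag once the apply is approved. A minimal sketch with client-go, assuming a Kubernetes version where Job suspension is available in batch/v1; names are illustrative only.

// Sketch only: the apply Job is assumed to have been created with
// spec.suspend=true; resuming it once approved is a single patch.
package remote

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// ResumeApplyJob un-suspends a previously created, suspended apply Job.
func ResumeApplyJob(ctx context.Context, cs *kubernetes.Clientset, namespace, jobName string) error {
	patch := []byte(`{"spec":{"suspend":false}}`)
	_, err := cs.BatchV1().Jobs(namespace).Patch(ctx, jobName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}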

Then, this adds a CRD to the chart, which comes with its own fun like migrations. That could be avoided by moving the base template into the app config instead. However, running an external runner requires an image, and regardless of the route chosen this adds maintenance. If we stay with a single image as now, we'll have to add support for a "cli" mode on top of the current "webserver" mode. If we go with two images, that adds a lot of chores to build and publish a second image, etc. Plus some people might want to run their own image, and they will open issues asking for support of that, so a bit of a Pandora's box here.

Last, but not least, running an external process in an environment like k8s always comes with the cost of investing in bookkeeping. What happens if the job fails to execute the command? How do we handle exit 137 or other special exit codes when the container might not be able to communicate its status gracefully? Most likely we'll need some sort of "garbage collector" - or, where I work, we call them "maintainers" - another app instance that handles these edge cases. Note this is not about removing jobs, as the TTL controller handles that with no problem, but rather about situations when somebody runs atlantis plan and gets silence in return because the launched job has crashed due to app misconfiguration etc.
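
To make the bookkeeping concern concrete, a hypothetical "maintainer" loop could list the worker Jobs, detect failed ones, and recover special exit codes such as 137 from the pod status so they can be reported back to the PR instead of silence. A rough sketch, none of it existing Atlantis code:

// Hypothetical "maintainer" sketch: inspect finished worker Jobs and build
// failure reports instead of leaving the user with silence.
package remote

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CheckWorkerJobs lists Atlantis worker Jobs and returns a human-readable
// failure reason for each failed one, keyed by Job name.
func CheckWorkerJobs(ctx context.Context, cs *kubernetes.Clientset, namespace string) (map[string]string, error) {
	failures := map[string]string{}

	jobs, err := cs.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "atlantis.runatlantis.io/command", // only jobs we created
	})
	if err != nil {
		return nil, err
	}

	for _, job := range jobs.Items {
		if !jobFailed(&job) {
			continue
		}
		reason := "worker job failed"
		// Look at the pods to recover special exit codes such as 137 (SIGKILL/OOM).
		pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
			LabelSelector: "job-name=" + job.Name,
		})
		if err == nil {
			for _, pod := range pods.Items {
				for _, st := range pod.Status.ContainerStatuses {
					if t := st.State.Terminated; t != nil && t.ExitCode != 0 {
						reason = fmt.Sprintf("container %q exited with code %d", st.Name, t.ExitCode)
					}
				}
			}
		}
		failures[job.Name] = reason
		// A real implementation would now comment on the associated PR
		// (labels on the Job identify repo and pull request).
	}
	return failures, nil
}

func jobFailed(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}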

Overall, I think all those are manageable, but no doubt this adds a new level of complexity to the app and will require more maintenance than before.

Describe alternatives you've considered

The alternatives will always revolve around either multithreading or running with replicaCount > 1. I'd say the latter would be great, if it is possible and easier to implement than k8s jobs.

george-zubrienko commented 1 year ago

fyi @s-vitaliy

jamengual commented 1 year ago

@george-zubrienko thanks for the detailed description; there is a lot to unpack here.

I recommend looking at https://github.com/lyft/atlantis and their fork to get some ideas: they use Temporal as a scheduler/worker-queue type system, and maybe there is something there that can be reused and upstreamed back to Atlantis.

Just a suggestion.

george-zubrienko commented 1 year ago

I'll take a look on the weekend and circle back here, thank you!

Another step from my end would be to propose concrete tasks to implement and to adjust the list following the discussion.

jamengual commented 1 year ago

Yes, in a way, to build a backlog. We could use a roadmap for that and tag individual issues to the roadmap, but that will be after we agree on an architecture.

WarpRat commented 1 year ago

We actually did this ourselves recently, using just a few bash scripts and Redis for passing completed plans and command output back to Atlantis. It's been working well for us and has let us greatly reduce the footprint of our Atlantis pod. I had hoped to rewrite the bash in a small Go utility and publish the code somewhere public, but haven't had a chance with shifting priorities at work. I'd be happy to share some details of how we approached it, although it's not much more than a proof of concept currently. This would be a great feature to have available natively in Atlantis.

jamengual commented 1 year ago

That would be great. You can share your idea here so it is documented, and maybe we can add it later to the Helm chart or doc site.

george-zubrienko commented 11 months ago

@jamengual sorry for being out for a bit; I'm coming back here and going to increase my activity on this one until the solution has some shape. I've had a bit of a break from OSS since the end of October, so I haven't yet looked at the Lyft stuff - will do shortly.

I have a small suggestion/question. I see several people have already been able to implement some sort of solution by adding a "proxy layer" between Atlantis and the VCS. So I had this idea: maybe it would be cheaper to add some sort of "workload manager" that, to Atlantis, looks like the VCS provider API, but in reality acts as a proxy? That would allow running multiple replicas of Atlantis itself, as long as the proxy can split the work between them.

This would be (potentially) easier to implement and fully opt-in: if the proxy is not enabled, people run vanilla mode. Also, this way the Atlantis core does not have to be changed at all, so a lot of the work involved in aligning changes with other commits is not required.

UPD: This is somewhat similar to Lyft's Gateway, in case what I wrote is confusing. However, I believe this can be simplified if we follow the Atlantis model (PR -> multiple jobs = TF runs) instead of Lyft's model (revisions -> queue -> TF run).
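
To make the "workload manager" idea a bit more tangible, here is a minimal, hypothetical sketch of such a proxy: it pins each repo/PR pair to one Atlantis replica by hashing, so a pull request is always served by the same instance. The payload parsing is stubbed out, and none of the names come from the draft.

// Hypothetical sketch of the proposed "workload manager": a thin reverse proxy
// that receives VCS webhooks and pins each repo/PR pair to one Atlantis
// replica, so the same instance always serves a given pull request.
package gateway

import (
	"fmt"
	"hash/fnv"
	"net/http"
	"net/http/httputil"
	"net/url"
)

type Proxy struct {
	backends []*url.URL // e.g. http://atlantis-0.atlantis:4141, http://atlantis-1.atlantis:4141
}

// pick chooses a backend deterministically from the repo and PR identifiers,
// which a real implementation would parse from the webhook payload.
func (p *Proxy) pick(repo string, pullNum int) *url.URL {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s#%d", repo, pullNum)
	return p.backends[int(h.Sum32())%len(p.backends)]
}

func (p *Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	repo, pullNum := parseWebhook(r)
	target := p.pick(repo, pullNum)
	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
}

// parseWebhook is a stand-in for real, VCS-specific payload parsing.
func parseWebhook(r *http.Request) (string, int) {
	return r.Header.Get("X-Example-Repo"), 1 // placeholder only
}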

jamengual commented 11 months ago

Yes, that is a possibility. I guess getting to a POC level and seeing how that could work will be good for understanding the whole flow.

george-zubrienko commented 11 months ago

Alright, I'll try to conjure a PoC that will function roughly this way:

It will take a couple of weeks to come up with a prototype. I have a bit of vacation at the end of December, so I hope I can present something in January 2024 :)

george-zubrienko commented 10 months ago

@jamengual please take a look at a rough draft: https://github.com/runatlantis/atlantis/compare/main...SneaksAndData:atlantis:gateway. Note this is the first Go project I've worked on, so if some stuff seems fishy, please point it out :)

Also, as a disclaimer, this has by no means been tested yet, nor is it a complete version. I'm sending a diff just to probe whether the idea resonates with the contributors/community well enough before starting any e2e tests.

This requires one more PR to the Atlantis Helm chart to change the ingress. TL;DR, the implementation is as described below.

Atlantis Job Mode

An optional deployment option of the Helm chart. It does not affect or modify any Atlantis code. The diff above contains code for the GitHub VCS only for now.

Job mode changes

Should be enabled via jobMode: enabled in the Helm values. For now I just provide an example of what the template will look like and the PVC it needs. Enabling it changes the following: when a pull request event arrives, the gateway checks whether a worker host is already associated with that PR and, if so, simply relays the request to it:

// RoutePullRequest relays the original webhook request to the Atlantis host
// already associated with the pull request.
func (ers *DefaultEventRoutingService) RoutePullRequest(target models.PullRequestAssociation, webhookRequest *http.Request) (resp *http.Response, err error) {
    return http.Post(target.AssociatedHost().EventsUrl(), "application/json", webhookRequest.Body)
}

If not, it will create a new Job running the Atlantis server with a label/annotation identifying the PR, wait for the pod to come up, and route the event there. All events are posted to a channel on the host and processed sequentially (assuming I get this part of Go correctly :) ).
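
For illustration only (this is not the actual draft code), the lookup-or-create step could look roughly like this with client-go, assuming Jobs are labelled per repo/PR and mount a shared state volume:

// Illustrative sketch of the "find an existing worker for this PR, otherwise
// create one" step, assuming Jobs are labelled with repo and pull number.
package gateway

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureWorkerJob returns the Job serving the given PR, creating it from a
// template if none exists yet. Label values must be sanitized in real code.
func ensureWorkerJob(ctx context.Context, cs *kubernetes.Clientset, namespace string,
	template *batchv1.Job, repo string, pullNum int) (*batchv1.Job, error) {

	selector := fmt.Sprintf("atlantis.runatlantis.io/repo=%s,atlantis.runatlantis.io/pull=%d", repo, pullNum)
	existing, err := cs.BatchV1().Jobs(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	if len(existing.Items) > 0 {
		return &existing.Items[0], nil
	}

	job := template.DeepCopy()
	job.ObjectMeta.GenerateName = "atlantis-worker-"
	job.ObjectMeta.Labels = map[string]string{
		"atlantis.runatlantis.io/repo": repo,
		"atlantis.runatlantis.io/pull": fmt.Sprintf("%d", pullNum),
	}
	// Each worker binds atlantis-data to a PR-specific path on the shared volume;
	// assumes the template's first container mounts that volume first.
	job.Spec.Template.Spec.Containers[0].VolumeMounts[0].SubPath = fmt.Sprintf("%s/%d", repo, pullNum)

	return cs.BatchV1().Jobs(namespace).Create(ctx, job, metav1.CreateOptions{})
}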

In addition, each Job binds its atlantis-data dir to a dedicated path on a file share. As you can see in the job example, I've set the deadline to 24 hours, so hosts serving long-living PRs are not kept around forever. Storing the data on a file share makes it possible to recover the PR on another host - again, assuming I understood the core code correctly.

Job mode capabilities

The way it is implemented, the only thing job mode does is provide a way to handle multiple PRs in multiple repos concurrently, without running into provider code lock issues or the performance problems of a single Atlantis host.

This implementation does not provide any queuing capabilities, so the deployment is still subject to state file lock conflicts if multiple PRs target the same TF directory.

jamengual commented 10 months ago

I'm going on PTO tomorrow, so I will be on/off looking at things, but @GenPage @chenrui333 @nitrocode can review this too and have more experience in k8s than me.

pseudomorph commented 9 months ago

I've only skimmed the entirety of this, so I may be missing the whole context. But would it make more sense to build a remote execution framework (akin to Terraform Cloud worker nodes) that is not strictly tied to kube, and build the kube glue around that?

Just in case there are others who might be wanting to use a different compute platform, but achieve the same results.

Apologies if I'm way off base here.

george-zubrienko commented 9 months ago

@pseudomorph that's a reasonable suggestion. My main reason for going this way is that it requires less effort, but still covers the majority of Atlantis installations - those in kube. The proxy is not tied to Atlantis as such and serves just as a request relay layer to work around Atlantis' horizontal scalability issues. I'd consider this a "v0" implementation that simply allows people who install Atlantis in their clusters via Helm to get a bit more scalability out of the box, if they need it.

Thinking a bit more long-term, a "remote execution framework", either homegrown or imported, would be a more appropriate solution so that non-kube cases can be covered as well (Nomad? :D). I'm a total kube-brain trying to solve our internal issue, but I also feel like we can contribute this upstream.

I now have a strict deadline for this issue in my work plan (end of March 2024), so I'll be doing some e2e testing of this one soon.