runatlantis / atlantis

Terraform Pull Request Automation
https://www.runatlantis.io

"atlantis apply" intermittently gets stuck when running terraform that opens github PRs #4892

Open transient1 opened 2 months ago

transient1 commented 2 months ago


Overview of the Issue

We have 13 Terraform files that set up and control 13 repositories. These files all live in a "parent" repo. Each file in the parent repo defines a git_commit changeset (using this provider: https://github.com/arl-sh/terraform-provider-git) and a github_repository_pull_request using the official GitHub Terraform provider. Each of the target repos has an atlantis.yaml file at its root that points to a directory that holds Terraform. The parent repo also has an atlantis.yaml file at its root that points to the current directory.

When we open a PR in the parent repo, atlantis plan runs and completes. Then we comment atlantis apply. At this point the Atlantis user, which has a GitHub PAT that gives it permissions on the target repos, runs Terraform that is supposed to open PRs against the target repos, where each commit consists of the files designated in the git_commit_changeset resource. Sometimes this works without a hitch. Other times, in the Atlantis UI for the parent repo, we see entries like the following:

github_repository_pull_request.${PR1 NAME}: Still creating.... [16m30s elapsed]
github_repository_pull_request.${PR2 NAME}: Still creating.... [35m20s elapsed]
...

It hangs repeating this message (with incrementing times) for every target repo until we force restart the statefulset.
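
To make the hang easier to spot without watching the UI, the progress lines above can be scanned for resources whose elapsed time has passed some limit. The following is only a minimal sketch: the function names and the 10-minute threshold are hypothetical, and it assumes the apply output has already been saved as plain text lines in the format shown above.

```python
import re

# Matches Terraform progress lines such as:
#   github_repository_pull_request.example: Still creating.... [16m30s elapsed]
PROGRESS = re.compile(
    r"^(?P<addr>.+?): Still creating\.+ \[(?P<elapsed>[0-9hms]+) elapsed\]"
)


def elapsed_seconds(text):
    """Convert a duration such as '16m30s' or '1h2m3s' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(value) * units[unit]
               for value, unit in re.findall(r"(\d+)([hms])", text))


def stuck_resources(lines, threshold_seconds=600):
    """Return {resource address: latest elapsed seconds} for every resource
    whose most recent progress line is at or beyond the threshold.

    threshold_seconds is an arbitrary, hypothetical default.
    """
    latest = {}
    for line in lines:
        match = PROGRESS.match(line.strip())
        if match:
            latest[match.group("addr")] = elapsed_seconds(match.group("elapsed"))
    return {addr: secs for addr, secs in latest.items()
            if secs >= threshold_seconds}
```

Calling `stuck_resources(open("apply.log"))` on a saved copy of the output (the file name is an assumption) would list the pull request resources that are still waiting. It only reports the symptom; it does not explain why the provider is blocked.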

Reproduction Steps

This might be an issue of scale so not sure if it can be easily reproduced. But essentially you'd need a setup like the above where you have one repo responsible for having atlantis run terraform that opens PRs against a number of target repos.

Logs

```
{"level":"warn","ts":"2024-09-03T19:39:41.554Z","caller":"events/events_controller.go:747","msg":"payload signature check failed","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:747\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).handleGithubPost\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:161\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).Post\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:104\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/urfave/negroni/v3.(*Negroni).UseHandler.Wrap.func1\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2938\nnet/http.(*conn).serve\n\tnet/http/server.go:2009"}
```

The above is the only thing we see in the logs.

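
For context on the "payload signature check failed" warning: GitHub signs each webhook delivery with an HMAC-SHA256 of the raw request body, keyed by the webhook secret, and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below shows that scheme in general terms. It is not Atlantis's own code, the function names are hypothetical, and nothing here establishes whether the warning is related to the hang.

```python
import hashlib
import hmac


def github_signature(secret: bytes, body: bytes) -> str:
    """Compute the value GitHub places in X-Hub-Signature-256 for a body."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def signature_matches(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the expected signature with the received header value.

    compare_digest avoids leaking information through comparison timing.
    """
    return hmac.compare_digest(github_signature(secret, body), header_value)
```

A mismatch usually means the secret configured on the webhook differs from the one the server was started with (here, the value behind `ATLANTIS_GH_WEBHOOK_SECRET`), or that the request did not come from the expected webhook at all.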
### Environment details

- Atlantis version: v0.25.0
- Deployment method: ArgoCD (templated Helm manifests)
- If not running the latest Atlantis version have you tried to reproduce this issue on the latest version: no
- Atlantis flags:
  - name: ATLANTIS_FAIL_ON_PRE_WORKFLOW_HOOK_ERROR
    value: "true"
  - name: ATLANTIS_GH_ORG
    value: REDACTED
  - name: ATLANTIS_HIDE_PREV_PLAN_COMMENTS
    value: "true"
  - name: ATLANTIS_LOG_LEVEL
    value: info
  - name: ATLANTIS_SILENCE_ALLOWLIST_ERRORS
    value: "true"
  - name: ATLANTIS_SILENCE_NO_PROJECTS
    value: "false"
  - name: ATLANTIS_SILENCE_VCS_STATUS_NO_PLANS
    value: "false"
  - name: GITHUB_OWNER
    value: REDACTED
  - name: TF_CLI_CONFIG_FILE
    value: REDACTED
  - name: ATLANTIS_ENABLE_DIFF_MARKDOWN_FORMAT
    value: "true"
  - name: ATLANTIS_DATA_DIR
    value: /atlantis-data
  - name: ATLANTIS_REPO_ALLOWLIST
    value: REDACTED
  - name: ATLANTIS_PORT
    value: REDACTED
  - name: ATLANTIS_REPO_CONFIG
    value: REDACTED
  - name: ATLANTIS_ATLANTIS_URL
    value: REDACTED
  - name: ATLANTIS_GH_USER
    value: REDACTED
  - name: ATLANTIS_GH_TOKEN
    valueFrom:
      secretKeyRef: REDACTED
  - name: ATLANTIS_GH_WEBHOOK_SECRET
    valueFrom:
      secretKeyRef: REDACTED

Atlantis server-side config file: Nothing here but pre-workflow hooks to copy necessary secrets and tokens from vault

Repo `atlantis.yaml` file:

```yaml
version: 3
automerge: true
projects:
- name: REDACTED
  dir: "./"
  workspace: "default"
```

We're running Atlantis as a statefulset in a Kubernetes cluster. Due to our setup it is possible for multiple people to be working on the same parent repo and attempting to run atlantis at the same time, wherein atlantis will respond that it can't run an apply because another PR has the lock. When that occurs we either wait until the other PR has been applied and merged, or we run `atlantis unlock` on the other PR and then run the one we want. Not sure if this can be a contributing factor. Terraform state is kept in an S3 bucket.

### Additional Context

anryko commented 1 month ago

This doesn't look like an Atlantis issue to me. Atlantis is just executing your Terraform code, which uses the above-mentioned terraform-provider-git provider. The issue must be in the provider, which is hanging on a failed interaction with the GitHub API.