Open transient1 opened 2 months ago
This doesn't look like an Atlantis issue to me. Atlantis is just executing your Terraform code, which uses the above-mentioned terraform-provider-git provider. The issue must be in the provider, which hangs on a failed interaction with the GitHub API.
Overview of the Issue
We have 13 Terraform files that set up and control 13 repositories. These files all live in a "parent" repo. Each file in the parent repo defines a git_commit changeset (using this provider: https://github.com/arl-sh/terraform-provider-git) and a `github_repository_pull_request` using the official GitHub Terraform provider. Each of the target repos has an `atlantis.yaml` file at its root that points to a directory that will hold Terraform. The parent repo also has an `atlantis.yaml` file at its root that points to the current directory.

When we open a PR in the parent repo, `atlantis plan` runs and completes. Then we comment `atlantis apply`. At this point the Atlantis user, which has a GitHub PAT that gives it permissions to the target repos, runs Terraform that is supposed to open PRs against the target repos, where the commit consists of the files designated in the `git_commit_changeset` resource. Sometimes this works without a hitch. Other times, in the Atlantis UI for the parent repo, we can see entries like the following:

It hangs repeating this message (with incrementing times) for every target repo until we force restart the statefulset.
Reproduction Steps
This might be an issue of scale so not sure if it can be easily reproduced. But essentially you'd need a setup like the above where you have one repo responsible for having atlantis run terraform that opens PRs against a number of target repos.
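If scale is a factor, one thing that could be ruled out is the PAT running into GitHub's primary REST API rate limit while the apply fans out across the target repos. This is only a hypothesis; nothing in the report confirms it. The sketch below queries GitHub's `GET /rate_limit` endpoint and summarizes the `core` bucket. The `GITHUB_TOKEN` environment variable and the function names are assumptions made for illustration, and secondary (abuse) rate limits are not visible through this endpoint.

```python
import json
import os
import urllib.request


def summarize_core_limit(payload: dict) -> dict:
    """Extract the core REST quota from a /rate_limit response body."""
    core = payload["resources"]["core"]
    return {
        "limit": core["limit"],
        "remaining": core["remaining"],
        "reset": core["reset"],  # Unix timestamp when the quota resets
        "exhausted": core["remaining"] == 0,
    }


def fetch_rate_limit(token: str) -> dict:
    """Call GET /rate_limit with the given token and return the parsed JSON."""
    request = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # GITHUB_TOKEN is a hypothetical variable name; use the same PAT Atlantis uses.
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        print(summarize_core_limit(fetch_rate_limit(token)))
    else:
        print("Set GITHUB_TOKEN to the PAT used by Atlantis to run this check.")
```

Running this with the Atlantis PAT while an apply is hanging would show whether `remaining` has dropped to zero. A healthy quota would point away from primary rate limiting as the cause.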
Logs
```
{"level":"warn","ts":"2024-09-03T19:39:41.554Z","caller":"events/events_controller.go:747","msg":"payload signature check failed","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:747\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).handleGithubPost\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:161\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).Post\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:104\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/urfave/negroni/v3.(*Negroni).UseHandler.Wrap.func1\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2938\nnet/http.(*conn).serve\n\tnet/http/server.go:2009"}
```

The above is the only thing we see in the logs.
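For context on the `payload signature check failed` warning: GitHub signs each webhook delivery with an HMAC-SHA256 of the raw request body, keyed by the webhook secret, and sends it in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The receiver recomputes the digest and rejects the delivery when the values differ. The sketch below illustrates that documented scheme only; it is not Atlantis's actual Go implementation, and the function names and sample secret are hypothetical.

```python
import hashlib
import hmac


def compute_signature(secret: bytes, body: bytes) -> str:
    """Return the header value GitHub would send for this body and secret."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def signature_matches(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the received header to the expected one in constant time."""
    expected = compute_signature(secret, body)
    return hmac.compare_digest(expected, header_value)


if __name__ == "__main__":
    secret = b"example-webhook-secret"  # hypothetical secret, for illustration only
    body = b'{"action":"opened"}'       # raw bytes of the webhook request body
    header = compute_signature(secret, body)

    # Same secret on both sides: the check passes.
    print(signature_matches(secret, body, header))
    # Receiver configured with a different secret: the check fails.
    print(signature_matches(b"some-other-secret", body, header))
```

A failed check of this kind usually means the secret configured on the sending webhook differs from `ATLANTIS_GH_WEBHOOK_SECRET`, or that the body was altered in transit. Whether that is related to the hanging applies is not established by this log line alone.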
### Environment details

- Atlantis version: v0.25.0
- Deployment method: ArgoCD (templated Helm manifests)
- If not running the latest Atlantis version have you tried to reproduce this issue on the latest version: no
- Atlantis flags:
  - name: ATLANTIS_FAIL_ON_PRE_WORKFLOW_HOOK_ERROR
    value: "true"
  - name: ATLANTIS_GH_ORG
    value: REDACTED
  - name: ATLANTIS_HIDE_PREV_PLAN_COMMENTS
    value: "true"
  - name: ATLANTIS_LOG_LEVEL
    value: info
  - name: ATLANTIS_SILENCE_ALLOWLIST_ERRORS
    value: "true"
  - name: ATLANTIS_SILENCE_NO_PROJECTS
    value: "false"
  - name: ATLANTIS_SILENCE_VCS_STATUS_NO_PLANS
    value: "false"
  - name: GITHUB_OWNER
    value: REDACTED
  - name: TF_CLI_CONFIG_FILE
    value: REDACTED
  - name: ATLANTIS_ENABLE_DIFF_MARKDOWN_FORMAT
    value: "true"
  - name: ATLANTIS_DATA_DIR
    value: /atlantis-data
  - name: ATLANTIS_REPO_ALLOWLIST
    value: REDACTED
  - name: ATLANTIS_PORT
    value: REDACTED
  - name: ATLANTIS_REPO_CONFIG
    value: REDACTED
  - name: ATLANTIS_ATLANTIS_URL
    value: REDACTED
  - name: ATLANTIS_GH_USER
    value: REDACTED
  - name: ATLANTIS_GH_TOKEN
    valueFrom:
      secretKeyRef: REDACTED
  - name: ATLANTIS_GH_WEBHOOK_SECRET
    valueFrom:
      secretKeyRef: REDACTED

Atlantis server-side config file: Nothing here but pre-workflow hooks to copy necessary secrets and tokens from vault

Repo `atlantis.yaml` file:

```yaml
version: 3
automerge: true
projects:
  - name: REDACTED
    dir: "./"
    workspace: "default"
```

We're running Atlantis as a statefulset in a Kubernetes cluster. Due to our setup it is possible for multiple people to be working on the same parent repo and attempting to run atlantis at the same time, wherein atlantis will respond that it can't run an apply because another PR has the lock. When that occurs we either wait until the other PR has been applied and merged, or we run `atlantis unlock` on the other PR and then run the one we want. Not sure if this can be a contributing factor.

Terraform state is kept in an S3 bucket.

### Additional Context