"atlantis apply" intermittently gets stuck when running terraform that opens github PRs

### Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.

### Overview of the Issue

We have 13 terraform files that set up/control 13 repositories. These files are all in a "parent" repo. Each file in the parent repo defines a git_commit changeset (using this provider: https://github.com/arl-sh/terraform-provider-git) and a github_repository_pull_request using the official github terraform provider. Each of the target repos has an atlantis.yaml file at the root of the repo that points to a directory that holds terraform. The parent repo also has an atlantis.yaml file at its root that points to the current directory.

When we open a PR in the parent repo, `atlantis plan` runs and completes. Then we comment `atlantis apply`. At this point the atlantis user, which has a github PAT that gives it permissions to the target repos, runs terraform that is supposed to open PRs against the target repos, where each commit consists of the files designated in the git_commit_changeset resource. Sometimes this works without a hitch. Other times, in the atlantis UI for the parent repo, we see entries like the following:

github_repository_pull_request.${PR1 NAME}: Still creating.... [16m30s elapsed]
github_repository_pull_request.${PR2 NAME}: Still creating.... [35m20s elapsed]
...

It hangs repeating this message (with incrementing times) for every target repo until we force restart the statefulset.

### Reproduction Steps

This might be an issue of scale, so we're not sure it can be easily reproduced. Essentially you'd need a setup like the above, where one repo is responsible for having atlantis run terraform that opens PRs against a number of target repos.

### Logs

```
{"level":"warn","ts":"2024-09-03T19:39:41.554Z","caller":"events/events_controller.go:747","msg":"payload signature check failed","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:747\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).handleGithubPost\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:161\ngithub.com/runatlantis/atlantis/server/controllers/events.(*VCSEventsController).Post\n\tgithub.com/runatlantis/atlantis/server/controllers/events/events_controller.go:104\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/urfave/negroni/v3.(*Negroni).UseHandler.Wrap.func1\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2938\nnet/http.(*conn).serve\n\tnet/http/server.go:2009"}
```

The above is the only thing we see in the logs.
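For context on the warning above: it is emitted by the webhook handler when an incoming request's signature does not match, and it is unclear whether it has anything to do with the hang. As a rough sketch of what such a check involves, assuming GitHub's documented `X-Hub-Signature-256` scheme (an HMAC-SHA256 of the raw request body, keyed with the webhook secret), the Python below is illustrative only and is not Atlantis's actual code. The function name and the sample secret and payload are hypothetical.

```python
import hashlib
import hmac


def signature_is_valid(secret: bytes, payload: bytes, header_value: str) -> bool:
    """Return True if header_value equals 'sha256=' + HMAC-SHA256(secret, payload)."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the mismatch position is not leaked.
    return hmac.compare_digest(expected.encode(), header_value.encode())


if __name__ == "__main__":
    # Hypothetical values; the secret stands in for ATLANTIS_GH_WEBHOOK_SECRET.
    secret = b"example-webhook-secret"
    body = b'{"action":"opened"}'
    header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

    print(signature_is_valid(secret, body, header))           # matching secret
    print(signature_is_valid(b"other-secret", body, header))  # mismatched secret
```

If the secret configured on a webhook differs from the one the server holds, a check of this kind fails for every delivery from that webhook, which may be one way to narrow down where the warning comes from.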

### Environment details

- Atlantis version: v0.25.0
- Deployment method: ArgoCD (templated Helm manifests)
- If not running the latest Atlantis version have you tried to reproduce this issue on the latest version: no
- Atlantis flags:
  - name: ATLANTIS_FAIL_ON_PRE_WORKFLOW_HOOK_ERROR
    value: "true"
  - name: ATLANTIS_GH_ORG
    value: REDACTED
  - name: ATLANTIS_HIDE_PREV_PLAN_COMMENTS
    value: "true"
  - name: ATLANTIS_LOG_LEVEL
    value: info
  - name: ATLANTIS_SILENCE_ALLOWLIST_ERRORS
    value: "true"
  - name: ATLANTIS_SILENCE_NO_PROJECTS
    value: "false"
  - name: ATLANTIS_SILENCE_VCS_STATUS_NO_PLANS
    value: "false"
  - name: GITHUB_OWNER
    value: REDACTED
  - name: TF_CLI_CONFIG_FILE
    value: REDACTED
  - name: ATLANTIS_ENABLE_DIFF_MARKDOWN_FORMAT
    value: "true"
  - name: ATLANTIS_DATA_DIR
    value: /atlantis-data
  - name: ATLANTIS_REPO_ALLOWLIST
    value: REDACTED
  - name: ATLANTIS_PORT
    value: REDACTED
  - name: ATLANTIS_REPO_CONFIG
    value: REDACTED
  - name: ATLANTIS_ATLANTIS_URL
    value: REDACTED
  - name: ATLANTIS_GH_USER
    value: REDACTED
  - name: ATLANTIS_GH_TOKEN
    valueFrom:
      secretKeyRef: REDACTED
  - name: ATLANTIS_GH_WEBHOOK_SECRET
    valueFrom:
      secretKeyRef: REDACTED

Atlantis server-side config file: Nothing here but pre-workflow hooks to copy necessary secrets and tokens from vault

Repo `atlantis.yaml` file:

```yaml
version: 3
automerge: true
projects:
- name: REDACTED
  dir: "./"
  workspace: "default"
```

We're running Atlantis as a statefulset in a Kubernetes cluster. Due to our setup it is possible for multiple people to be working on the same parent repo and attempting to run atlantis at the same time, wherein atlantis will respond that it can't run an apply because another PR has the lock. When that occurs we either wait until the other PR has been applied and merged, or we run `atlantis unlock` on the other PR and then run the one we want. Not sure if this can be a contributing factor.

Terraform state is kept in an S3 bucket.

### Additional Context
