runatlantis / atlantis

Terraform Pull Request Automation
https://www.runatlantis.io
Other
7.83k stars 1.06k forks source link

Unclearable locks with mergeability requirements after upgrade #5048

Open cblkwell opened 3 weeks ago

cblkwell commented 3 weeks ago

Community Note


Overview of the Issue

Yesterday, we upgraded from v0.28.5 to v0.30.0. Afterwards, it appears Atlantis was not respecting the ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY if the apply previously had failed for any reason. We had PRs where someone had tried to apply and it had failed, either because they didn't regenerate a plan after the upgrade (we are not using EFS) or because of some other reason. It then became impossible to apply the PR because the previous apply had failed, leaving the PR unmergeable.

We tried running atlantis unlock and destroying and recreating the ECS service that Atlantis was running under, and neither fixed the issue. Finally, we rolled back to v0.29.0, and this appears to have fixed the issue; PRs once again allowed us to plan and apply as normal.

Reproduction Steps

Running Atlantis v0.30.0 with branch protection requiring atlantis/apply to be successful in order to merge. Fail an apply, then attempt to replan and apply; you should get this error:

Screenshot 2024-10-29 at 9 47 02 AM

Logs

This appears to be the only log message applying to the issue:

Logs ``` {"date":1730209894.274917,"log":"{\"level\":\"error\",\"ts\":\"2024-10-29T13:51:34.274Z\",\"caller\":\"events/instrumented_project_command_runner.go:84\",\"msg\":\"Failure running apply operation: Pull request must be mergeable before running apply.\",\"json\":{\"repo\":\"REDACTED_REPO\",\"pull\":\"3866\"},\"stacktrace\":\"github.com/runatlantis/atlantis/server/events.RunAndEmitStats\\n\\tgithub.com/runatlantis/atlantis/server/events/instrumented_project_command_runner.go:84\\ngithub.com/runatlantis/atlantis/server/events.(*InstrumentedProjectCommandRunner).Apply\\n\\tgithub.com/runatlantis/atlantis/server/events/instrumented_project_command_runner.go:46\\ngithub.com/runatlantis/atlantis/server/events.runProjectCmds\\n\\tgithub.com/runatlantis/atlantis/server/events/project_command_pool_executor.go:48\\ngithub.com/runatlantis/atlantis/server/events.(*ApplyCommandRunner).Run\\n\\tgithub.com/runatlantis/atlantis/server/events/apply_command_runner.go:167\\ngithub.com/runatlantis/atlantis/server/events.(*DefaultCommandRunner).RunCommentCommand\\n\\tgithub.com/runatlantis/atlantis/server/events/command_runner.go:389\"}","container_id":"98c140658909440aab128a263758c770-3286682525","container_name":"atlantis","source":"stderr","ecs_cluster":"atlantis-aws","ecs_task_arn":"arn:aws:ecs:us-east-1:150523386789:task/atlantis-aws/98c140658909440aab128a263758c770","ecs_task_definition":"atlantis-aws:23"} {"date":1730209895.469134,"container_id":"98c140658909440aab128a263758c770-3286682525","container_name":"atlantis","source":"stderr","log":"{\"level\":\"info\",\"ts\":\"2024-10-29T13:51:35.469Z\",\"caller\":\"events/automerger.go:20\",\"msg\":\"not automerging because project at dir \\\"REDACTED_PROJECT\\\", workspace \\\"default\\\" has status \\\"apply_errored\\\"\",\"json\":{\"repo\":\"REDACTED_REPO\",\"pull\":\"3866\"}}","ecs_cluster":"atlantis-aws","ecs_task_arn":"arn:aws:ecs:us-east-1:REDACTED_ACCT:task/atlantis-aws/98c140658909440aab128a263758c770","ecs_task_definition":"atlantis-aws:23"} ```

Environment details

If not already included, please provide the following:

ATLANTIS_VCS_STATUS_NAME = "atlantis-prod"
ATLANTIS_HIDE_PREV_PLAN_COMMENTS = "true"
ATLANTIS_WRITE_GIT_CREDS = "true"
ATLANTIS_SILENCE_NO_PROJECTS = "true"
ATLANTIS_ALLOW_COMMANDS = "version,plan,apply,unlock,approve_policies,import,state"
ATLANTIS_AUTOMERGE = "true"
ATLANTIS_REPO_CONFIG_JSON = "{\"metrics\":{\"prometheus\":{\"endpoint\":\"/metrics\"}},\"repos\":[{\"allowed_custom_workflows\":true,\"allowed_overrides\":[\"apply_requirements\",\"workflow\"],\"apply_requirements\":[\"mergeable\"],\"delete_source_branch_on_merge\":true,\"id\":\"/.*/\",\"import_requirements\":[\"mergeable\"],\"repo_config_file\":\"atlantis-prod.yaml\"}]}"
ATLANTIS_EMOJI_REACTION = "eyes"
ATLANTIS_REPO_ALLOWLIST = REDACTED_REPO
ATLANTIS_EXECUTABLE_NAME = "atlantis-prod"
ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY = "true"
ATLANTIS_USE_TF_PLUGIN_CACHE = "false"
ATLANTIS_DEFAULT_TF_VERSION = "v1.9.8"
ATLANTIS_PORT = "4141"
ATLANTIS_ATLANTIS_URL = REDACTED_URL

Atlantis server-side config file:

repos:
- id: /.*/
  repo_config_file: atlantis-prod.yaml
  apply_requirements: [mergeable]
  import_requirements: [mergeable]
  allowed_overrides: [apply_requirements, workflow]
  allowed_custom_workflows: true
  delete_source_branch_on_merge: true
metrics:
  prometheus:
    endpoint: "/metrics"

Repo atlantis.yaml file:

This is just version: 3 followed by our list of projects.

dosubot[bot] commented 3 weeks ago

Hello, @cblkwell! I'm here to assist you with any bugs, questions, or contributions you may have. Let's work together to resolve your issue.

There are a couple of similar issues that might be relevant to your situation:

  1. Apply fails on Gitlab 16.4.0 with "Pull request must be mergeable before running apply." - This open issue discusses problems with mergeability checks in GitLab, which might have some parallels with your GitHub issue. A suggested workaround involves manually updating the commit status using the GitLab API, which might inspire a similar approach for GitHub [1].

  2. GitHub Mergability check should support ruleset bypasser status - This open issue addresses the need for Atlantis to support certain GitHub ruleset bypass statuses, which could be related to your problem with the ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY flag not functioning as expected [2].

Regarding the ATLANTIS_GH_ALLOW_MERGEABLE_BYPASS_APPLY flag, it is designed to bypass the mergeability check related to the atlantis/apply status. However, it seems that in version 0.30.0, this behavior might not be working as intended if a previous apply operation failed. This could be due to changes in how Atlantis handles mergeability checks or failed apply operations in this version.

Rolling back to version 0.29.0 appears to have resolved the issue for you, indicating a potential regression or change in behavior in version 0.30.0. It might be helpful to monitor the Atlantis repository for any updates or fixes related to this issue.

To continue talking to Dosu, mention @dosu.

stasostrovskyi commented 3 weeks ago

If you are running Atlantis using github app, you may need to give it new permissions - Actions/Read. See https://www.runatlantis.io/docs/access-credentials.html#permissions

cblkwell commented 3 weeks ago

I'll check (I don't have the permissions necessary to do so in our org). Was this something added in 0.30.0? If that's the case, it's probably a good thing to add breaking changes like this to the release notes. I don't see anything in https://github.com/runatlantis/atlantis/releases/tag/v0.30.0 describing this change.

and-ku commented 3 weeks ago

Im having similar issue but mine throws an error heres my Issue https://github.com/runatlantis/atlantis/issues/5038

ajax-ryzhyi-r commented 3 weeks ago

We're experiencing the same issue, despite our GitHub App having all the required permissions. When one of the required checks fails and is then successfully re-run for the same commit, Atlantis still doesn't allow us to apply the PR. Instead, it displays the error mentioned in this issue.

Here's what I found: We have a PR with one required check that passed after a few reruns. It's now in passed state: Screenshot 2024-10-31 at 10 53 15

When we query check runs for the commit as Atlantis does, we see the following response from the GitHub GraphQL API:

 "nodes": [
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "FAILURE"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    }
]

GitHub returns all check runs, including failed ones, in chronological order. After investigating the Atlantis code to understand how it evaluates a commit's PR mergeability status, I found that the issue appears to be in this function: https://github.com/runatlantis/atlantis/blob/main/server/events/vcs/github_client.go#L718-L732

Atlantis iterates through the checkRuns here and determines the required check status based on the first run in the array ignoring the rest of the runs. In our case, it's FAILURE.

The fix seems straightforward. We simply need to evaluate the status of the last checkRun instead of the first one. If the maintainers approve of this proposed solution, I'd be happy to submit a pull request with these changes.

ajax-ryzhyi-r commented 3 weeks ago

We're experiencing the same issue, despite our GitHub App having all the required permissions. When one of the required checks fails and is then successfully re-run for the same commit, Atlantis still doesn't allow us to apply the PR. Instead, it displays the error mentioned in this issue.

Here's what I found: We have a PR with one required check that passed after a few reruns. It's now in passed state: Screenshot 2024-10-31 at 10 53 15

When we query check runs for the commit as Atlantis does, we see the following response from the GitHub GraphQL API:

 "nodes": [
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "FAILURE"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "CANCELLED"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    },
    {
      "__typename": "CheckRun",
      "name": "🔄 Validate pull request",
      "status": "COMPLETED",
      "conclusion": "SUCCESS"
    }
]

GitHub returns all check runs, including failed ones, in chronological order. After investigating the Atlantis code to understand how it evaluates a commit's PR mergeability status, I found that the issue appears to be in this function: https://github.com/runatlantis/atlantis/blob/main/server/events/vcs/github_client.go#L718-L732

Atlantis iterates through the checkRuns here and determines the required check status based on the first run in the array ignoring the rest of the runs. In our case, it's FAILURE.

The fix seems straightforward. We simply need to evaluate the status of the last checkRun instead of the first one. If the maintainers approve of this proposed solution, I'd be happy to submit a pull request with these changes.

@jamengual @nitrocode What do you think?

lukemassa commented 3 weeks ago

👋 @jamengual asked me to help coordinate this issue.

@ajax-ryzhyi-r I agree that's very likely where this logic is happening, and it also seems like that method was introduced in 0.30 so it fits with @cblkwell 's report that downgrading to 0.29 fixes the problem.

I haven't dug into the code yet to tell whether the specific solution you mention will fix the problem, but feel free to submit a PR and I can test and review!

stasostrovskyi commented 3 weeks ago

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

ajax-ryzhyi-r commented 3 weeks ago

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

We're using GitHub Cloud and haven't encountered issues (or haven't noticed) with the atlantis/apply status check. However, we're facing problems with other required checks, specifically GitHub Actions workflow checks. If the required GitHub Actions workflow check fails and we rerun it successfully, Atlantis still treats the PR as unmergeable. I can provide the GraphQL query I used for my investigation if you'd like

ajax-ryzhyi-r commented 3 weeks ago

👋 @jamengual asked me to help coordinate this issue.

@ajax-ryzhyi-r I agree that's very likely where this logic is happening, and it also seems like that method was introduced in 0.30 so it fits with @cblkwell 's report that downgrading to 0.29 fixes the problem.

I haven't dug into the code yet to tell whether the specific solution you mention will fix the problem, but feel free to submit a PR and I can test and review!

Thanks in advance for looking into this issue. I'll try to provide the fix mentioned in my previous comment yesterday 😊

henriklundstrom commented 2 weeks ago

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

We're using GitHub Cloud and haven't encountered issues (or haven't noticed) with the atlantis/apply status check. However, we're facing problems with other required checks, specifically GitHub Actions workflow checks. If the required GitHub Actions workflow check fails and we rerun it successfully, Atlantis still treats the PR as unmergeable. I can provide the GraphQL query I used for my investigation if you'd like

@ajax-ryzhyi-r I'm unable to reproduce the behaviour of the GraphQL API you describe. When I rerun a workflow on the same commit, only the latest run appears in the response. Could you please provide me the query you used?

ajax-ryzhyi-r commented 2 weeks ago

@ajax-ryzhyi-r Are you using github server or cloud? Asking because we weren't able to reproduce the results your graphql query provided and we definitely don't have problems with failed atlantis/apply not being able to be re-applied like in original post. I'm saying we haven't seen this problem because we are the ones who made these changes and have been running them in production on github cloud for better part of the year..

We're using GitHub Cloud and haven't encountered issues (or haven't noticed) with the atlantis/apply status check. However, we're facing problems with other required checks, specifically GitHub Actions workflow checks. If the required GitHub Actions workflow check fails and we rerun it successfully, Atlantis still treats the PR as unmergeable. I can provide the GraphQL query I used for my investigation if you'd like

@ajax-ryzhyi-r I'm unable to reproduce the behaviour of the GraphQL API you describe. When I rerun a workflow on the same commit, only the latest run appears in the response. Could you please provide me the query you used?

@henriklundstrom @stasostrovskyi Sorry for misleading you. Upon reviewing my query, I realized my initial understanding was incorrect. The issue occurs when the workflow is run multiple times for single commits, not when it is rerun for a single commit.

In our case, we have a workflow that validates PRs (checking title, checklists, labels, etc.) with the following triggers:

name: Validate pull request
on:
  pull_request:
    types: [edited, opened, reopened, synchronize]
    branches:
      - master

When this workflow fails, the PR author edits the PR to address the issues, and GitHub automatically triggers the workflow again, creating new run (For some reasonsI thought it reruns failed run ). After these steps, Atlantis is unable to apply the PR, reporting the error mentioned earlier.

However, the proposed fix still appears relevant. I'll only remove the part related to statusContexts, as the issue seems to be specifically about checkRuns (GitHub Actions workflow checks), and create a new PR with the fix for checkRuns only.

I'm not sure our issue is the same as originally mentioned here.

Anyway, here is query I used:

gh api graphql -f query='
query($owner: String!, $name: String!, $number: Int!, $ruleCursor: String, $contextCursor: String) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      reviewDecision
      baseRef {
        branchProtectionRule {
          requiredStatusChecks {
            context
          }
        }
        rules(first: 100, after: $ruleCursor) {
          pageInfo {
            endCursor
            hasNextPage
          }
          nodes {
            type
            repositoryRuleset {
              enforcement
            }
            parameters {
              ... on RequiredStatusChecksParameters {
                requiredStatusChecks {
                  context
                }
              }
              ... on WorkflowsParameters {
                workflows {
                  path
                }
              }
            }
          }
        }
      }
      commits(last: 1) {
        nodes {
          commit {
            statusCheckRollup {
              contexts(first: 100, after: $contextCursor) {
                pageInfo {
                  endCursor
                  hasNextPage
                }
                nodes {
                  __typename
                  ... on CheckRun {
                    name
                    status
                    conclusion
                    detailsUrl
                  }
                  ... on StatusContext {
                    context
                    state
                    targetUrl
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}' -F owner='<repo_owner>' -F name='<repo_name>' -F number='<pr_number>'
henriklundstrom commented 2 weeks ago

@ajax-ryzhyi-r Okay I see, it happens not on rerun, but on re-trigger on the same commit. I was indeed able to reproduce it by closing and reopening a pull request without pushing any new commit.