
PR number missing from E2E testing log when a failure is reported #329

Closed. komish closed this issue 3 months ago.

komish commented 5 months ago

I noticed in a recent smoke test run that the new metadata we're adding to log lines seems to indicate we had an issue finding the expected pull request number for a given workflow run. Log is here.

The full snippet:

Feature: Chart tarball submission with report # tests/functional/behave_features/HC-08_report_and_chart_tar.feature:1
  Partners, redhat and community users can publish their chart by submitting
  error-free chart in tarball format with a report.
  @partners @smoke @full
  Scenario Outline: [HC-08-001] A partner or redhat associate submits an error-free chart tarball with report -- @1.1           # tests/functional/behave_features/HC-08_report_and_chart_tar.feature:16
    Given the vendor "hashicorp" has a valid identity as "partners"                                                             # tests/functional/behave_features/steps/implementation.py:6
    And an error-free chart tarball used in "tests/data/vault-0.17.0.tgz" and report in "tests/data/common/partner/report.yaml" # tests/functional/behave_features/steps/implementation.py:67
    When the user sends a pull request with the chart and report                                                                # tests/functional/behave_features/steps/implementation.py:268
    Then the user sees the pull request is merged                                                                               # tests/functional/behave_features/steps/implementation.py:281
      Assertion Failed: Workflow for the submitted PR (#None) did not run.

    And the index.yaml file is updated with an entry for the submitted chart                                                    # None
    And a release is published with corresponding report and chart tarball                                                      # None

I didn't dig into this, but it seems we either have some kind of race condition here, or we're looking up the pull request for the workflow, not finding it, and not doing proper error checking when that happens.
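
For illustration, "proper error checking" could mean raising an explicit error when no pull request can be found for a workflow run, rather than returning None and letting a later step assert on "#None". A minimal sketch with hypothetical names, not taken from the test code:

    # Hypothetical helper: look up the pull request tied to a workflow run and
    # fail loudly if there is none, instead of silently returning None.
    def pr_number_for_run(run):
        pull_requests = run.get("pull_requests") or []
        if not pull_requests:
            raise RuntimeError(
                f"No pull request associated with workflow run {run.get('id')}"
            )
        return pull_requests[0]["number"]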

Creating this issue to make sure it doesn't get lost.

mgoerens commented 3 months ago

Note that we already have the retry decorator around the get_run_id function.

Indeed, there is roughly a 30-second delay before the test fails; see for example:

2024-05-07T04:08:46.5422908Z     Then the user sees the pull request is merged                                                                            # tests/functional/behave_features/steps/implementation.py:281
2024-05-07T04:09:17.3555539Z       Assertion Failed: Workflow for the submitted PR (#None) did not run.
mgoerens commented 3 months ago

Actually, pr_number is provided to get_run_id as an argument rather than populated within the function, so retrying won't solve the issue. The race condition occurs somewhere else.
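
To see why, a retry decorator of this general shape (a minimal sketch with illustrative timings, not the project's actual decorator) simply re-invokes the wrapped function with the same arguments on every attempt, so a pr_number of None stays None no matter how many retries run:

    import time
    from functools import wraps

    def retry(attempts=6, delay_seconds=5):
        """Illustrative retry decorator: re-calls the wrapped function with the
        same arguments, so a bad input such as pr_number=None is bad every time."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                last_exc = None
                for _ in range(attempts):
                    try:
                        return func(*args, **kwargs)
                    except Exception as exc:
                        last_exc = exc
                        time.sleep(delay_seconds)
                raise last_exc
            return wrapper
        return decorator

With illustrative values like 6 attempts and a 5-second delay, the roughly 30-second gap seen in the log above would be spent retrying the same failing lookup with the same None argument.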

mgoerens commented 3 months ago

Edit: the following debugging, while true, is irrelevant.

The issue is that pr_number can be None, which is valid. But in that case we should fall back to secrets.pr_number instead, as is done in many other places in the codebase.
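
A rough sketch of that fix, with hypothetical names (the actual call site in the step implementation may look different):

    # Hypothetical: resolve the effective PR number once, mirroring the fallback
    # get_pr already applies, before handing it to helpers such as get_run_id.
    def effective_pr_number(secrets, pr_number=None):
        return secrets.pr_number if pr_number is None else pr_number

    # hypothetical call shape:
    # run_id = get_run_id(secrets, effective_pr_number(secrets, pr_number))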

Debugging:

So get_run_id is called with pr_number = None.

get_run_id then calls get_pr, which falls back to using secrets.pr_number if pr_number is None (see https://github.com/openshift-helm-charts/development/blob/main/tests/functional/behave_features/common/utils/github.py#L79):

    pr_number = secrets.pr_number if pr_number is None else pr_number

Since pr_number is still None after that, the fallback value secrets.pr_number must itself be None.
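
Given that, a guard at the point where secrets.pr_number is expected to have been set would surface the problem much earlier than the "#None" assertion. A minimal sketch, with hypothetical placement:

    # Hypothetical guard: fail with an explicit message as soon as neither an
    # explicit pr_number nor secrets.pr_number is available, instead of letting
    # the scenario fail later with "Workflow for the submitted PR (#None) did not run."
    if pr_number is None and getattr(secrets, "pr_number", None) is None:
        raise AssertionError(
            "No pull request number available: secrets.pr_number was never populated"
        )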