"Can't add a job to a build that is already closed."

lemurheavy / coveralls-public

The public issue tracker for coveralls.io

http://coveralls.io

124 stars 7 forks source link

"Can't add a job to a build that is already closed." #1716

Open thibaultcha opened 1 year ago

thibaultcha commented 1 year ago

Run coverallsapp/github-action@master
Using lcov file: ./lcov.info
Error: Bad response: 422 {"message":"Can't add a job to a build that is already closed. Build 5217978935 is closed. See docs.coveralls.io/parallel-builds","error":true}

What does this mean? All I did was restart a GitHub Action run because some other jobs accidentally failed. Now all jobs pass but for some obscure reason, coveralls refuses the coverage data. Why? It's never, ever been a problem before to restart jobs, coveralls just typically absorb the new data into a new job. So what triggers this?

afinetooth commented 1 year ago

Hi @thibaultcha, we made a change on Apr 27 that prevents adding new coverage reports to closed builds. Practically speaking, this means that, if you have a parallel build with several jobs, let's say 3, you must send all 3 coverage reports (jobs) before sending the request to the (close) parallel build webhook, telling Coveralls to close the build.

If you happen to make that request to the parallel build webhook to close the build before another job is sent, or, say, between jobs 2 and 3.

This has cropped up for some users who had their CI's configured in a way that they were (inadvertently) closing the build in each of their parallel jobs.

Can you help me understand your use case here? Any insight into why your build might have become closed before you send an (additional) job with another coverage report?

One quick fix here might be upgrading to the latest version of the Github Action. v2+ is a major update, and uses an entirely new integration under the hood, our Universal Coverage Reporter, but it should work fine with your setup as far as I can tell.

afinetooth commented 1 year ago

P.S. @thibaultcha, it looks like that build above was deleted, so I can't find your project from that. If you'd like to share the Coveralls URL for your project, I'll be happy to look into it more closely. If your project is private, or sensitive, feel free to email us at support@coveralls.io and just mention this issue.

thibaultcha commented 1 year ago

Can you help me understand your use case here? Any insight into why your build might have become closed before you send an (additional) job with another coverage report?

As far as I can remember I think I just restarted a GHA build because of some other failures (not in unit tests). Our unit tests then re-computed coverage data and sent it back to coveralls.

The project is private indeed, we might have to get in touch with support. I just experienced another issue by having:

6/7 unit test jobs succeed
Restarted the failing unit job (which re-computed coverage), and which failed at the "finished" step:

Sending to coveralls {
  repo_token: '***',
  carryforward: 'unit-ngx_1.25.0-wasmer-ssl-debug-no_hup,unit-ngx_1.25.0-wasmer-ssl-no_debug-no_hup,unit-ngx_1.25.0-wasmtime-ssl-debug-hup,unit-ngx_1.25.0-v8-ssl-debug-no_hup,unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup,unit-ngx_1.21.6-wasmer-ssl-debug-no_hup,unit-ngx_1.25.0-wasmer-no_ssl-no_debug-no_hup',
  repo_name: 'Kong/ngx_wasm_module',
  payload: { build_num: '5250257722', status: 'done' }
}
/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:40831
                        throw new Error('Parallel webhook error:' + err + ', ' + JSON.stringify(data));
                        ^

Error: Parallel webhook error:Error: {"status":500,"error":"Internal Server Error"}, {"status":500,"error":"Internal Server Error"}
    at Request._callback (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:40831:31)
    at Request.self.callback (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:24734:22)
    at Request.emit (node:events:527:28)
    at Request.<anonymous> (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:25703:10)
    at Request.emit (node:events:527:28)
    at IncomingMessage.<anonymous> (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:25625:12)
    at Object.onceWrapper (node:events:641:28)
    at IncomingMessage.emit (node:events:539:35)
    at endReadableNT (node:internal/streams/readable:1345:12)
    at processTicksAndRejections (node:internal/process/task_queues:83:21)

Our finished step looks like so:

  finish:
    needs: tests
    runs-on: ubuntu-latest
    steps:
      - name: Coveralls Finished
        uses: coverallsapp/github-action@master
        with:
          github-token: ${{ secrets.github_token }}
          parallel-finished: true
          carryforward: "unit-ngx_1.25.0-wasmer-ssl-debug-no_hup,unit-ngx_1.25.0-wasmer-ssl-no_debug-no_hup,unit-ngx_1.25.0-wasmtime-ssl-debug-hup,unit-ngx_1.25.0-v8-ssl-debug-no_hup,unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup,unit-ngx_1.21.6-wasmer-ssl-debug-no_hup,unit-ngx_1.25.0-wasmer-no_ssl-no_debug-no_hup"

afinetooth commented 1 year ago

Ok, thanks. Looking into it here.

afinetooth commented 1 year ago

Hi @thibaultcha. We think the more recent issue was infrastructure-related and has been fixed as described here.

Please let me know if that's not the case for you. I checked the private accounts associated with your Coveralls user account and see builds from today for each, so I'm hopeful we've addressed the issue.

Update

Still checking into the original issue for you. Trying to determine whether this is the correct behavior for a re-run from CI. Does not seem like it, but will feedback with further insights / updates.

afinetooth commented 1 year ago

Hi @thibaultcha.

tl;dr: An update and a workaround.

Details: An update on the original issue:

Your issue has prompted an internal discussion about whether, and how, we could walk back this "closed" state of a parallel build. To clarify, the expected behavior is, indeed, currently, what you experienced. If the build was closed, you will not be able to submit additional jobs to it, or re-run it in CI and have that generate a new build.

We think the latter case (re-running the build in CI) should probably be allowed, but we are trying to determine what information would allow us to safely know we can replace the existing build (with identical sha) and whether we need to hold it until the new build completes, and roll back if it does not, etc. Concerns like that.

So, I'm afraid there will be some more time required for the internal discussion and any changes that come out of it. But I will keep you updated here on that.

In the meantime, there is a workaround here, which is simply to delete the old build, at which point you should be able to re-run the CI build without issue.

Here's where to delete a build in the UI. Example: https://coveralls.io/builds/33054356

Let me know if you have any issues with that.

thibaultcha commented 1 year ago

Thanks for the update! I guess I don't really understand why this changed? In the past, we've been able to restart as many builds as we want and just get a new coverage stats from our coverage tools (whether it be coveralls or codecov). Can this not be done anymore? The reason why builds in the same PR are restarted is because once we get coverage data from our first CI run, typically our workflow involves adding new tests to increase coverage and force-push the branch to the PR to get a new coverage status, and iterate until all new lines in the PR are covered as desired. Is this workflow incompatible with Coveralls? Or do we now need to manually delete the CI build to run the PR's CI again?

thibaultcha commented 1 year ago

Well, I just tried it on a branch I have been working on and got the HTTP 422 again, so it seems to be the case that this changed recently because until now, we never had an issue restarting builds :/

afinetooth commented 1 year ago

Hi @thibaultcha. Without getting into the minutiae, this change relates to addressing a different set of issues, and is also part of a recent refactor, which now just means that we can't quickly re-wind the functionality to support CI re-runs at this moment.

That said, we are actively discussing this and I am trying to determine an ETA on that that I can share with you ASAP.

In the meantime, I have a doubt that I'd like to share and get some more clarification on your use case, because, aside from the workaround I offered, which, I admit, is inefficient, I'm thinking about a more standard use case we see where users iterate on PRs:

So the "doubt": I don't understand why iterating on your PRs would require you to re-run old CI builds. I say this because the use case I'm familiar with, for doing that, is to configure CI to build pull_request commits on your PR branches (branches that target your default branch). What happens then, during iteration, in steps, is as follows:

User commits some (first) changes to a (new) PR branch - If CI is configured to build push builds everywhere, then CI builds that commit, and send Coveralls a coverage report for the first commit on "New PR Branch". If CI is only configured to build push commits on the default branch, then no build and no coverage report for this commit.
User opens a PR on the new PR branch - If CI is configured to build pull_request commits on PR branches (recommended), then CI generates a new pull_request build, which sends a coverage report to Coveralls, and we generate one of our pull_request builds (new build).
User makes changes to PR - Based on feedback, including coverage reports, user makes changes to the PR in the form of one or more push commits to "New PR Branch" and, again, depending on CI config, CI will, or won't, build those commits and send coverage reports to Coveralls. (Generally speaking, for cleanliness and simplicity, we recommend not building push commits on your PR branches, just pull_request commits.)
New pull_request builds are generated by CI - So, if CI is configured to build pull_request commits on your PR branches, then a new pull_request build will be generated after each change (push commit) to "New PR Branch." These (new) pull_request builds send their coverage reports to Coverall, which generates new Coveralls builds for each CI build.
And so on...

So, following the model above, it's my feeling that you should be able to iterate on PRs by generating new pull_request builds along the way, without needing to rebuild any old ones.

Thanks in advance for helping me understand your use case.

nebolsin commented 1 year ago

@thibaultcha here's a workaround which should help with the issue you're experiencing. Add the following block to your Github Actions workflow (I recommend placing it at top level right above the jobs: section, so it applies to both normal coverage upload and build finalisation job):

env:
  COVERALLS_SERVICE_NUMBER: ${{ github.run_id }}-${{ github.run_attempt }}

I would also suggest upgrading Coveralls Action to v2. You're currently using master, which is v1 (stable).

This should unblock your workflow while we're analyzing the use-case and working on long-term solution.

afinetooth commented 1 year ago

Thanks @nebolsin. @thibaultcha, pls let us know how it goes.

thibaultcha commented 1 year ago

Thank you both for the help and suggestions! My whole team is out until Tuesday but I will try it out, maybe sooner if I get a chance to.

thibaultcha commented 1 year ago

Trying the above suggestion in the repo currently.

@afinetooth Perhaps the only difference between the workflow you elaborated and ours, is that we force-push PRs quite often. Clean git history is paramount to us and when we increase coverage, we do so in the same commit (hence, force-pushed) rather than pushing new commits. This is paramount to us because it keeps reviews simple, git history simple, workflow simple, etc... Another possibility is we can restart a PR run without any git changes: no new commit, no force-push, just restarting either because one of the jobs unexpectedly failed or simply because we feel like it, we want to test CI is super robust or whatever else reason. In the past, restarting PR jobs with or without git changes has never been an issue with either codecov or coveralls (we recently switched from codecov to coveralls and didn't experience any issues so far - except perhaps no bot comment on missed diff lines).

thibaultcha commented 1 year ago

New error from the build on the PR where I added the above env variable:

Sending to coveralls {
  repo_token: '***',
  carryforward: 'unit-ngx_1.25.0-wasmer-ssl-debug-no_hup,unit-ngx_1.25.0-wasmer-ssl-no_debug-no_hup,unit-ngx_1.25.0-wasmtime-ssl-debug-hup,unit-ngx_1.25.0-v8-ssl-debug-no_hup,unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup,unit-ngx_1.21.6-wasmer-ssl-debug-no_hup,unit-ngx_1.25.0-wasmer-no_ssl-no_debug-no_hup',
  repo_name: 'Kong/ngx_wasm_module',
  payload: { build_num: '5335826735', status: 'done' }
}
/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:40831
                        throw new Error('Parallel webhook error:' + err + ', ' + JSON.stringify(data));
                        ^

Error: Parallel webhook error:Error: {"error":"No build matching CI build number 5335826735 found"}, {"error":"No build matching CI build number 5335826735 found"}
    at Request._callback (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:40831:31)
    at Request.self.callback (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:24734:22)
    at Request.emit (node:events:527:28)
    at Request.<anonymous> (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:25703:10)
    at Request.emit (node:events:527:28)
    at IncomingMessage.<anonymous> (/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:25625:12)
    at Object.onceWrapper (node:events:641:28)
    at IncomingMessage.emit (node:events:539:35)
    at endReadableNT (node:internal/streams/readable:1345:12)

Most unit tests pass but 1 job which failed (we have some tests relying on external I/O that can sometimes fail). When I restarted the failing job, it passed but the finish step failed with the above. Maybe merging the suggested fix would have avoided this?

afinetooth commented 1 year ago

Hi @thibaultcha. Hmm. That error means one of two things:

The wrong build ID was passed in the (close) parallel build webhook; or, more likely,
That build was deleted by the time the webhook was sent.

I've asked an engineer to have a look.

afinetooth commented 1 year ago

@thibaultcha one quick question from this side:

Can you please switch to @v2 of the Coveralls Github Action? We believe that, if you're still on @master, switching should fix that issue.

thibaultcha commented 1 year ago

Yesterday I applied the env fix without bumping to @v2 and got Error: Parallel webhook error:Error: {"error":"No build matching CI build number 5341291030 found"}, {"error":"No build matching CI build number 5341291030 found"} on a push to main...

This afternoon I was working on a branch to fix this:

At first I had the env fix and bumped to @v2. CI passed and coveralls produced a new build. I restarted a unit test job, to see what would happen, and got the same error:

---
Error: HTTP status code 422: Unprocessable Entity
Response: {"message":"Can't add a job to a build that is already closed. Build 5349608376 is closed. See docs.coveralls.io/parallel-builds","error":true}
---

I removed the env fix, kept @v2 and force-pushed the PR branch, and got the other error:


/home/runner/work/_actions/coverallsapp/github-action/master/dist/index.js:40831
                    throw new Error('Parallel webhook error:' + err + ', ' + JSON.stringify(data));
                    ^

Error: Parallel webhook error:Error: {"error":"No build matching CI build number 5350504813 found"}, {"error":"No build matching CI build number 5350504813 found"}



Now, I will try and manually delete all the past builds of this PR branch and run our CI workflow again, at least to see if non-forced-pushed branches works again.

Could something else be missing to make the `env` fix work maybe? I did add it as a top-level `env` specified in our `ci.yml` workflow.

thibaultcha commented 1 year ago

Now, I will try and manually delete all the past builds of this PR branch and run our CI workflow again, at least to see if non-forced-pushed branches works again.

After doing so and restarting the build (@v2, no env fix), the finish job again failed with:

Error: Parallel webhook error:Error: {"error":"No build matching CI build number 5350504813 found"}, {"error":"No build matching CI build number 5350504813 found"}

thibaultcha commented 1 year ago

Currently this has stalled our workflow and ability to make progress in our work, we have half a dozen PRs that are stalled because they can't produce a coverage report and fail with either error, both on @master, @v2, with or without the env fix, and even after manually deleting existing coverage builds. What can we do please?

afinetooth commented 1 year ago

@thibaultcha Yes, those results above for your different attempts match our expectations (with one exception; see my next comment below, pls.)

We have a PR in review right now that should support your PR workflow as we understand it.

I don't have an ETA right this moment, but it's prioritized so I'm hopeful it will be in place and clear your log jam by Mon. I'll keep you updated here. But in the meantime, of course, feel free to reach out to us at support@coveralls.io.

afinetooth commented 1 year ago

@thibaultcha one update though. We were confused about why this attempt did not work for you.

We suspect that you may have only updated the action version (@v2) on your coverage report step, and did not update it on your (close) parallel build / webhook step.

We say that because, from a build we found that we think matches your attempt, that your (close) build step uses an unmodified value for service_number (aka. COVERALLS_SERVICE_NUMBER env var), which is to say, without -<attempt> at the end.

This would cause the build to fail in the way it did above, because the build now has a service number like 12345-1, but the (close) build webhook reports completion for 12345.

The point is that, if you can verify that's the case and make sure to update the action version to @v2 in both/all places, we think that will give you a workaround that will work until we can deliver this change to support your PR workflow.

thibaultcha commented 1 year ago

We suspect that you may have only updated the action version (@v2) on your coverage report step, and did not update it on your (close) parallel build / webhook step.

Yes you were right! That was my bad. One of my coworkers noticed it, updated it and we since had a rebased PR that successfully built its coverage job! That is with @v2 in both places and with the env fix indeed. Great, thank you!

Is the PR a way that will allow for this without setting the env value? Not in the least of a rush, only curious.

Feel free to close this issue, the fix is reasonable and we can resume merging, all good on our end. Thank you both for the help once again!

afinetooth commented 1 year ago

@thibaultcha Excellent. Glad to hear that.

Regarding the PR, which is in review and I hope will be deployed this week or next: As I understand it, it will automatically track github.run_attempt so that you don't need to specify the env. That said, I'm not always right, so I would appreciate you testing it and letting us know if that doesn't work without the env, because I do believe that is the intent.

I will close this for now, but I'll make a note to update this issue when the PR is deployed, and please just reopen if you want to check on status.

thibaultcha commented 1 year ago

After a few days of use in our main branch, I'm reporting back with some updates on how I've seen this workflow behave, and wondering if there is still a missing piece to this puzzle or if the upcoming patch will address this; hoping to get your eyes on this again @afinetooth 🙏🏼

Earlier today we had a CI build in main that experienced a failed job: 61046409. It remains pending as 7 out of 8 of the parallel jobs for this build successfully passed and uploaded coverage, but the last one failed an I/O test and thus didn't report anything to Coveralls. We restarted it, hoping it'd succeed and send the 8th and last coverage report.

It then failed once again but this time during the Coveralls upload step (this was around the GitHub CI downtime earlier today):

Run coverallsapp/github-action@v2
Run mkdir -p ~/bin/
Error: Process completed with exit code 28.

Some time later once the downtime situation resolved itself, I once again restarted the failed job which succeeded this time, except finished with:

Using coverage file: ./lcov.info
⭐️ Running in parallel mode. You must call the webhook after all jobs finish: `coveralls done --build-number 5414007445-3`
  ·job_flag: unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup-dynamic
🚀 Posting coverage data to https://coveralls.io/api/v1/jobs
---
✅ API Response: {"message":"Coverage for parallel build uploaded","url":"https://coveralls.io/builds/61050538"}

In the above output of the failed job (attempt 3), the --build-number 5414007445-3 uploads the job to 61050538 instead of the existing 61046409 build. All other 7 jobs that succeeded on their first attempt had --build-number 5414007445-1 and uploaded the result to 61046409, the original build for this commit.

I double-checked that we have:

env:
  COVERALLS_SERVICE_NUMBER: ${{ github.run_id }}-${{ github.run_attempt }}

at the root context of our workflow, and only use coverallsapp/github-action@v2.

Could something be missing to merge parallel jobs that are individually restarted from within the same build? I am under the impression that "individually restarted jobs" might be the key difference here between why this worked in force-pushed PRs but isn't working in this case, but I could be missing something more obvious.

thibaultcha commented 1 year ago

It has been happening again this afternoon. As we have been moving GitHub Actions workflows to "reusable jobs", I had been force-pushing refinements to a branch attached to a PR but eventually as the code was ready for merging, Coveralls started failing with the same error:

📄 Using coverage file: ./lcov.info
⭐️ Running in parallel mode. You must call the webhook after all jobs finish: `coveralls done --build-number 5470721728-1`
  ·job_flag: unit-ngx_1.21.6-wasmer-ssl-debug-no_hup-static
🚀 Posting coverage data to https://coveralls.io/api/v1/jobs
---
Error: HTTP status code 422: Unprocessable Entity
Response: {"message":"Can't add a job to a build that is already closed. Build 5470721728-1 is closed. See docs.coveralls.io/parallel-builds","error":true}
---
🚨 Oops! It looks like your request was not processible by Coveralls.

@afinetooth @nebolsin I cannot reopen this issue myself but would love some eyes on this and on the above problem as well, cheers

afinetooth commented 1 year ago

Hi @thibaultcha . Sorry for the delay. Catching up and will update this message as I construct my reply....

OK, regarding this issue from last week:

I double-checked that we have:
env:
  COVERALLS_SERVICE_NUMBER: ${{ github.run_id }}-${{ github.run_attempt }}
at the root context of our workflow, and only use coverallsapp/github-action@v2.

Could something be missing to merge parallel jobs that are individually restarted from within the same build? I am under the impression that "individually restarted jobs" might be the key difference here between why this worked in force-pushed PRs but isn't working in this case, but I could be missing something more obvious.

I'm thinking that, perhaps, if you restarted the failed job after we pushed our PR and that, beyond not requiring the -${{ github.run_attempt }} suffix on build number, it actually became important to remove it. At least, by your description of the behavior (the state of the original build has changed a bit since last week), it sounds like Coveralls interpreted the re-run job as a new build.

Take my interpretation with a grain of salt right now, please. I have submitted this to engineering and will get a more definitive answer for you on this one.

Regarding the issue from yesterday afternoon: I'm not sure, but I think that, at this point, Coveralls is expecting a ${{ github.run_attempt }} value greater than 1, to understand it should re-open the build.

I have included this in my message to engineering, but, in the meantime, can you help me understand if / verify that you were re-running a job here, in the context same build / run-attempt? I'm just not super familiar with the use case of "reusable jobs" as you referred to it here. Can you point me to a doc or something?

Thanks. Will get back to you asap.

thibaultcha commented 1 year ago

I'm just not super familiar with the use case of "reusable jobs" as you referred to it here. Can you point me to a doc or something?

Yes, reusable workflows is how GitHub calls it. So now we have ci.yml with:

    uses: ./.github/workflows/job-unit-tests.yml
    with:
      os: ${{ matrix.os }}
      cc: ${{ matrix.cc }}
      ngx: ${{ matrix.ngx }}
      openresty: ${{ matrix.openresty }}
      runtime: ${{ matrix.runtime }}
      wasmtime: ${{ matrix.wasmtime }}
      wasmer: ${{ matrix.wasmer }}
      v8: ${{ matrix.v8 }}
      debug: ${{ matrix.debug }}
      ssl: ${{ matrix.ssl }}
      hup: ${{ matrix.hup }}
      module_type: ${{ matrix.module_type }}
      coverage: true
      carryforward: 'unit-ngx_1.25.1-wasmer-ssl-debug-no_hup-static,unit-ngx_1.25.1-wasmer-ssl-no_debug-no_hup-static,unit-ngx_1.25.1-wasmtime-ssl-debug-hup-static,unit-ngx_1.25.1-v8-ssl-debug-no_hup-static,unit-ngx_1.21.6-wasmer-ssl-debug-no_hup-static,unit-ngx_1.25.1-wasmer-no_ssl-no_debug-no_hup-static,unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup-static,unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup-dynamic'
    secrets: inherit

And job-unit-tests.yml with:

name: Unit tests

on:
      # ...
      carryforward:
        required: false
        type: string

defaults:
  run:
    shell: bash

env:
  # ...
  COVERALLS_SERVICE_NUMBER: ${{ github.run_id }}-${{ github.run_attempt }}

jobs:
  tests:
    name: 'Unit tests'
    runs-on: ${{ inputs.os }}
    timeout-minutes: 40
    outputs:
      coveralls_name: ${{ steps.lcov.outputs.name }}
    steps:
      # ...
      - run: make setup
      - run: make
      - run: make test
      - name: Run lcov
        id: lcov
        if: ${{ !env.ACT && inputs.cc == 'gcc-9' && inputs.coverage }}
        run: |
          lcov --capture --directory work/buildroot --output-file lcov.info
          lcov --extract lcov.info "*/ngx_wasm_module/src/*" --output-file lcov.info

          name="unit"
          if [ -n "${{ inputs.openresty }}" ]; then
            name="$name-openresty_${{ inputs.openresty }}"
          else
            name="$name-ngx_${{ inputs.ngx }}"
          fi
          name="$name-${{ inputs.runtime }}"
          name="$name-${{ inputs.ssl }}"
          name="$name-${{ inputs.debug }}"
          name="$name-${{ inputs.hup }}"
          name="$name-${{ inputs.module_type != '' && inputs.module_type || 'static' }}"
          echo "name=$name" >> $GITHUB_OUTPUT
      - name: Coveralls Upload
        if: ${{ !env.ACT && inputs.cc == 'gcc-9' && inputs.coverage }}
        uses: coverallsapp/github-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          flag-name: ${{ steps.lcov.outputs.name }}
          path-to-lcov: './lcov.info'
          parallel: true

  coverage:
    if: ${{ inputs.coverage }}
    needs: tests
    runs-on: ubuntu-latest
    steps:
      - name: Coveralls Finished
        if: ${{ !env.ACT && inputs.cc == 'gcc-9' }}
        uses: coverallsapp/github-action@v2
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          parallel-finished: true
          carryforward: ${{ inputs.carryforward }}

However it seems that with this new GHA setup, even the 1st run/attempt of a PR in which all tests pass, it fails during coverage as Coveralls responds with the HTTP 422 error after the first job upload.

First job that finished its unit tests and uploads the report:

"Coveralls upload" step:

 Running in parallel mode. You must call the webhook after all jobs finish: `coveralls done --build-number 5470811121-1`
  ·job_flag: unit-ngx_1.25.1-wasmer-no_ssl-no_debug-no_hup-static
🚀 Posting coverage data to https://coveralls.io/api/v1/jobs
---
✅ API Response: {"message":"Coverage for parallel build uploaded","url":"https://coveralls.io/builds/61163701"}

And "coverage" step (aka "finish"):

⭐️ Calling parallel done webhook: https://coveralls.io/webhook
---
✅ API Response: {"done":true,"url":"https://coveralls.io/builds/61163701","jobs":4}

All other unit tests jobs after that one error-out in "Coveralls upload" with:

Running in parallel mode. You must call the webhook after all jobs finish: `coveralls done --build-number 5470811121-1`
  ·job_flag: unit-ngx_1.25.1-wasmtime-ssl-debug-hup-static
🚀 Posting coverage data to https://coveralls.io/api/v1/jobs
---
Error: HTTP status code 422: Unprocessable Entity
Response: {"message":"Can't add a job to a build that is already closed. Build 5470811121-1 is closed. See docs.coveralls.io/parallel-builds","error":true}
---

thibaultcha commented 1 year ago

Looking at this again, I'm starting to suspect "coverage" (or the "Coveralls Finish" step) might not be "reusable" and should be extracted out of the reusable job file and be put in ci.yml instead. Does that sound better? I'm trying it now.

afinetooth commented 1 year ago

I'm just not super familiar with the use case of "reusable jobs" as you referred to it here. Can you point me to a doc or something?

Yes, reusable workflows is how GitHub calls it. So now we have ci.yml with [...]

Aha. OK, thank you. I am familiar with reusable workflows. We have an example across these two workflows: here (the reusable workflow) and here (an example of its reuse).

I'm starting to suspect "coverage" (or the "Coveralls Finish" step) might not be "reusable" and should be extracted out of the reusable job file and be put in ci.yml instead. Does that sound better? I'm trying it now.

That makes sense!

thibaultcha commented 1 year ago

That worked! But only after adding:

env:
  COVERALLS_SERVICE_NUMBER: ${{ github.run_id }}-${{ github.run_attempt }}

to both of ci.yml and job-unit-test.yml, top-level each, as it env variables are not inherited (or propagated as per the docs) to called workflows.

Now, this build succeeded in https://coveralls.io/builds/61190187 which corresponds to build 2135 on our end (GHA build number). I am going to restart a unit test job in 2135 as soon as I can, and see what happens with this configuration.

thibaultcha commented 1 year ago

Now, restarting a no_ssl test job (the fastest), which succeeded again caused Coveralls Upload to succeed with a different build number:

 Using coverage file: ./lcov.info
⭐️ Running in parallel mode. You must call the webhook after all jobs finish: `coveralls done --build-number 5481106357-2`
  ·job_flag: unit-ngx_1.25.1-wasmer-no_ssl-no_debug-no_hup-static
🚀 Posting coverage data to https://coveralls.io/api/v1/jobs
---
✅ API Response: {"message":"Coverage for parallel build uploaded","url":"https://coveralls.io/builds/61191165"}
- 💛, Coveralls

The coverage step seems to have ran again and only once as expected. However, it finished with:

⭐️ Calling parallel done webhook: https://coveralls.io/webhook
---
✅ API Response: {"done":true,"url":"https://coveralls.io/builds/61191165","jobs":4}
- 💛, Coveralls

In which the number of jobs added to this build is 4. That part is surprising considering I only restarted 1 job out of 8. Upon looking closer at this new 61191165 build, it seems to be correctly labelled 5481106357-2 (2nd run of 5481106357 indeed), but I am unsure as to where these 3 other jobs are taken from. They are labelled 5480494521-1.*, but I would have expected 5481106357-1.* to be used instead (so, jobs from the initial build 61190187). I would expect only 5481106357-2.1 to supersede 5481106357-1.1, then that the build be completed with all other existing 5481106357-1.{2,8} jobs. Does that make sense and are we somehow misconfigured to handle this scenario?

thibaultcha commented 1 year ago

Same thing happened again in the same PR, this time after more updates (unrelated to coveralls, just CI changes on our end), one test was failing, so this build: https://coveralls.io/builds/61193647 never completed (7 out of 8 jobs passed, and since the last one is pending I think the whole build is pending).

I restarted the failed job which had the following Coveralls upload:

📄 Using coverage file: ./lcov.info
⭐️ Running in parallel mode. You must call the webhook after all jobs finish: `coveralls done --build-number 5483005233-2`
  ·job_flag: unit-ngx_1.25.1-wasmer-ssl-no_debug-no_hup-static
🚀 Posting coverage data to https://coveralls.io/api/v1/jobs
---
✅ API Response: {"message":"Coverage for parallel build uploaded","url":"https://coveralls.io/builds/61206122"}
- 💛, Coveralls

And the build at https://coveralls.io/builds/61206122 contains 3 jobs (again, that number seems to always been aleatory):

unit-ngx_1.25.1-wasmer-ssl-no_debug-no_hup-static - 5483005233-2.1 (this is the one I restarted and is expected)
unit-ngx_1.25.1-wasmer-ssl-debug-no_hup-static - 5481822898-1.4 (this one seems unrelated)
unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup-dynamic - 5481822898-1.7 (also seems unrelated and randomly chosen)

I'd expect the same behavior as yesterday evening restarting the no_ssl job (my previous comment): that all 5483005233-1.* job be included along with 5483005233-2.1 being the only one superseding 5483005233-1.1 (so the 2nd run of job 1 supersedes the 1st run of job 1).

This is the 2nd issue which I wanted to reopen this issue for: how do we support restarting individual jobs within the same PR? In other words, the same build on the GHA side, but a fresh Coveralls build (which reuses results of other, non-restarted jobs).

afinetooth commented 1 year ago

@thibaultcha Thanks for the updates.

(BTW, I have re-opened the issue. For some reason I wasn't able to yesterday: I was not seeing the usual "Reopen issue" button alongside the "Submit/Update comment" button.)

Re:

[...] restarting a no_ssl test job (the fastest), which succeeded again caused Coveralls Upload to succeed with a different build number...

Per:

Does that make sense and are we somehow misconfigured to handle this scenario?

Yes, and I would have the same exact expectations. So, to be clear, I would expect:

The new job to be added this this build as 5481106357-2.1;
Would not expect a new build; and
Would not expect any carryforward jobs in the new build, but as a matter of source, since the config has carryforward jobs, it would make sense to see them in a new build, but all seven (7) and not just four (4).

I have added this case to the open ticket with engineering...

Re:

[...] one test was failing, so this build: https://coveralls.io/builds/61193647 never completed (7 out of 8 jobs passed, and since the last one is pending I think the whole build is pending)...

Yes. Currently (after recent changes to optimize the way we process builds), we don't process the jobs for a parallel build until we receive the done (aka. parallel-finish) POST. In the meantime, we store the original uploads, which we are considering adding as a section to the PENDING COMPLETION view.

Per:

I'd expect the same behavior as yesterday evening restarting the no_ssl job (my previous comment): that all 5483005233-1.* job be included along with 5483005233-2.1 being the only one superseding 5483005233-1.1 (so the 2nd run of job 1 supersedes the 1st run of job 1).

Yes, I have the same expectation.

So, we're on the same page. I think engineering will be too.

I don't know what the exact answer will be, but I suspect this should work correctly without the ${{ github.run-attempt }} portion. That said, I don't think that's what's causing these issues.

Questions:

What is the behavior / error message you get when you exclude -${{ github.run-attempt }} from all instances where it appears in the CI config?
Re: your restatement of the supportable use case below, I'm confused by this part: "In other words, the same build on the GHA side, but a fresh Coveralls build (which reuses results of other, non-restarted jobs)." because I would expect that you'd want to see an updated Coveralls build, rather than new/fresh one. Can you help me by explaining a bit more why you think there should be a second build for this use case? (Or am I reading you wrong and you just mean a second version of the original Coveralls build?) Related to this question, would it be correct to restate the supportable use case, or story, as just, "How do we support restarting individual jobs within the same PR build?" and have a few options for how we would support that, or do you feel there's a good reason for something prescribed, like "two (successive, related) Coveralls builds." (I ask this because I think it's one of our goals to maintain only one build per commit sha.)

Use case:

[H]ow do we support restarting individual jobs within the same PR [build]? In other words, the same build on the GHA side, but a fresh Coveralls build (which reuses results of other, non-restarted jobs).

thibaultcha commented 1 year ago

Yes, and I would have the same exact expectations. So, to be clear, I would expect:

Yes, I think these are my expectations as well, assuming I understood correctly I believe we are on the same page 👍🏼

What is the behavior / error message you get when you exclude -${{ github.run-attempt }} from all instances in the CI config?

About to do this in the same PR, meaning:

Make a change removing the env/run-attempt fix from both ci.yml and job-unit-tests.yml.
Force-push the PR, which will trigger a new GHA build + new Coveralls build. The unit jobs may or may not all succeed on the first attempt (again, this is out of our immediate control as we rely on existing tests from other projects which can fail due to external I/O). a. If one or more jobs failed, I will restart them and see what happens on the Coveralls side. b. If no jobs failed, I will also restart one of the jobs and see.

afinetooth commented 1 year ago

@thibaultcha FYI, I just finished my comment above. I added some more to it after your post above.

Considering something you said in your post though, about your next attempt:

Force-push the PR, which will trigger a new GHA build + new Coveralls build.

Per my comment:

I think it's one of our goals to maintain only one build per commit sha.

This scenario does seem to be a case where you could end up with more than one CI build (and build number) for a given commit sha, so it may be more correct to say that our goal is to maintain one Coveralls build for each CI build.

thibaultcha commented 1 year ago

2. I'm confused by this part: "In other words, the same build on the GHA side, but a fresh Coveralls build (which reuses results of other, non-restarted jobs)." because I would expect that you'd want to see an updated Coveralls build, rather than new/fresh one.

It'd be totally fine if it were the same, updated Coveralls build instead of a new one for sure. I only mention "the new build" because that is the behavior I was seeing (i.e. the new builds with 3/4 jobs, all of them but one being unrelated).

2. would it be correct to restate the supportable use case, or story, as just, "How do we support restarting individual jobs within the same PR build?"

Yes, that's is how I would phrase it and if there is only one build which gets constantly refreshed with newer jobs, that'd be totally fine by me.

thibaultcha commented 1 year ago

The job without the env fix has finished and all 8 jobs succeded and were uploaded to https://coveralls.io/builds/61209082. As soon as other CI jobs are finished (unrelated to coverage), I'll restart the no_ssl unit job.

thibaultcha commented 1 year ago

Ok I just got a chance to do it. Interesting, it added a 9th job to the same build: https://coveralls.io/builds/61209082

afinetooth commented 12 months ago

Yes, it is interesting. Unfortunately, wrong as well. :(

I can't tell what happened from the data, myself, so I've passed it on to an engineer.

I'll update you here when I hear back.

afinetooth commented 11 months ago

Hi @thibaultcha. Apologies for the delay. I have been picking up different pieces of our internal conversation about this over the past few days and will do my best to summarize here. Then, I hope we can pick up the relevant threads, which I think will be items (2) "Use case for Part (B)"; and (3) "Temporary support for Part (B)"

This is a high-level summary of the discovery/analysis/conversation on our side:

Only one of two parts of your workflow is currently supported - Only after your most recently reported issue described here, here, and then here, did we realize that there's a second aspect of your use case that you wished to be supported. I'll refer to the two aspects as (A) and (B) as follows, but the use case we aimed to support in our recent PR on Jun 27 was (A) Adding new jobs to a closed build, and it did not include support for (B) Re-running a successful job and having it replace its previous version (in a closed build).
We'd like to fully understand your use case for "Part (B)" - We understand now that both parts (A) and (B) were relevant to your use case and that, prior to Apr 27, you were finding Coveralls functionality useful to this end; and that, with the changes we made on Apr 27 to prevent updates to closed builds after-the-fact (for reasons pertaining to other use cases), we broke that functionality for you. The issue, from our perspective, was that the behavior you were making use of was technically undocumented behavior we didn't think we were supporting as a feature, which is how it got lost in the updates. Regardless, we'd like to understand your use case for Part (B) so we can support it.
"Temporary" support for Part (B) - We'd like to understand your use case for Part (B) so that we can support it, at least temporarily, with either some further code changes or a workaround. We say "temporarily" because we are in progress on a complete refactor / redesign of our pipeline processing flow and, by definition, any code changes we make now will almost certainly be temporary due to that changing context, and, after that's complete in ~1-3 months, it's unclear whether your solution will have to change or can stay the same. We'll attempt to support it in a permanent way now, but it's just unclear whether the same approach will make sense, or be the best one for you, in the new context. So, to restate with a little more clarity: while we intend to make the support permanent, the solution itself might be temporary.

thibaultcha commented 11 months ago

Hi @afinetooth,

Yes, I can go through more background which I think will clarify things, and elaborate on the workflow. I cannot fathom however that it would be unique in any way.

The core of our use-case isn't so much restarting successful jobs; although I do believe supporting this should be a must, if for no other reason only because it is supported by GitHub Actions themselves. If Coveralls does not support restarting successful jobs however foolish that may sound, could it be argued that Coveralls does not support 100% of GitHub Actions' use-cases? No hard feelings, just a thought. That said we still have a proper reason for restarting successful jobs but I will elaborate on it below.

Let me first elaborate on some background of which the context I believe might help:

Originally we were using Codecov which was giving us direct coverage comments not only in the PR thread, but also as highlights in the PR diff, directly in the GitHub UI.
For this reason, it quickly became convenient to build features, open a PR, wait for the Codecov comments, and polish coverage after based on that feedback integrated in the GitHub UI. We have a lot of jobs to run to get a full picture of the coverage, and this is the fastest way. This workflow was great. (I have been meaning to ask you perhaps whether such a "Coveralls bot" also exists? It's possible I didn't search hard enough)
Eventually, we made the switch to Coveralls for cost-saving purposes since our repository was private and the company is already paying for a (larger) Coveralls account (we were the only Codecov users). After the switch I found there are things I like better on Coveralls, other things I liked better on Codecov (e.g. the comment bot), anyway, that's off-topic.
Everything seemed fine after the switch and the transition happened smoothly; until this issue (the HTTP 422 & rest of this thread) started happening following the internal rewrite you are mentioning.

Now on to the workflow specifically:

We often open PRs as a "draft" feat commit and perhaps a few other tangential ones. As elaborated above, we address coverage based on the resulting coveralls.io feedback.
When addressing coverage, new tests are added to the feat commit (they do not belong in another commit, they are part of the feat change, we only want to merge through GitHub's "Rebase and merge" button). The updated feat commit is then force-pushed to the same PR for a fresh CI build and new coverage update. We repeat this process until satisfied with the commit, which means being able to restart all jobs, or one job, and always get a "latest view of this PR's coverage".
Sometimes, some of our tests fail and coverage is incomplete (it is missing the failed job's coverage). This incomplete coverage can mess up the main branch's CI status because Coveralls reports that coverage dropped. True, but it didn't have to if we could have restarted the failed job(s) and absorbed their coverage results. This is a key missing part today both for PRs or for maintaining the main branch in a healthy state (our occasional test failures are external I/O, some our tests, others in vendor-ed test cases). In the case of one or more failed jobs, we have no choice but to restart them until they all pass. Alas, doing so today creates the situation described earlier in the thread, which wasn't the case before the change you mentioned.
Sometimes, all of our tests pass. Yet, we may be improving previously unreliable tests and we may want to restart one/all successful job(s) on a whim, several times, making sure the "flaky test" is more consistent. This is a real use-case we have used countless times during development, I cannot fathom we would be the only victims of "flaky tests" and that no-one else ever needs to restart their failed/successful jobs.

I hope this provides enough information. If I were to sum it all up I think it would be: "the ability to restart all or individual jobs (failed or not) and have the result override the corresponding job in the existing build's state" (or create a new build altogether, with merged jobs).

afinetooth commented 11 months ago

@thibaultcha This is all very helpful. Please let me process it and share it with the rest of our team. Will get back asap.

thibaultcha commented 11 months ago

Hi @afinetooth, are there any updates on this?

afinetooth commented 11 months ago

Hi @thibaultcha . Not at this moment, I'm sorry. We're trying to get a major set of changes out and have been focused on that for the past several weeks. We are hoping to clear it by week's end and turn to other tasks, but either way I'll try to pull someone to help me specify the changes we could make to address the different aspects of your use case. I personally feel I understand it and have made some recommendations, but I need to bounce them off an engineer.

Also, I'll need to follow that up with a final design before we go into development. But by that time we'll know which of our sprints it's scheduled for.

To set expectations, it could take me until EOW to get back with a draft spec.

afinetooth commented 11 months ago

Hi @thibaultcha,

As I work on the spec here I want to request some clarifications. I believe I fully understand your workflow---thanks for explaining it clearly and fully. At the same time, please correct me if I've made any incorrect assumptions.

Draft spec at bottom.

Now on to the workflow specifically:

We often open PRs as a "draft" feat commit and [...] address coverage based on the resulting coveralls.io feedback.

When addressing coverage, new tests are added to the feat commit [...] The updated feat commit is then force-pushed to the same PR for a fresh CI build and new coverage update. We repeat this process until satisfied with the commit, which means being able to restart all jobs, or one job, and always get a "latest view of this PR's coverage".

Just to say: I'm clear on this.

While it's not the typical workflow I'm personally familiar with, it makes a lot of sense and I can see that it leads to a repo that's much easier-to-read than normal, with succinct changes, where each of those changes maintains a high quality standard in terms of test coverage. So, I completely understand and agree that the workflow should be supported.

Sometimes, some of our tests fail and coverage is incomplete [... which] can mess up the main branch's CI status because Coveralls reports that coverage dropped. [...] This is a key missing part today both for PRs or for maintaining the main branch [...] In the case of one or more failed jobs, we have no choice but to restart them until they all pass. Alas, doing so today creates the situation described earlier in the thread, which wasn't the case before the change you mentioned.

Questions regarding having "no choice but to restart [failed jobs]":

Do you just mean that, when jobs fail in CI (for whatever reason), you are forced, as a matter of "normal procedure," to manually re-run those jobs from within CI?
Or, are you saying that you're looking for Coveralls to do something beyond what it does now, like, say, upon failing to receive a parallel job it received in a previous build, send a message back to CI to re-run the failed job? In other words, is there a specific request I'm missing here, or are you just pointing out "normal procedure"?

Questions regarding "This is a key missing part...":

Is this not resolved for you now? In other words, is this not "Part A" of the use case resolved on Jun 27, where, if a job fails, you can now re-run it and have it be added to the (previously closed) build at Coveralls?
I understand that you cannot re-run a successful job and have it update itself right now, what I've called "Part B." I'm just trying to understand if there's anything about the way our support for "Part A" works right now that you wish to be different? For instance, right now are you forced to re-run an entire build, rather than the one failing job? What is your ideal workflow step for this situation?: a new force-pushed commit creates new pull_request build in CI with a failing job.

Please know I understand that you basically stated this already. I'm just trying to clarify out any assumptions to make sure I'm not missing something about the desired support for your workflow.

Right now, without further input, my understanding is as follows:

Manually re-running failed jobs (or builds) in CI is "normal procedure"---nothing to be done about that; and
Right now, "Part A" (re-running failed jobs in CI to have them added to a (previously closed) Coveralls build) is supported by Coveralls, and there isn't an additional aspect of support that's desired.

Sometimes, all of our tests pass. Yet, we may be improving previously unreliable tests and we may want to restart one/all successful job(s) [...] I cannot fathom we would be the only victims of "flaky tests" and that no-one else ever needs to restart their failed/successful jobs.

Yes, this makes sense to me, as well. Tests may pass, but you may still improve them by, say, fixing some external I/O and want to simply re-run them. This is the "Part B" that we aim to support asap.

I hope this provides enough information. If I were to sum it all up I think it would be: "the ability to restart all or individual jobs (failed or not) and have the result override the corresponding job in the existing build's state" (or create a new build altogether, with merged jobs).

Yes, appreciate that brief description. That's how it's described internally as well.

Draft spec:

Feature: Revert to previous Coveralls behavior, prior to Apr 27.

Description: Allow Coveralls to receive both new coverage reports, and new versions of previously processed coverage reports, from CI, when they are received. In other words, allow closed parallel builds to be re-opened and modified when new coverage reports are received from the same CI build.

From an end user's perspective: Give end users the ability to restart any or all jobs in an existing CI build (failed or not) and let the new coverage reports update the state of the corresponding Coveralls build.

Details:

Identify when a new coverage report (CI job) is received for an existing, closed parallel build by its CI Build ID.
Identify the incoming coverage report as belonging to a given flag and existing, or not, in the current Coveralls build.
When the coverage report is new (does not exist in current Coveralls build) add it to the build, re-process all jobs in build and recalculate coverage for build.
When the coverage report does exist in the current Coveralls build, either as a carried forward report, or not, replace the existing coverage report with the incoming one.

thibaultcha commented 11 months ago

Great update @afinetooth thank you,

Do you just mean that, when jobs fail in CI (for whatever reason), you are forced, as a matter of "normal procedure," to manually re-run those jobs from within CI?

Yes, this is what I mean. I have an example from when we still had the COVERALLS_SERVICE_NUMBER fix (although without doesn't seem to work either, I will elaborate below):

This commit A had some flaky failures (see the commit's GHA run). After restarting the CI (twice, ugh), the unit job passed. Except that it uploaded its coverage to a new, fresh build: https://coveralls.io/builds/61050538 - Now coverage increased, but only because it is an average of only 1 job out of 8.
Then commit B is later added, it is the next addition atop main that triggers a new CI run. This CI run had all 9 unit tests jobs succeed on their first run, and the coverage at https://coveralls.io/builds/61053814 is correct, but penalized because inferior than the previous (over-inflated) coverage.

As a result, us restarting unit test jobs for Commit A in the GitHub Actions UI has an adverse effect onto future commit's "CI status" -> since the coverage "artificially dropped", future CI jobs become "red" as per the Coveralls job.

When I say "no choice but to restart" (ok, it may be a dramatic exaggeration) what I mean is that these two commits can be seen in the history of our main branch here, and you can see they make it appear like main was at times unstable, while it is only due to the artificial coverage drop. When the latest commit is not passing CI, it also reflects badly on the repository's landing page, where it lists the latest commit. We also don't have a choice but to restart CI for failing PR builds as well, since our GitHub settings require a passing CI job before merging, maybe only a half-exaggeration.

Is this not resolved for you now? In other words, is this not "Part A" of the use case resolved on Jun 27, where, if a job fails, you can now re-run it and have it be added to the (previously closed) build at Coveralls?

If I understand correctly, the main difference between "Part A" and "B" is that the first one focuses on failed CI jobs, while "B" is for successful CI jobs. Is that mostly correct? Because as I was writing the workflow comment, I was under the impression that neither were actually working. I then thought it could be due to the COVERALLS_SERVICE_NUMBER fix still being there, so I removed it a couple weeks ago. The issue then happened again last week, when this commit which had some failed jobs was restarted (see its attached GHA run), and produced this Coveralls job upon restarting the failed job: https://coveralls.io/builds/61768808

For instance, right now are you forced to re-run an entire build, rather than the one failing job?

I have not tried this as a remedy for the above, mostly because running all CI jobs would run the risk of encountering more I/O failures in other unit jobs that previously passed...

It goes without saying we have been and continue reducing all instances of potential external I/O failures (because restarting tests in always super annoying and feels wrong), but I don't think we can ever expect any codebase or test suite to be perfect all the time; in the future we expect to incorporate more large swaths of vendor-ed test cases which may include external I/O test cases we have no direct control over, which may be cause instability again for some time again, etc...

And otherwise the draft spec looks good, I believe it completes all use-cases.

Thank you

afinetooth commented 11 months ago

Thanks, @thibaultcha.

Just to keep it short, using smaller excerpts:

As a result, us restarting unit test jobs for Commit A in the GitHub Actions UI has an adverse effect onto future commit's "CI status" -> since the coverage "artificially dropped", future CI jobs become "red" as per the Coveralls job.

Yes. Totally get it. Thanks for the example as well. The problem with that UX is clear. We aim to fix that.

If I understand correctly, the main difference between "Part A" and "B" is that the first one focuses on failed CI jobs, while "B" is for successful CI jobs. Is that mostly correct?

Yes. That's fully correct, to my current knowledge, which is that the COVERALLS_SERVICE_NUMBER fix should no longer be required, and that Coveralls will take into consideration the run-attempt under-the-hood.

[...] as I was writing the workflow comment, I was under the impression that neither were actually working.

OK. Looking at these :eyes::

OK, so this is what I'm seeing in terms of the coverage reports we received for the two jobs present in that build:

Job 1:

{
  "service_branch": "main",
  "service_build_url": "https://github.com/Kong/ngx_wasm_module/actions/runs/5747743858",
  "service_job_id": "tests",
  "service_job_url": "https://github.com/Kong/ngx_wasm_module/actions/runs/5747743858",
  "service_name": "github",
  "service_number": "5747743858-2",
  "service_attempt": "2",
  "commit_sha": "d2da5e439208d11715d434e7315d8e6d8c05a81a",
  "repo_name": "Kong/ngx_wasm_module",
  "repo_token": "ghs_3KurGbAsfmH616C6rtleBfrpvpctQW4PJY39",
  "flag_name": "unit-openresty_1.21.4.1-wasmtime-ssl-debug-no_hup-static",
  "source_files": [ ...],
  "parallel": true,
  "git": {
    "branch": "main",
    "head": {
      "id": "d2da5e439208d11715d434e7315d8e6d8c05a81a",
      "committer_email": "noreply@github.com",
      "committer_name": "GitHub",
      "author_email": "thibaultcha@users.noreply.github.com",
      "author_name": "Thibault Charbonnier",
      "message": "chore(deps) bump OpenSSL to 3.1.2"
    }
  },
  "run_at": "2023-08-03T16:26:53+00:00"
}

Job 2:

{
  "service_branch": "main",
  "service_build_url": "https://github.com/Kong/ngx_wasm_module/actions/runs/5746101886",
  "service_job_id": "tests",
  "service_job_url": "https://github.com/Kong/ngx_wasm_module/actions/runs/5746101886",
  "service_name": "github",
  "service_number": "5746101886-2",
  "service_attempt": "2",
  "commit_sha": "88723c38d29907a53cb9bbc6e9465cf6f3e9c708",
  "repo_name": "Kong/ngx_wasm_module",
  "repo_token": "ghs_jpnwe4B7Tyv4cENCpYpC9i4MMvkokC1HvODk",
  "flag_name": "unit-ngx_1.25.1-v8-ssl-debug-no_hup-static",
  "source_files": [...],
  "parallel": true,
  "git": {
    "branch": "main",
    "head": {
      "id": "88723c38d29907a53cb9bbc6e9465cf6f3e9c708",
      "committer_email": "thibaultcha@users.noreply.github.com",
      "committer_name": "Thibault Charbonnier",
      "author_email": "thibaultcha@me.com",
      "author_name": "Thibault Charbonnier",
      "message": "chore(tests) update trap regexes for new format"
    }
  },
  "run_at": "2023-08-03T06:35:23+00:00"
}

So, those parameters, specifically "service_number": "5747743858-2", suggest that that COVERALLS_SERVICE_NUMBER fix was still being applied there.

Otherwise, Coveralls should have "re-opened" and added 1-2 new jobs to a build with ID 5747743858.

But I'm not seeing a build with that ID in your Coveralls build history.

I'll wait to hear back from you about that one; so I can then look closer. (I'm thinking there may have been a build with that service ID that was never closed.)

In the meantime, to finish:

I have not tried this as a remedy for the above, mostly because running all CI jobs would run the risk of encountering more I/O failures in other unit jobs that previously passed...

OK. Just so you know, in terms of future support it shouldn't matter whether you do one or the other---re-run one or more jobs, or re-run the full build. I was just wondering if you were getting different results for either of those two scenarios.

And otherwise the draft spec looks good, I believe it completes all use-cases.

Great. Thank you. Talk soon.

thibaultcha commented 11 months ago

Oh right! You are correct, I misread the history, these jobs did still have the COVERALLS_SERVICE_NUMBER fix...

Otherwise, Coveralls should have "re-opened" and added 1-2 new jobs to a build with ID 5747743858.

So, does that mean Part A is working as intended as of today? Are these 1-2 new jobs merged into the existing build, or are they added to a new build containing only 2 jobs (which would not be the desired behavior ofc)? I will try and see how it behaves the next time it occurs, on main or in a PR... I just fixed one of our most unstable I/O tests so these occurrences should be even less frequent for now...

afinetooth commented 11 months ago

So, does that mean Part A is working as intended as of today? Are these 1-2 new jobs merged into the existing build, or are they added to a new build containing only 2 jobs (which would not be the desired behavior ofc)?

Yes, Part A should be working today, which is to say:

As long as those two prior jobs had failed prior (and not yet made it into the existing build at Coveralls) they will be added to the existing build.
But if the job(s) (coverage report(s)) received already existed in the build, then a new build will be generated with those (new) jobs.

Part B will make sure the (new) jobs just replace their previous versions in the existing build.

I should also say:

Given the carryforward: all setting, the new build with the (new) jobs should continue to populate itself into a full build with carried-forward jobs.

I will try and see how it behaves the next time it occurs, on main or in a PR... I just fixed one of our most unstable I/O tests so these occurrences should be even less frequent for now...

Great. I would like to verify it's current working state per the above, now, particularly, the carried-forward jobs. I do know you weren't seeing those before, and it does seem to me that they should have been added to this build. There may be a piece to fix there.

Standing by.