hi @jinhong-,
I took a guess at the project and build you're referencing above. (I'm using it here since only you and your team members can access that link.)
Looking at the PR build (RIGHT), compared to the PR's base build (LEFT):
I'm also confused by the coverage change, since I don't see any indication of that difference in the files themselves, only in the RUN DETAILS, which correlates with the change.
But I suspect those numbers are wrong and that the build may have been corrupted, so I re-ran the parallel jobs in the order they originally arrived and got a new result (a new coverage calculation), which confirms that suspicion.
There is now NO CHANGE in coverage between the two builds:
I'm afraid there's little to go on when it comes to one-off corrupted builds, in terms of determining root cause. We would need to see a pattern, so please let us know, here, or at support@coveralls.io, if this kind of result persists.
In the meantime, to answer your other question:
The number reported in the Tree is different from the coverage decrease reported
The reason for that is that the FILES section only displays the line coverage in your project's coverage report, whereas your Coveralls repo, as configured in its SETTINGS, also tracks the branch coverage in your coverage reports and considers it in calculating total (aggregate) coverage.
The aggregate coverage is reflected in RUN DETAILS, and you'll note that yours includes branch coverage details:
Here's the formula for aggregate coverage when branch coverage is included:
aggregate coverage w/ branch coverage = (lines hit + branches hit) / (relevant lines + relevant branches)
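For a quick illustration with made-up numbers: a build with 800 of 1,000 relevant lines hit and 150 of 200 relevant branches hit would show 80.0% line coverage in FILES, but an aggregate coverage of (800 + 150) / (1,000 + 200) ≈ 79.2% in RUN DETAILS, which is why the two numbers can differ.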
And here's the branch coverage setting in your project SETTINGS:
Thanks! We will keep observing. The misreporting of coverage seems to be happening fairly frequently. Does the order of parallel execution matter? Also, is there an explanation of what branch coverage means?
Here's another that failed with 0% again https://coveralls.io/builds/48934608
@jinhong- Yes, I see the same behavior again. But I don't see any underlying reason for it. There is nothing out of order with your coverage posts, or how they came in. They are in a different order in the PR build than in the base build, but we're aware of that. (Coveralls knows, for instance, that job 2 in the PR build corresponds to job 3 in the base build, etc.)
The "reasoning" for the drop in coverage comes from the RUN DETAILS, and you can see how that is purported to change between the base build (LEFT) and the PR build (RIGHT), here:
Those RUN DETAILS are supposed to be the un-modified details from your coverage reports.
So the first thought about root cause is an issue with your reports. But I can eliminate that if I re-run your PR build and get more accurate results, like last time.
Which I did and...
Again, the RUN DETAILS changed after Coveralls re-consumed each of your coverage reports (jobs).
So we have our pattern. But unfortunately I still can't name the cause.
Obviously, there's something interfering with the consumption of coverage reports the first time around.
The next step in terms of diagnosing would be for you to invoke your next builds in verbose mode and share with me your CI build logs. (At least the portions related to Coveralls.)
I know your project is private, so feel free to share those to support@coveralls.io and just mention this issue. I will look for them.
Here's how to enable verbose mode for the Coveralls Github Action:
NODE_COVERALLS_DEBUG=1
- The NODE_ part is there because the Coveralls Github Action runs the node-coveralls integration under the hood.
Thanks.
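(A rough sketch of where to set this, assuming a typical setup: with the Github Action, add NODE_COVERALLS_DEBUG as an environment variable on the Coveralls step of your workflow; if you post coverage by running node-coveralls directly, set it on that command, for example:
NODE_COVERALLS_DEBUG=1 ./node_modules/.bin/coveralls < ./coverage/lcov.info
Either way, the variable just needs to be present in the environment of the process that posts the report to Coveralls.)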
I have sent the logs over to you
@jinhong- Got them, thanks. Will reply in email but backfill any details here that will help others.
Thanks, @dhui. @jinhong- is that a viable workaround for you for the time being?
In your SETTINGS, you would enter 0.1 into the COVERAGE DECREASE THRESHOLD FOR FAILURE field, like so:
We are trying to determine what kind of debug info / monitoring would help us understand what's happening with your initial builds that don't calculate properly.
Unfortunately that may not help in our case, since the coverage seems to drop all the way down to zero.
@afinetooth how are you re-running the tests in the order they arrived? Are you able to expose this functionality so I can trigger a re-run myself? I am facing this issue fairly frequently.
My theory is that the analysis is timing out/erroring out on your backend, and the behavior of timing out is to have coverage reported at 0%. I observed that the results took longer than usual to arrive. I am assuming there is some background processing involved.
@jinhong- Unfortunately it's not something I can expose for you to trigger. Right now, it's just an internal command I can execute via dev console, so not available via API or anything. It is planned for future release, but probably not on a timeline to be of use here.
Your theory is reasonable. To test it I ran a report of your last 100 builds (attached; it's anonymized), and your build times all look normal, except for the original build referenced above: 48890929. It's one of the only builds with a longer build time, and in that case the build time is an extreme outlier.
Maybe you can look through the file and see if the IDs of any more of your problem builds match longer build times. (I don't really see any builds that took as long as the one mentioned, though.) Note that the build ID is what's displayed in the URL for your build, not the label given by your CI service that appears on your build pages.
Also, I noticed the exact same PR/build (2345857520) would first fail with 0%, then pass afterwards. Did you trigger a re-run for build 2345857520?
A few questions on the CSV file.
having the same issue
Hi @chapayevdauren. I'll need to know the Coveralls URL for your repo, or the URL for the problematic build.
If it's private, or sensitive, please email support@coveralls.io and mention this issue. I'll get it and reply.
@chapayevdauren — replied in email.
@afinetooth I triggered a re-run after setting the Coverage Decrease Threshold for Failure for the repo to 0.1%, and now the commit status shows as passing. This temporary workaround should work for us for now since the code base isn't huge; e.g., with 6.5K LOC, 0.1% would mean that ~6 LOC could lose coverage and the coverage checks would still pass. Could we set this value lower, e.g. 0.01% or 0.001%?
Thanks @dhui, for the update.
@dhui and @chapayevdauren, I also have an update: We are seeing this pattern in several other customer repos right now. Which is to say, intermittent PR builds showing 0% coverage, caused by an incorrect aggregate coverage calculation, which is corrected by re-running / re-playing the original jobs.
We don't currently understand the root cause, but we think it may be due to some recent issues described on our status page: https://status.coveralls.io/
That said, the normal behavior would be for the builds to take longer than normal, not complete incorrectly. So if the above is a cause, it's for a different reason, such as the calculation job failing before it can obtain its data, due to a timeout, etc.
Will share updates here.
@jinhong- @dhui @chapayevdauren
While we've not yet identified a fix for this issue, we released a workaround today that should resolve it for you: the Rerun Build Webhook.
Since the nature of the issue appears to be that, for some repos with parallel builds, the initial aggregate coverage calculation sometimes completes incorrectly (showing 0% coverage) and is only corrected when the jobs are re-played, the Rerun Build Webhook, similar to the (Close) Parallel Build Webhook, fixes the issue by triggering your build to re-calculate itself.
Call this at the end of your CI config, after calling the (Close) Parallel Build Webhook.
Call it like this:
curl --location --request GET 'https://coveralls.io/rerun_build?repo_token=<YOUR REPO TOKEN>&build_num=<YOUR BUILD NUMBER>'
But substitute your repo_token, and your build_num (the same value you used for build_num in your (Close) Parallel Build Webhook).
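Put together, a rough sketch of the end of a parallel CI config (assuming a shell step, and assuming $COVERALLS_REPO_TOKEN and $BUILD_NUMBER already hold the same values you pass to the (Close) Parallel Build Webhook) would be the two calls in sequence:
curl -k "https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN" -d "payload[build_num]=$BUILD_NUMBER&payload[status]=done"
curl --location --request GET "https://coveralls.io/rerun_build?repo_token=$COVERALLS_REPO_TOKEN&build_num=$BUILD_NUMBER"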
Please note a few differences between the Rerun Build Webhook and the (Close) Parallel Build Webhook:
- The /rerun_build endpoint will accept a GET or a POST.
- repo_token and build_num are regular URL params, not part of a JSON body called "payload" as required by the (Close) Parallel Build Webhook:
curl -k https://coveralls.io/webhook?repo_token=<YOUR REPO_TOKEN> -d "payload[build_num]=<YOUR BUILD NUMBER>&payload[status]=done"
NOTE: In case you're having trouble determining what build_num is for your project, I posted some follow-up here.
If you're using a different Coveralls integration and/or are still having trouble determining the correct values for either build_num or repo_token, let me know here, in the context of your issue, or at support@coveralls.io.
I'm struggling to understand why coverage reports a decrease when there is no change in code that affects coverage. Screenshot attached below. The number reported in the Tree is different from the coverage decrease reported.