hi @jinhong-,
I took a guess at the project and build you're referencing above. (I'm using it here since only you and your team members can access that link.)
Looking at the PR build (RIGHT), compared to the PR's base build (LEFT):
I'm also confused by the coverage change, since I don't see any indication of that difference in the files themselves, only in the RUN DETAILS, which correlates with the change.
But I suspect those numbers are wrong and that the build may have been corrupted, so I re-ran the parallel jobs in the order they originally arrived and got a new result (a new coverage calculation), which confirms that suspicion.
There is now NO CHANGE in coverage between the two builds:
I'm afraid there's little to go on when it comes to one-off corrupted builds, in terms of determining root cause. We would need to see a pattern, so please let us know, here, or at support@coveralls.io, if this kind of result persists.
In the meantime, to answer your other question:
The number reported in the Tree is different from the coverage decrease reported
The reason for that is that the FILES section only displays the line coverage in your project's coverage report, whereas your Coveralls repo, as configured in its SETTINGS, also tracks the branch coverage in your coverage reports and considers it in calculating total (aggregate) coverage.
The aggregate coverage is reflected in RUN DETAILS, and you'll note that yours includes branch coverage details:
Here's the formula for aggregate coverage when branch coverage is included:
aggregate coverage w/ branch coverage = (lines hit + branches hit) / (relevant lines + relevant branches)
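For a quick illustration with made-up numbers: a build with 800 of 1,000 relevant lines hit and 150 of 200 relevant branches hit would show 80.0% line coverage in FILES, but an aggregate coverage of (800 + 150) / (1,000 + 200) ≈ 79.2% in RUN DETAILS, which is why the two numbers can differ.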
And here's the branch coverage setting in your project SETTINGS:
Thanks! We will keep observing. The misreporting of coverage seems to be happening fairly frequently. Does the order of parallel execution matter? Also, is there an explanation of what branch coverage means?
Here's another that failed with 0% again https://coveralls.io/builds/48934608
@jinhong- Yes, I see the same behavior again. But I don't see any underlying reason for it. There is nothing out of order with your coverage posts, or how they came in. They are in a different order in the PR build than in the base build, but we're aware of that. (Coveralls knows, for instance, that job 2 in the PR build corresponds to job 3 in the base build, etc.)
The "reasoning" for the drop in coverage comes from the RUN DETAILS, and you can see how that is purported to change between the base build (LEFT) and the PR build (RIGHT), here:
Those RUN DETAILS are supposed to be the un-modified details from your coverage reports.
So the first thought about root cause is an issue with your reports. But I can eliminate that if I re-run your PR build and get more accurate results, like last time.
Which I did and...
Again, the RUN DETAILS changed after Coveralls re-consumed each of your coverage reports (jobs).
So we have our pattern. But unfortunately I still can't name the cause.
Obviously, there's something interfering with the consumption of coverage reports the first time around.
The next step in terms of diagnosing would be for you to invoke your next builds in verbose mode and share with me your CI build logs. (At least the portions related to Coveralls.)
I know your project is private, so feel free to share those to support@coveralls.io and just mention this issue. I will look for them.
Here's how to enable verbose mode for the Coveralls Github Action:
NODE_COVERALLS_DEBUG=1
- The NODE_ part is there because the Coveralls Github Action runs the node-coveralls integration under the hood.
Thanks.
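(A rough sketch of where to set this, assuming a typical setup: with the Github Action, add NODE_COVERALLS_DEBUG as an environment variable on the Coveralls step of your workflow; if you post coverage by running node-coveralls directly, set it on that command, for example:
NODE_COVERALLS_DEBUG=1 ./node_modules/.bin/coveralls < ./coverage/lcov.info
Either way, the variable just needs to be present in the environment of the process that posts the report to Coveralls.)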
I have sent the logs over to you
@jinhong- Got them, thanks. Will reply in email but backfill any details here that will help others.
Thanks, @dhui. @jinhong- is that a viable workaround for you for the time being?
In your SETTINGS, you would enter 0.1 into the COVERAGE DECREASE THRESHOLD FOR FAILURE field, like so:
We are trying to determine what kind of debug info / monitoring would help us understand what's happening with your initial builds that don't calculate properly.
Unfortunately that may not help in our case, since the coverage seems to drop all the way down to zero.
@afinetooth how are you re-running the tests in the order they arrived? Are you able to expose this functionality so I can trigger a re-run myself? I am facing this issue fairly frequently.
My theory is that the analysis is timing out/erroring out on your backend, and the behavior of timing out is to have coverage reported at 0%. I observed that the results took longer than usual to arrive. I am assuming there is some background processing involved.
@jinhong- Unfortunately it's not something I can expose for you to trigger. Right now, it's just an internal command I can execute via dev console, so not available via API or anything. It is planned for future release, but probably not on a timeline to be of use here.
Your theory is reasonable. To test it I ran a report of your last 100 builds (attached; it's anonymized), and your build times all look normal, except for the original build referenced above: 48890929. It's one of the only builds with a longer build time, and in that case the build time is an extreme outlier.
Maybe you can look through the file and see if the IDs of any more of your problem builds match longer build times. (I don't really see any builds that took as long as the one mentioned, though.) Note that the build ID is what's displayed in the URL for your build, not the label given by your CI service that appears on your build pages.
Also, I noticed the exact same PR/build (2345857520) would first fail with 0%, then pass afterwards. Did you trigger a re-run for build 2345857520?
A few questions on the CSV file.
having the same issue
Hi @chapayevdauren. I'll need to know the Coveralls URL for your repo, or the URL for the problematic build.
If it's private, or sensitive, please email support@coveralls.io and mention this issue. I'll get it and reply.
@chapayevdauren — replied in email.
@afinetooth I triggered a re-run after setting the Coverage Decrease Threshold for Failure for the repo to 0.1%, and now the commit status shows as passing. This temporary workaround should work for us for now since the code base isn't huge; e.g., with 6.5K LOC, 0.1% would mean that ~6 LOC could lose coverage and the coverage checks would still pass. Could we set this value lower, e.g. 0.01% or 0.001%?
Thanks @dhui, for the update.
@dhui and @chapayevdauren, I also have an update: We are seeing this pattern in several other customer repos right now. Which is to say, intermittent PR builds showing 0% coverage, caused by an incorrect aggregate coverage calculation, which is corrected by re-running / re-playing the original jobs.
We don't currently understand the root cause, but we think it may be due to some recent issues described on our status page: https://status.coveralls.io/
That said, the normal behavior would be for the builds to take longer than normal, not complete incorrectly. So if the above is a cause, it's for a different reason, such as the calculation job failing before it can obtain its data, due to a timeout, etc.
Will share updates here.
@jinhong- @dhui @chapayevdauren
While we've not yet identified a fix for this issue, we released a workaround today that should resolve it for you: the Rerun Build Webhook.
Since the nature of the issue appears to be that, for some repos with parallel builds, the initial aggregate coverage calculation sometimes completes incorrectly (showing 0% coverage) and is only corrected when the jobs are re-played, the Rerun Build Webhook, similar to the (Close) Parallel Build Webhook, fixes the issue by triggering your build to re-calculate itself.
Call this at the end of your CI config, after calling the (Close) Parallel Build Webhook.
Call it like this:
curl --location --request GET 'https://coveralls.io/rerun_build?repo_token=<YOUR REPO TOKEN>&build_num=<YOUR BUILD NUMBER>'
But substitute your repo_token, and your build_num (the same value you used for build_num in your (Close) Parallel Build Webhook).
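Put together, a rough sketch of the end of a parallel CI config (assuming a shell step, and assuming $COVERALLS_REPO_TOKEN and $BUILD_NUMBER already hold the same values you pass to the (Close) Parallel Build Webhook) would be the two calls in sequence:
curl -k "https://coveralls.io/webhook?repo_token=$COVERALLS_REPO_TOKEN" -d "payload[build_num]=$BUILD_NUMBER&payload[status]=done"
curl --location --request GET "https://coveralls.io/rerun_build?repo_token=$COVERALLS_REPO_TOKEN&build_num=$BUILD_NUMBER"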
Please note a few differences between the Rerun Build Webhook and the (Close) Parallel Build Webhook:
- The /rerun_build endpoint will accept a GET or a POST.
- repo_token and build_num are regular URL params, not part of a JSON body called "payload" as required by the (Close) Parallel Build Webhook:
curl -k https://coveralls.io/webhook?repo_token=<YOUR REPO_TOKEN> -d "payload[build_num]=<YOUR BUILD NUMBER>&payload[status]=done"
NOTE: In case you're having trouble determining what build_num is for your project, I posted some follow-up here.
If you're using a different Coveralls integration and/or are still having trouble determining the correct values for either build_num or repo_token, let me know here, in the context of your issue, or at support@coveralls.io.
I'm struggling to understand why coverage reports a decrease when there is no change in code that affects coverage. Screenshot attached below. The number reported in the Tree is different from the coverage decrease reported.