lemurheavy / coveralls-public

The public issue tracker for coveralls.io
http://coveralls.io

Merging covered relevant lines from different jobs #1514

Open aenachescu opened 3 years ago

aenachescu commented 3 years ago

If I have a build with multiple jobs, the number of relevant lines is not computed as I would expect. As far as I understand, a line is considered relevant only if it is a relevant line in all jobs. Shouldn't a line be considered relevant if it is a relevant line in at least one job? For example:

Job 1: https://coveralls.io/jobs/72975052/source_files/4640888347
Relevant lines: 32, 36, 49, 52, 54, 56, 58, 60, 92, 94, 109, 111, 127, 129

Job 2: https://coveralls.io/jobs/72975096/source_files/4640912663
Relevant lines: 32, 36, 66, 69, 71, 73, 75, 77, 96, 98, 100, 102, 109, 111, 127, 129

Final result: https://coveralls.io/builds/36073657/source?filename=test/check_build_configuration/main.c
Relevant lines: 32, 36, 109, 111, 127, 129

I would expect the relevant lines to be: 32, 36, 49, 52, 54, 56, 58, 60, 66, 69, 71, 73, 75, 77, 92, 94, 96, 98, 100, 102, 109, 111, 127, 129.

Is there any solution to fix my problem?
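
For concreteness, here is a small self-contained C sketch (nothing to do with the project's actual code) that applies the two possible merge rules to the relevant-line sets listed above:

```c
/* Illustrative only: compare "intersection" vs "union" merging of the
 * relevant-line sets reported by the two jobs above. */
#include <stdio.h>

#define MAX_LINE 129

int main(void) {
    const int job1[] = {32, 36, 49, 52, 54, 56, 58, 60, 92, 94, 109, 111, 127, 129};
    const int job2[] = {32, 36, 66, 69, 71, 73, 75, 77, 96, 98, 100, 102,
                        109, 111, 127, 129};
    int rel1[MAX_LINE + 1] = {0}, rel2[MAX_LINE + 1] = {0};
    int i, both = 0, either = 0;

    for (i = 0; i < (int)(sizeof job1 / sizeof job1[0]); i++) rel1[job1[i]] = 1;
    for (i = 0; i < (int)(sizeof job2 / sizeof job2[0]); i++) rel2[job2[i]] = 1;

    for (i = 1; i <= MAX_LINE; i++) {
        both   += rel1[i] && rel2[i];  /* relevant in all jobs (intersection)  */
        either += rel1[i] || rel2[i];  /* relevant in at least one job (union) */
    }

    /* Prints: intersection=6 union=24 */
    printf("intersection=%d union=%d\n", both, either);
    return 0;
}
```

The intersection is exactly the 6 lines the combined build shows; the union is the 24 lines I would expect.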

afinetooth commented 3 years ago

@aenachescu I agree. Relevant lines should be considered relevant in all jobs, but only if all things remain equal, which is to say, if the same coverage library is being used in both jobs, instrumenting the same code files in the same way.

Is it possible that's different across your two jobs?

If not, then the next step will be to re-run the previous job (Job #27) in verbose mode so we can see exactly what's being sent to the coveralls API.

Coveralls does not determine what lines of code are considered relevant. Your test coverage library does that, first, then CI sends us that determination in the form of your coverage report. We simply read in the coverage report and log it for each job, each build. For a given build, we do combine the coverage reports for each job, but, again, we don't determine which lines are relevant.

If you re-run Job #27 in verbose mode, you should find that in the generated JSON (and potentially, one step before in the generated LCOV), the number of relevant lines has already been determined, and I suspect it will be as you describe above / as follows:

/test/check_build_configuration/main.c:

It is very strange to me that the combined result indicates only 6 relevant lines: https://coveralls.io/builds/36073657

Screen Shot 2021-01-05 at 1 17 20 PM

The fact that this result is only the intersection of the relevant lines across the two jobs, rather than their union, makes me think we may be getting covered lines in place of relevant lines on the incoming reports.

That's why it will help to see verbose output from the CI build logs for both of those jobs (27.1 & 27.3).

The method to invoke verbose mode depends on what language integration you're using.

Per our C/C++ listings, I'll assume yours is: cpp-coveralls. Per usage instructions there, it seems you can pass the --verbose flag to the coveralls command.

aenachescu commented 3 years ago

Is it possible that's different across your two jobs?

No; in the first job I test the code using gcc, and in the second job I test it using the g++ compiler.

In order to report the coverage to Coveralls I use the coveralls-lcov tool. I passed the --verbose flag to coveralls-lcov and got these JSON outputs:

afinetooth commented 3 years ago

Hi @aenachescu, thanks for the details.

So, at a high level, I'm afraid I need to kick the issue back to you.

In other words, there's nothing specific to Coveralls that seems to be at play here. We simply consume the coverage reports we receive, via POSTs to our API, and, from the incoming JSON shown above, you can see the issue starts there.

That is to say that, between the two jobs, for the same file, test/check_build_configuration/main.c, the number of relevant lines cited in each job's report is different: Job 30.1 reports 14 relevant lines, while Job 30.3 reports 16 relevant lines.

This indicates the origin of the issue is with the coverage library / coverage tooling native to your project. In our experience, if the same code is being tested by the same tests, using the same coverage library, there shouldn't be any difference in what that coverage library determines to be the relevant lines for a given file, regardless of the number of passes.

I would like to be more helpful, and for some projects I could be, but the nature of your project falls outside my domain of expertise. I would need you to answer a bunch of questions I have about your project and, even then, I think it's unlikely I'll be able to identify the source of the discrepancy.

That said, I'm willing to help, so, if you'd like more help, let me know and I'll start with my questions.

aenachescu commented 3 years ago

Hi @afinetooth. Somehow the origin of the issue is in gcov. I mean, there are a lot of issues when you try to compute coverage for template functions/classes in C++, or for code under #ifdef (I don't know if you are familiar with C/C++). In my case, I have some code that is compiled only if the compiler is g++ and other code that is compiled only if the compiler is gcc. When the compiler is g++, the code that is compiled only for gcc won't be seen as relevant code because it is not included in the binary, and the same happens the other way around when the compiler is gcc. And I can understand that...
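
A minimal illustration of what I mean (a simplified, hypothetical file, not the actual main.c):

```c
/* Hypothetical sketch of the situation described above; not the actual main.c. */
#include <stdio.h>

int main(void) {
#ifdef __cplusplus
    /* Only compiled when the file is built with g++ (as C++): in the gcc job
     * these lines never reach the binary, so gcov does not mark them relevant. */
    printf("built as C++\n");
#else
    /* Only compiled when the file is built with gcc (as C): likewise invisible
     * to the g++ job's coverage data. */
    printf("built as C\n");
#endif
    return 0;
}
```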

But my question is: why do you take the intersection of relevant lines and not the union? I see in the JSON a coverage array that contains either null or the number of hits for the nth line. In my case there are two JSON outputs, each one has the source file test/check_build_configuration/main.c, and this file has a coverage array in each job. Isn't it possible to do a union on these arrays?
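
Concretely, something like this per-line rule is what I am imagining; just a rough sketch in C, with -1 standing in for the JSON null (I know the real implementation is not C):

```c
#include <stdio.h>

#define NOT_RELEVANT (-1)   /* stands in for JSON null in the coverage array */

/* Union merge of one line's entries from two jobs: the line stays relevant if
 * it is relevant (non-null) in at least one job; hits are summed where both
 * jobs consider it relevant. */
static int merge_line_union(int a, int b) {
    if (a == NOT_RELEVANT) return b;
    if (b == NOT_RELEVANT) return a;
    return a + b;
}

int main(void) {
    /* Tiny made-up coverage arrays for the same 6-line file in two jobs. */
    int job1[] = {NOT_RELEVANT, 3, 0, NOT_RELEVANT, 1, NOT_RELEVANT};
    int job2[] = {NOT_RELEVANT, 2, NOT_RELEVANT, 4, NOT_RELEVANT, 0};
    int i;

    for (i = 0; i < 6; i++)
        printf("line %d: %d\n", i + 1, merge_line_union(job1[i], job2[i]));
    /* Lines 2-6 come out relevant (5, 0, 4, 1, 0); only line 1 stays null. */
    return 0;
}
```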

afinetooth commented 3 years ago

@aenachescu aha. No, I don't have that expertise, but that was going to be my guess, that somehow the different compilers render different lines of the original source relevant / irrelevant. So that helps to know, and seems the likely source of discrepancy.

Your point about why we take the intersection and not the union concerns something happening on our side that, frankly, I don't fully understand either. Please let me examine it and get back to you. I suspect that we expect relevant lines to be relevant (non-null) across all jobs and, therefore, only sum hits for lines that are non-null across all jobs. (We re-build that coverage hash for each file before calculating aggregate coverage.) Let me verify.
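
In other words, I suspect the per-line merge currently behaves roughly like the sketch below (again just an illustration, with -1 in place of the JSON null, and not our actual code):

```c
#include <stdio.h>

#define NOT_RELEVANT (-1)   /* stands in for the JSON null */

/* Suspected current behavior (sketch only): a line stays relevant only if it
 * is relevant (non-null) in every job; otherwise it is dropped entirely. */
static int merge_line_intersection(int a, int b) {
    if (a == NOT_RELEVANT || b == NOT_RELEVANT)
        return NOT_RELEVANT;
    return a + b;   /* relevant in all jobs: sum the hits */
}

int main(void) {
    /* Reusing the same made-up 6-line file from the sketch above. */
    int job1[] = {NOT_RELEVANT, 3, 0, NOT_RELEVANT, 1, NOT_RELEVANT};
    int job2[] = {NOT_RELEVANT, 2, NOT_RELEVANT, 4, NOT_RELEVANT, 0};
    int i;

    for (i = 0; i < 6; i++)
        printf("line %d: %d\n", i + 1, merge_line_intersection(job1[i], job2[i]));
    /* Only line 2 survives as relevant (5); everything else becomes null,
     * which matches the shrinking relevant-line counts described above. */
    return 0;
}
```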

vtjnash commented 2 years ago

I just noticed this problem appears to still be happening in the Coveralls reports for my latest build: https://coveralls.io/builds/46667478

But there's a lot going on here to try to unpack:

Is there an update on this?

afinetooth commented 2 years ago

Hi @vtjnash.

So at this point, I noticed that you've run at least one more build against that PR and, as designed, the latest PR Comment replaced the previous one. It has coverage at 92% and looks like this:

Screen Shot 2022-02-21 at 12 43 15 PM

That badge also links back to the more recent build: https://coveralls.io/builds/46695736

So I'm going to try to address the same concerns for that build, but I need some help.

You said:

  • The "true" value for coverage on this build is probably 90.16%

Can you clarify why you think that? And what you believe coverage is on the new build?

I noticed you also said:

The summary view shows 89.89% covered

Which I take to be this view from the old build:

Screen Shot 2022-02-21 at 1 04 03 PM

Don't let the value of any one file (even in a project with just one file) mislead you regarding aggregate coverage for the project. There are a couple of reasons why the coverage of one file may not lead intuitively to the aggregate coverage, or, in the case of this one-file project, might not equal the aggregate coverage:

  1. First, the coverage shown for individual files is limited to line coverage, while some projects have reports that also include branch coverage. I don't believe yours does, but this is a common source of confusion.
  2. Second (and likely applicable to your use case), the coverage for the file is derived from several jobs, each of which may have conflicting coverage data for the file. In this case, until confirmed by a review of each coverage report, we can probably assume the source of the discrepancy is what @aenachescu points to above: that our algorithm appears to take the intersection and not the union of relevant lines across all "instances" of the same file (see the sketch after this list).
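
To illustrate (2) with purely invented numbers (not taken from your build), here is a quick sketch of how a merge that keeps only lines relevant in every job can produce a figure unlike any single job's:

```c
/* Hypothetical numbers only, to illustrate point (2) above. */
#include <stdio.h>

int main(void) {
    /* Job A sees 90 relevant lines, 80 covered; Job B sees 100 relevant, 85 covered. */
    double jobA = 80.0 / 90.0 * 100.0;    /* ~88.9% */
    double jobB = 85.0 / 100.0 * 100.0;   /*  85.0% */

    /* If the merged file keeps only lines relevant in *both* jobs (say 85 lines,
     * 80 of them covered in at least one job), the merged figure can exceed
     * either individual job's figure: */
    double merged = 80.0 / 85.0 * 100.0;  /* ~94.1% */

    printf("job A %.1f%%, job B %.1f%%, merged %.1f%%\n", jobA, jobB, merged);
    return 0;
}
```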

I have been able to confirm that this is designed into the algorithm, and not just an unhandled corner case.

Comments on the aggregation method explain the choice:

If a line is not relevant in one file it should be irrelevant in all. Fix for some coverage libraries which fail to report correct relevancy when file has no coverage

I'm not 100% sure what "fail to report correct relevancy" and "when file has no coverage" mean, or how often such cases occur, but I've asked that question and submitted the situation described in this issue as a conflicting use case that may warrant a more nuanced approach.

Thinking about workarounds until such time that this could be addressed...

At the moment, all I can come up with is to:

  1. Rewrite your file in a way that would avoid discrepancies in relevant lines for the main body of your code. Not sure how, but perhaps by splitting it, or by rewriting the statements on the offending line(s) in a way that forces them to always be considered relevant.
  2. Look into your coverage library and see if there's a setting that allows you to force a line to be considered relevant (or perhaps a comment on the line that the library would recognize).
vtjnash commented 2 years ago

I looked at codecov on the same commit, and examined the build logs for individual jobs, and saw that 92% is overestimating the total (missing some lines that are relevant only on some platforms, but uncovered on all platforms) and 89% is underestimating (counting as uncovered some lines that are relevant only on other platforms). So 90% seemed a fair estimate for the aggregation of the lines covered/relevant when aggregating test runs across all platforms.

Given the reported value there, I think the aggregate coverage in the comment might be wrong due to:

  3. Thirdly, the aggregate coverage is not computed at all when updating the PR comment, but simply uses the value of the last job run, even if using the parallel flag.
afinetooth commented 2 years ago

I looked at codecov on the same commit, and examined the build logs for individual jobs, and saw that 92% is overestimating the total (missing some lines that are relevant only on some platforms, but uncovered on all platforms) and 89% is underestimating (counting as uncovered some lines that are relevant only on other platforms). So 90% seemed a fair estimate for the aggregation of the lines covered/relevant when aggregating test runs across all platforms.

Ok, got it. Thanks.

Given the reported value there, I think the aggregate coverage in the comment might be wrong due to: 3. Thirdly, the aggregate coverage is not computed at all when updating the PR comment, but simply uses the value of the last job run, even if using the parallel flag.

Sorry, can you please clarify where "there" is when you say:

Given the reported value there [...]

Can you also explain why you called this (3)?

  3. Thirdly, the aggregate coverage is not computed at all when updating the PR comment, but simply uses the value of the last job run, even if using the parallel flag.

Is it meant to be another reason (besides the two I mention above) why the file coverage may differ from the aggregate coverage?

Also, AFAIK this is not correct, unless I misunderstand the context:

the aggregate coverage is not computed at all when updating the PR comment, but simply uses the value of the last job run, even if using the parallel flag.

I think I might know what you mean. In actuality, coverage is re-computed for every PR commit. There is another issue there that is masking that.

If you're talking about the Status Updates for your PR, the issue is that you should either be seeing status updates for all jobs + a final (aggregate) status update for the build, OR just a status update for the build, depending on your settings. But due to the bug, for some reason, in your PR I am seeing two (2) status updates, one for your final job + one for the overall build. Aside from the bug, though, the status update for the build should always show the aggregate coverage and link back to the build page, which represents the aggregate coverage.

If you're talking about the PR comment on your PR, the badge image should represent aggregate coverage and always links back to the build page, which represents aggregate coverage.

As it does here, for me: https://github.com/vtjnash/Pidfile.jl/pull/11

Screen Shot 2022-02-22 at 11 27 43 AM

Screen Shot 2022-02-22 at 11 28 03 AM

vtjnash commented 2 years ago

Look into your coverage library and see if there's a setting that allows you to force a line to be considered relevant (or perhaps a comment on the line that the library would recognize).

Yeah, this is not ideal, but some of the builds are from old versions of the build tool, which have less reliable information on relevant lines. I perhaps should only upload builds for the latest version, as intersecting them in this way is leading to even worse reported statistics.

Is it meant to be another reason (besides the two I mention above) why the file coverage may differ from the aggregate coverage?

Yes, I am just guessing there at why it seems to agree exactly with the last build, but disagree with the other places that this is being reported.

I am not sure why that status reporting is odd, since I have the official uploader here, and thought this was the expected configuration: https://github.com/vtjnash/Pidfile.jl/blob/4466af1b9abe1484ed493502a8f1501bc550fc3e/.github/workflows/CI.yml#L53-L62

```yaml
          # `parallel: true` on the per-job Coveralls upload step (above) marks
          # each test job as one piece of a single parallel build
          parallel: true
  finish:
    # runs after all test jobs and tells Coveralls the parallel build is done,
    # so it can close the build and compute the aggregate coverage
    needs: test
    runs-on: ubuntu-latest
    steps:
    - name: Coveralls Finished
      uses: coverallsapp/github-action@master
      with:
        github-token: ${{ secrets.github_token }}
        parallel-finished: true
```
afinetooth commented 2 years ago

First, your CI config looks fine. Thanks for sharing it.

Yes, I am just guessing there at why it seems to agree exactly with the last build, but disagree with the other places that this is being reported.

I see what you mean. I had not noticed that before, only that the figure correlated with the overall build.

It does seem odd that the last job's coverage would be identical to the overall coverage for the build, rather than closer to an average of all the jobs.

I re-ran the jobs for your previous build, this one: https://coveralls.io/builds/46667478

And got something different for the build coverage %:

Instead of 91.6% it now reads 89.888%, which seems closer to an appropriate coverage given the constituent jobs:

Screen Shot 2022-02-23 at 2 13 27 PM

I don't know why it happened, but it suggests a one-off issue.

To validate, I re-ran the jobs on the newer build as well: https://coveralls.io/builds/46695736

And similarly, the overall coverage changed again, this time to 91.346%.

So, if it is a one-off, it still happened consistently for two builds in a row.

To test further, I re-ran the jobs for the later build again (and again) to see if I continued getting the same overall coverage %, which should be what always happens, and I did consistently get the same result: 91.346%.

So I think there's an issue at play having to do with your build not closing and calculating aggregate coverage, but instead just taking the last job's coverage as the aggregate coverage.

I haven't seen this before and, like I said, I see nothing in your CI config that indicates why it would happen.

Can you verify if this is the case for any other builds?

afinetooth commented 2 years ago

P.S. I tried it on two more of your PR builds (the last build for each PR):

And aggregate coverage for both builds changed slightly after I re-ran the jobs:

https://coveralls.io/builds/44414158 (for PR 9)

And:

https://coveralls.io/builds/37426114 (for PR 7)

I'm looking into what could have happened here. Will feed back asap.

afinetooth commented 2 years ago

Hi, @vtjnash. I can't find any more cases of PR builds that display the behavior where overall coverage matches the coverage of the last job.

For instance, this build for PR 5 looks normal, so apparently the issue doesn't go that far back.

Unfortunately, we don't keep the data on builds that are replaced by subsequent builds (which is what happens when I manually re-run a build), so I will need you to inform me the next time you see the issue so I can examine your new build. (I probably got a little overzealous when I re-ran the builds for PRs 9 & 7.)

As far as possible causes, I did notice that, across these three (3) builds for PR 9, you have a varying number of parallel jobs. Specifically:

Do you think it's possible that's due to an errant or duplicate job?

I looked at your latest build to see what your "canonical" number of jobs should be and see that there are seven (7) there: https://coveralls.io/builds/46695736

That can be perfectly OK, I'm just wondering if you're always getting the number of jobs you're expecting.

vtjnash commented 2 years ago

Yeah, there is not much activity on that repo, so I don't have more examples. I recently changed the number of build configurations (removing x86, adding Windows), so that is why you are probably seeing varying numbers of jobs.

afinetooth commented 2 years ago

Got it. Yeah, I see the repo is stable and updates are infrequent. Well, we can keep this open, or you can re-open it if it happens again. Or email us at support@coveralls.io and reference this issue. We'll try to find out what's happening.

vtjnash commented 2 years ago

This new report seems even stranger: https://coveralls.io/builds/48419690. It reports an overall status of 0 out of 0 lines covered, despite each individual job covering 80% of the file. And for status results it posted these 2 badges (corresponding to the "total" and to the last run to finish, respectively).

image

afinetooth commented 2 years ago

Hi, @vtjnash.

I re-ran the build: https://coveralls.io/builds/48419690

And see new results:

Screen Shot 2022-04-20 at 10 45 18 AM

Screen Shot 2022-04-20 at 10 45 45 AM

Screen Shot 2022-04-20 at 10 45 35 AM

And in the PR: https://github.com/vtjnash/Pidfile.jl/pull/16

Screen Shot 2022-04-20 at 10 46 33 AM

Screen Shot 2022-04-20 at 10 46 54 AM


I can't speak to why the original build may have failed, but over the past 3-4 days we were affected by random, intermittently failing requests to the GitHub API. The usual effect was that PRs were not receiving status updates, which doesn't seem to be the situation in your case. But I do wonder if the underlying issue led to your build calculating coverage incorrectly / missing source files.

The problems mentioned above were resolved sometime in the last 24 hours, so please let me know if you continue to see any further issues like this, which would negate my theory.

stale[bot] commented 1 year ago

This issue has been automatically marked for closure because it has not had recent activity. It will be closed if no further activity occurs. If your issue is still active please add a comment and we’ll review as soon as we can. Thank you for your contributions.