trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Analyze reduced coverage when Epetra is disabled #13522

Open bartlettroscoe opened 1 week ago

bartlettroscoe commented 1 week ago

CC: @ccober6, @cgcgcg, @rppawlo, @sebrowne, @jwillenbring, @mmwolf

Description

With Epetra (and downstream packages and code that has required dependencies on Epetra) set to be removed in Trilinos 17.0 before the end of FY25, we need to be concerned about the loss of coverage of downstream packages that have tests based on Epetra. (Loss of coverage is bad.)

Until recently, it was not possible to post coverage results to CDash and compare the coverage of Trilinos packages with and without Epetra enabled. But Kitware has recently fixed CDash so that it can upload and display coverage results, and the Trilinos framework team has been posting coverage builds with Epetra on and Epetra off to a test CDash site https://trilinos-cdash-qual.sandia.gov.

So now we can start to compare the coverage of builds of Trilinos with and without Epetra enabled. Currently, the Trilinos Framework team is running the builds:

There is some evidence that these two builds are not 100% equivalent. For example, the coverage numbers for the Teuchos package are not exactly the same:

| Label | no-epetra | with-epetra |
| --- | --- | --- |
| Teuchos | 87.59 | 87.71 |
| TeuchosCommExes | 93.45 | 93.45 |
| TeuchosCommLibs | 83.69 | 83.69 |
| TeuchosCoreExes | 97.36 | 97.36 |
| TeuchosCoreLibs | 90.85 | 91.1 |
| TeuchosExes | 91.52 | 91.52 |
| ... | ... | ... |

If these builds were identical, then the coverage for all of Teuchos should be identical, not just for some of the labels. (But we will need to check that the coverage results are consistent from day to day to make sure there is not a defect in CTest or CDash.)

But even these early results show some significant drops in coverage in some cases. For example, we see noticeable drops in:

| Label | no-epetra | with-epetra |
| --- | --- | --- |
| MueLuLibs | 69.68 | 71.47 |
| NOXLibs | 43.3 | 57.25 |
| PiroLibs | 58.45 | 74.55 |
| StratimikosLibs | 63.74 | 79.28 |
| TekoLibs | 50.87 | 67.02 |
| ThyraLibs | 75.07 | 77.58 |

Obviously more analysis needs to be done here, but these early results would appear to show massive reductions in coverage in some cases when Epetra is disabled. That shows that some Trilinos packages still depend on Epetra-based tests for a lot of their coverage. Also note that this coverage includes all usage in downstream Trilinos packages; it does not show the drop in coverage in each native package's test suite. (One would expect the drop in coverage for the native package test suites to be much more pronounced when Epetra is disabled.)
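As a first pass at that analysis, the per-label deltas can be computed and the largest drops flagged. This is a hypothetical sketch using the "Libs" numbers quoted in the table above; the 10-point threshold is an arbitrary choice for illustration, not an agreed criterion.

```python
# Per-label coverage percentages from the two CDash builds quoted above.
no_epetra = {"MueLuLibs": 69.68, "NOXLibs": 43.30, "PiroLibs": 58.45,
             "StratimikosLibs": 63.74, "TekoLibs": 50.87, "ThyraLibs": 75.07}
with_epetra = {"MueLuLibs": 71.47, "NOXLibs": 57.25, "PiroLibs": 74.55,
               "StratimikosLibs": 79.28, "TekoLibs": 67.02, "ThyraLibs": 77.58}

def coverage_drops(before, after, threshold=10.0):
    """Return (label, drop) pairs where coverage fell by more than
    `threshold` percentage points, largest drop first."""
    deltas = {label: before[label] - after[label] for label in before}
    return sorted(((lbl, d) for lbl, d in deltas.items() if d > threshold),
                  key=lambda pair: -pair[1])

for label, drop in coverage_drops(with_epetra, no_epetra):
    print(f"{label}: -{drop:.2f} points")
```

With these numbers, four of the six "Libs" labels drop by more than 10 percentage points when Epetra is disabled.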

Also, we need to carefully separate coverage of library code from coverage of test code. We are most concerned with coverage of library code, since that is what users depend on. (NOTE: We can most likely do that with labels on source files.)
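The existing label naming convention already hints at this split (labels ending in "Libs" vs "Exes", as in the tables above). A minimal sketch of the filtering step on downloaded label data, assuming the suffix convention holds; the real separation would use LABELS properties on source files as noted:

```python
# Per-label coverage rows, as quoted in this issue.
rows = {"TeuchosCoreLibs": 90.85, "TeuchosCoreExes": 97.36,
        "TeuchosCommLibs": 83.69, "TeuchosCommExes": 93.45}

# Keep only the library-code labels; drop the test-executable labels.
lib_rows = {label: pct for label, pct in rows.items()
            if label.endswith("Libs")}
print(lib_rows)
```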

NOTE: Kitware needs to implement filtering by directory and a download API for the viewCoverage.php page so that we can download, analyze and compare this data in more detail (and line-by-line), and then we will be able to see where tests need to be added the most. (The SNL Kitware contract will have Kitware implementing these features later in FY25.)

Tasks

mmwolf commented 1 week ago

I agree that reduction of coverage is a concern. We've been discussing this and trying to get packages to address this for the past couple of years. This is something we should continue to work to improve. However, I don't believe your data (as shown above) supports your statement about "massive reductions in coverage." Significant for some packages, yes. Massive reduction in coverage is an exaggeration.

bartlettroscoe commented 1 week ago

I agree that reduction of coverage is a concern. We've been discussing this and trying to get packages to address this for the past couple of years. This is something we should continue to work to improve.

@mmwolf, until now, people were just guessing at the reduction in coverage. Now there is some real data.

However, I don't believe your data (as shown above) supports your statement about "massive reductions in coverage." Significant for some packages, yes. Massive reduction in coverage is an exaggeration.

That is a matter of opinion. Personally, I would consider a reduction in coverage from 57.25% to 43.3%, or from 74.55% to 58.45%, to be "massive".

But these coverage numbers don't really tell you what you want to know. What you really want to know is whether there are major features that are no longer being covered. (And a small amount of code can break a major feature.) Reductions in these coverage numbers just tell you that you might be in trouble. Right now, no one can say with any confidence what feature coverage is being lost when Epetra is disabled.

ccober6 commented 1 week ago

The percent coverage might tell you some information, but not much without additional context, so it is too early to label changes as significant or massive. Looking at the Tempus data, there seem to be some inconsistencies in what is being displayed (e.g., Tempus's Epetra tests are not being displayed for the with-Epetra build/test). There are probably others.

I am assuming that the no-epetra case not only is turning off the Epetra tests but also is not counting the lines ifdef'ed for Epetra code. Thus we are changing the tests used AND the lines of code counted. The percentage therefore can change a lot and not in a clear way without additional information.
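Some toy arithmetic makes this point concrete: disabling Epetra removes both covered lines (Epetra tests) and countable lines (ifdef'ed Epetra code), so the percentage can move in either direction. All numbers below are made up for illustration.

```python
def pct(covered, total):
    """Line coverage as a percentage."""
    return 100.0 * covered / total

# with-Epetra build: 800 of 1000 lines covered.
print(pct(800, 1000))

# Case A: the removed Epetra code was well covered (190 of 200 lines),
# so the remaining coverage falls to 610/800.
print(pct(800 - 190, 1000 - 200))

# Case B: the removed Epetra code was poorly covered (50 of 200 lines),
# so the remaining coverage rises to 750/800.
print(pct(800 - 50, 1000 - 200))
```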

I think it is better to look at the raw line-coverage data and see if important functions/capabilities have coverage (as determined by developers and stakeholders). The Epetra coverage can indicate which features/code were previously covering some lines of code, but it can't tell you its importance (e.g., pivotal, minor, or dead code).

An aside: I am a little hesitant to use the term "reduction in coverage". Consider the following thought experiment: if a feature/code with high coverage is deprecated, leaving only a feature/code with lower coverage, the overall percentage coverage may decline, but the line coverage for the remaining code remains unchanged (no actual reduction in coverage), assuming no overlap in the features/code. If there is an overlap in the features/code, one might refer to it as a reduction in coverage. However, in either scenario, I would prefer to say that we are simply revealing the coverage of the remaining feature/code, which has always existed but is now more visible.
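The non-overlapping case of this thought experiment can be shown with invented numbers: dropping a well-covered feature lowers the overall percentage even though the remaining feature's own line coverage never changes.

```python
# Feature A (to be deprecated): 950 of 1000 lines covered (95%).
# Feature B (remaining):        400 of 1000 lines covered (40%).
a_covered, a_total = 950, 1000
b_covered, b_total = 400, 1000

overall_before = 100.0 * (a_covered + b_covered) / (a_total + b_total)
overall_after = 100.0 * b_covered / b_total

print(overall_before)  # combined coverage with both features
print(overall_after)   # after A is removed; B's coverage itself is unchanged
```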

This is one reason I think we should look at feature coverage, but that is a harder problem (how do you define a feature?). :)

bartlettroscoe commented 1 week ago

The percent coverage might tell you some information but not much without additional information, so it is too early to label changes as significant or massive.

Agreed

I am assuming that the no-epetra case not only is turning off the Epetra tests but also is not counting the lines ifdef'ed for Epetra code. Thus we are changing the tests used AND the lines of code counted.

That is why we need to add labels to fully differentiate library code from test-only code.

I think it is better to look at the raw line-coverage data and see if important functions/capabilities have coverage (as determined by developers and stakeholders). The Epetra coverage can indicate which features/code were previously covering some lines of code, but it can't tell you its importance (e.g., pivotal, minor, or dead code).

Agreed, but that could be a tedious effort. The idea is that differences in bulk line coverage (at the package level, directory level, file level, etc.) will help us prioritize our scrutiny. (For example, you should likely first look at packages and files that see more than a 10% decrease in line coverage before you look at packages and files with less than a 1% decrease.)

Consider the following thought experiment: if a feature/code with high coverage is deprecated, leaving only a feature/code with lower coverage, the overall percentage coverage may decline, but the line coverage for the remaining code remains unchanged (no actual reduction in coverage), assuming no overlap in the features/code. If there is an overlap in the features/code, one might refer to it as a reduction in coverage.

That may be true in some cases, but I think most of the features in downstream packages should be equally valid for both the Epetra and Tpetra backend implementations of the linear algebra data structures.

This is one reason I think we should look at feature coverage, but that is a harder problem (how do you define a feature?). :)

Exactly! How do you enumerate features in library code like Trilinos in a systematic way? It is much easier with an application code that is driven through an input deck. (That is what Sierra is doing to define features, I believe.)

bartlettroscoe commented 1 week ago

@ccober6, you know what would likely be a better approach? What if all of the important Trilinos customers built and installed coverage-enabled configurations of Trilinos for all of their build configurations and used them when running their application test suites? Then you could gather up all of those Trilinos coverage files and post them to the Trilinos CDash site in special builds (Kitware could help us with that). Then you could diff that coverage against what is produced by the Trilinos test suite (for all downstream packages and for just each native package's test suite). Then you would clearly see where the gaps in Trilinos testing are.

The only problem with that approach is that many of the Trilinos apps may also have poor coverage of their own features, which would in turn mean poor coverage of Trilinos features.

The moral of the story is to do good requirements management and write strong high-coverage tests for production code. That takes us back to the TriBITS Lifecycle Model 👍