Rework CI & Tests Handling

DyXel commented 1 week ago

This PR reworks the Github Actions workflow files and its associated code, so that there's only 1 pipeline that does all that we might need:

Build cppfront
Run passthrough tests
Run regressions tests
... more in the future? see below

Currently, the pass-through tests are not executed in the CI, and regressions test have duplicated logic to build cppfront (granted, its not that complex anyways).

The idea is that whatever cppfront lowers or outputs should be platform agnostic, likewise for the built executables, and so, for most tests there should be only a single output with which to test against. Additionally, plenty of tests do not output anything when executed, so rather than creating empty files, we should check that the executables under test didn't output anything as well.

Let's take as an example this recent commit, which added 1 test: https://github.com/hsutter/cppfront/commit/54edaba0a66f205721127857a7bf75063f8b2fbd. It created the following files (related to testing):

regression-tests/pure2-trailing-comma-assert.cpp2
regression-tests/test-results/pure2-trailing-comma-assert.cpp
regression-tests/test-results/pure2-trailing-comma-assert.cpp2.output
regression-tests/test-results/pure2-trailing-comma-assert.out
regression-tests/test-results/apple-clang-14-c++2b/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/apple-clang-14-c++2b/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/apple-clang-15-c++2b/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/apple-clang-15-c++2b/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/clang-12-c++20/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/clang-12-c++20/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/clang-15-c++20-libcpp/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/clang-15-c++20-libcpp/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/clang-15-c++20/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/clang-15-c++20/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/clang-18-c++20/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/clang-18-c++20/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/gcc-10-c++20/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/gcc-10-c++20/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/gcc-13-c++2b/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/gcc-13-c++2b/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/gcc-14-c++2b/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/gcc-14-c++2b/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/msvc-2022-c++20/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/msvc-2022-c++20/pure2-trailing-comma-assert.cpp.output
regression-tests/test-results/msvc-2022-c++latest/pure2-trailing-comma-assert.cpp.execution
regression-tests/test-results/msvc-2022-c++latest/pure2-trailing-comma-assert.cpp.output

Other than the test itself, it added 25 new files: Two per each target, plus the lowered output, plus the output of cppfront. 23 of those files are empty.

After the rework, the added files would look like this instead:

test/regression/pure2-trailing-comma-assert.cpp2
test/regression/results/pure2-trailing-comma-assert/00-lowered.cpp
test/regression/results/pure2-trailing-comma-assert/01-cppfront.output

In general, the new format for regression tests would be:

test/regression/test-name.cpp2
test/regression/results/test-name/00-lowered.cpp (if cppfront doesn't throw an error)
test/regression/results/test-name/01-cppfront.output
test/regression/results/test-name/02-(ctx.target)$-compiler.output (only if compiler outputted something)
test/regression/results/test-name/03-executable.output (only if executable outputted something)

Where (ctx.target)$ would be the specified target (e.g. x64-linux-g++-c++20), again, only if the compiler outputted something. This reorganization makes it much easier to find results of a test and decreases considerably the amount of files needed per test, it also makes adding new targets trivial as most tests should not output anything when compiling.

The regression runner itself is written in Cpp2 as opposed to being a shell script. Ideally we treat testing just like we treat code, and re-use most logic of this file when implementing other types of test.

I started this as a side-quest, to pave up the way to test the reflection API (the functions intended to be used by end-users when they author their own metafunctions), and because it seemed to me that adding like 20~ files per new additional regression test is excessive.

DyXel commented 1 week ago

Opened as a draft, as this is not yet ready for merge. I would say its like 80% there, but that remaining 20% could prove to be a lot of effort, so I first want to know if there's interest in moving this forward, otherwise I am happy to shelve the work so far. I had a lot of fun writing the runner itself, and I think we need to rework the way we do tests anyway as we add other types of tests (for example reflection api tests as mentioned above, or unit tests for the compiler code).

hsutter commented 1 week ago

Thanks! Hot take: This could be good, but it will also take me a lot to absorb. I'm used to the current workflow, but that doesn't mean I can't get used to a better one. Is the idea to streamline the CI test pipeline, or also the checked-in tests for each platform? (The answer is probably in the "changed files" but there are over 3,000 so I didn't even try to look.)

The idea is that whatever cppfront lowers or outputs should be platform agnostic,

Yes, corresponds to that there's one /test-results directory independent of the Cpp1 compilers.

likewise for the built executables,

If you mean cppfront.exe, yes. If you mean an executable built for one of the test cases, those will vary considerably by compiler based on mesasges/quirks/bugs. A few of the tests aren't supported by GCC 10 for example but work fine on 11+, or hit a Clang 12 bug but work fine in 13+.

for most tests there should be only a single output with which to test against.

Wait, you mean per Cpp1 compiler? I currently keep one set of outputs (.execution and .output) per Cpp1 compiler because they all have different messages/quirks/bugs.

Additionally, plenty of tests do not output anything when executed, so rather than creating empty files, we should check that the executables under test didn't output anything as well.

That seems reasonable, replacing an empty file with no file. The potential danger is not detecting when a test failed to run (e.g., the Cpp1 compiler crashed)... that's one reason I keep the file even if it's empty.

Thanks for looking into this, though. I'm now leaving for the ISO C++ meeting so I'll be much slower to respond for the next week, but I appreciate all the input.

DyXel commented 1 week ago

Is the idea to streamline the CI test pipeline, or also the checked-in tests for each platform? (The answer is probably in the "changed files" but there are over 3,000 so I didn't even try to look.)

There's no way I am making someone review that many files 😂 So here the idea would be to first enhance the pipeline, without changing any tests (unless we notice some previously broken tests).

If you mean an executable built for one of the test cases, those will vary considerably by compiler based on mesasges/quirks/bugs. A few of the tests aren't supported by GCC 10 for example but work fine on 11+, or hit a Clang 12 bug but work fine in 13+. Wait, you mean per Cpp1 compiler? I currently keep one set of outputs (.execution and .output) per Cpp1 compiler because they all have different messages/quirks/bugs.

Yes, I am aware some tests have the compiler output something, that should be caught with this new system as well. The main difference would be that their outputs will sit on the same folder.

The potential danger is not detecting when a test failed to run (e.g., the Cpp1 compiler crashed)... that's one reason I keep the file even if it's empty.

I think we can easily detect if something like that happened (for example by checking the error code and making sure there was actually some output in stdout/stderr). But yes, the main idea is to reduce the filecount to something reasonable, while giving a better organization to the test suite. I actually have an item in my todo list to create a "control test" for each case that a test can result in.

Thanks for looking into this, though. I'm now leaving for the ISO C++ meeting so I'll be much slower to respond for the next week, but I appreciate all the input.

Thanks for the heads-up, see you around!

DyXel commented 1 week ago

To Dos if we decide to move this forward (in no particular order):

[x] Make regression-runner run on Windows
[x] Clean-up regression-runner code a bit
[ ] Add all previous targets/configs to the new pipeline file
[ ] Add extra target that compiles in debug (as opposed to O2 build)
[ ] Add some "control cases" to make sure the logic on the test runner behaves as expected (and fix bugs along the way of course)
[ ] Create a script that will rename test files to the new structure
- We shouldn't actually create the new test results with the new runner, because that would break the "trust chain" established by the current pipeline.
[ ] Execute passthrough test on Pipeline
[ ] Copy patch upload from existing regression test pipeline
[ ] Add option to the test runner to overwrite test results

Wishlist (not necessary for merge IMO but would be really cool to have):

[ ] Add pretty/colored output
[x] Run tests in parallel (multi-threaded)
- ~~The problematic bit seems to be launching programs (which changes work directory), maybe a global lock + std::async ought to work.~~
[ ] Write some guide on how to use the regression-runner
[ ] Allow adding specific settings to a test within the file itself as a comment
- For example, different flags for cppfront, in case we want to test the flag itself

DyXel commented 1 week ago

Removed all the test changes for now.

MaxSagebaum commented 1 week ago

Opened as a draft, as this is not yet ready for merge. I would say its like 80% there, but that remaining 20% could prove to be a lot of effort, so I first want to know if there's interest in moving this forward, otherwise I am happy to shelve the work so far. I had a lot of fun writing the runner itself, and I think we need to rework the way we do tests anyway as we add other types of tests (for example reflection api tests as mentioned above, or unit tests for the compiler code).

I think this would be really nice and it would also simplify updating the results.

MaxSagebaum commented 1 week ago

I have not looked at the MR yet. But one addition we currently do not have is parallel evaluation. Is this now possible would it be "easy" to add with the new framework?

DyXel commented 1 week ago

I have not looked at the MR yet. But one addition we currently do not have is parallel evaluation. Is this now possible would it be "easy" to add with the new framework?

It's on my wishlist above. To make this multithread safe (thus run things in parallel) we'd need to implement a safe launch_program function (for posix-like and Windows) and then launch and collect tasks (std::async should make this somewhat easy). Another option (which I don't like and personally wouldn't implement) is run the executable once for each test and collect the results outside the program with a intermediate script.

DyXel commented 1 week ago

I have not looked at the MR yet. But one addition we currently do not have is parallel evaluation. Is this now possible would it be "easy" to add with the new framework?

Just pushed the changes to run things in parallel, super simple implementation with std::async, take the results with a grain of salt, but on my machine running all tests went roughly from 1m23.470s to 0m14.370s. 🤯

MaxSagebaum commented 6 days ago

This sound nice. I will have closer look next week. Maybe I can already use this to improve testing for the regexes.

hsutter / cppfront

Rework CI & Tests Handling #1137