nokeedev / gradle-native

The home of anything about Gradle support for natively compiled languages
https://nokee.dev
Apache License 2.0

Spike multi-tool/multi-version testing coverage for Nokee #527

Open lacasseio opened 2 years ago

lacasseio commented 2 years ago

This issue is a brain dump on improving testing coverage for Nokee. I will try my best to untangle everything. See https://github.com/nokeedev/gradle-native/issues/514, https://github.com/nokeedev/gradle-native/issues/513 and https://github.com/nokeedev/gradle-native/issues/526 for some short-term goals.

Problem Space

To start, we want to avoid blindly spreading all tests across all permutations of tools and their supported versions. It's counter-productive and doesn't address the core issue, which is focusing on what we are trying to verify. Some verification only needs a tool (regardless of version, vendor, etc.) that fulfills some requirements, e.g. can compile Swift 3 source files, can compile C sources, or is an Xcode installation. We can be as abstract or as precise as we need, e.g. Clang 13.0.2, or MSVC 2022 with the C++ component, MSBuild, etc. The idea is to declare what we need for the verification to be successful.
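
A minimal sketch of such a declaration, assuming a hypothetical `@RequiresTool` annotation (not an existing Nokee API):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: the test declares *what* it needs,
// not *which* installation to use.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface RequiresTool {
    // Abstract capability, e.g. "compile-c" or "compile-swift-3"
    String capability();

    // Optional precision: pin a vendor/version when the test needs it,
    // e.g. tool = "Clang", version = "13.0.2"; empty accepts any fulfilling tool.
    String tool() default "";
    String version() default "";
}

class ToolchainDetectionTest {
    @RequiresTool(capability = "compile-c")
    void worksWithAnyCCompiler() { /* ... */ }

    @RequiresTool(capability = "compile-c", tool = "Clang", version = "13.0.2")
    void handlesClang13Specifics() { /* ... */ }
}
```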

In some cases, we need to execute the same test against a wider range of versions (and possibly tools). For example, to test that we can detect GCC, assuming we have GCC 4 through 11 available, a full test would need to check against GCC 4, 5, 6, 7, 8, 9, 10 and 11, while a quick test may only check against the latest, e.g. GCC 11. We may also want to configure what "quick" means under certain scenarios. With our GCC example, we may want to include GCC 4 as well, given the tool is old and may have special handling in the production code.
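
A minimal sketch of how a full vs quick selection could be computed for this GCC example (the context names and the pinned GCC 4 rule are illustrative):

```java
import java.util.List;

class GccCoverageSelection {
    static final List<String> AVAILABLE = List.of("4", "5", "6", "7", "8", "9", "10", "11");

    static List<String> select(String context) {
        switch (context) {
            case "full":  // every supported version
                return AVAILABLE;
            case "quick": // latest only, plus GCC 4 because the old tool may
                          // have special handling in the production code
                return List.of(AVAILABLE.get(AVAILABLE.size() - 1), "4");
            default:
                throw new IllegalArgumentException("unknown coverage context: " + context);
        }
    }
}
```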

Tools are also subject to availability. We may have to run on older machines to access older tools that are no longer available on newer machines. Some tools are also only available on specific operating systems: MSVC and MinGW are only available on Windows, Xcode is only available on macOS, and swiftc is provided by Xcode on macOS but as a standalone toolchain on Linux. In those cases, it's easy to unintentionally skip some tools, resulting in a lack of coverage without clear signals.

During development, we need sensible defaults so we can quickly and efficiently develop the code while still being able to force certain tools to be included in our tests. Here we are constrained by Gradle and its integration with the IDE. We want to avoid as much as possible behaviour hacking on Gradle tasks, e.g. folding multiple behaviours into the same task, but we also want to avoid generating every possible behaviour permutation as Gradle tasks. Regardless of how the test tasks are split/configured, the what should always be obvious and straightforward, e.g. via the metadata attached to the test tasks. This metadata will be the key to composing our CI pipeline.
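
As a sketch, that metadata could be a small value type along these lines (all field names are illustrative):

```java
import java.util.List;

// Illustrative shape of the metadata a test task could carry; the CI pipeline
// selects tasks from these fields alone, never from the tasks' behaviour.
record TestTaskMetadata(
        String osFamily,            // e.g. "windows", "linux", or "agnostic"
        String tool,                // e.g. "gcc", "xcode", or "any"
        List<String> toolVersions,  // exact versions this task covers
        String coverageContext      // e.g. "quick" or "full"
) {}
```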

Current Situation

We currently use an outdated Spock 1.x extension for all of our functional tests. The extension was picked up from the Gradle code base, which has since moved to Spock 2.x. As a whole, we want to remove our dependency on Spock, meaning a Spock-based solution is not an option. Our reason for moving away from Spock is mostly our bad experience with Spock/Groovy as a testing framework/language.

The Spock extension duplicates test cases based on their coverage context, which supports all, partial, default or specific versions. The extension detects available versions and executes its selection against those candidates. A tool's availability is never asserted, allowing for "unintentional skips" leading to a lack of coverage. Despite that, we currently only use the default coverage context, meaning we always test against the first available tool. The extension also offers some support for specifying tool requirements.

One considerable downside of the Spock extension is that all toolchain discovery is done by the extension itself. The problem lies in the separation of concerns: the build system is responsible for making decisions based on the environment, but the extension also makes its own decisions from what it sees in the environment. There is a conflict of responsibility. Ideally, the data should flow in one direction, with the extension focusing on test execution based on the data provided by the build system.

Proposed Solution

The proposed solution is a combination of multiple pieces that collaborate:

  1. Test suite variants. Gradle speaks in terms of tasks, and so does IntelliJ. If we want quick menus for executing tests or quickly cookie-cutting ranges of tests to execute, we need to create various test suite variants. Our gradle-plugins/toolbox project takes care of the variant calculation via testingStrategies. We have to carefully declare our testing variants to avoid generating an unmanageable number of additional test tasks.
  2. Testing strategy mapping. The most important point is that testing strategies map to test suite variants, which means multiple testing strategies can map to fewer test suite variants. For example, all operating system family strategies could map to a single test suite variant. It's not a groundbreaking example, but given that we can't execute tests remotely on different operating systems, it serves no purpose to generate multiple tasks; if we could execute remotely, we would have to generate multiple tasks. Regardless of how we map strategies to variants, the metadata included in the testing strategies should be the same, as points 4 and 5 will use that data to select which tasks to execute.
  3. Environment detection. Each CI agent may be different, e.g. macos-10.15 vs macos-11. Our local machine may also be different. When developing, we may not care about the exact tool versions available. However, on CI we care that our code is tested against the exact tools and versions.
  4. Continuous integration testing. Considering the environment, our CI tasks should select the right tasks to include in each coverage bucket. A quick test on ubuntu-latest will select all OS-agnostic tests as well as Linux-specific tests, while on windows-latest we will select only Windows-specific tests (see the sketch after this list). The behaviour differs from local execution of the same task, where OS-agnostic tests are always selected. The version coverage also differs: a quick test on macos-10.15 would test against Xcode 12.4, but on macos-11 we would test against Xcode 13.2.1. A full test would select all available versions that we care about (sometimes a representative subset of all available versions is good enough).
  5. Coverage reporting and assertion. As shown in the last point, the latest available version varies based on the CI agent. It's important that we 1) declare the testing coverage we expect, 2) slice and dice the execution onto all agents that together fulfill the expected testing coverage, and 3) assert that our expected testing coverage was met. Given it's a cooperation between multiple machines, it's quite important that the bulk of the knowledge is held by the build system so we can manipulate it however we need.
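
A minimal sketch of the selection described in point 4, using the agent behaviour from the example (the record shape and all names are illustrative):

```java
import java.util.List;

record TaskMeta(String name, String osFamily) {}

class CiTaskSelection {
    // Select the tasks a given agent should run based on metadata alone.
    static List<TaskMeta> selectFor(String agentOsFamily, List<TaskMeta> allTasks) {
        return allTasks.stream()
                .filter(task -> task.osFamily().equals(agentOsFamily)
                        // OS-agnostic tasks run once, on the Linux agent,
                        // matching the ubuntu-latest vs windows-latest example
                        || (task.osFamily().equals("agnostic") && agentOsFamily.equals("linux")))
                .toList();
    }
}
```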

OS Strategies

Some tests require a specific OS environment while others can run on any OS (OS-agnostic). Typically, users would simply annotate their tests with @EnabledOnOs and co., then execute the tests on each respective OS to get full coverage. In a utopian world, the build system would be able to spawn the right environment for each test, locally and on CI. For this, we would need multiple tasks (one for each scenario). However, more tasks make it harder to call the right one, especially in IntelliJ with the quick test run button. There are three possible solutions: 1) fold all OS test variants into a single variant, 2) disable unnecessary test variants to hide them from IntelliJ during sync, or 3) a mix of both. The reason for folding all OS test variants into a single variant is simply that during development we usually want to run all agnostic and current-OS tests. Creating multiple variants opens the possibility of distributing the tests onto other OSes from a single machine. Regardless of the choice here, there should be no impact on the other pieces of the solution (CI, reporting, assertion, etc.).
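
For reference, the conventional JUnit 5 approach mentioned above (test names are illustrative); note that nothing asserts the Windows-only test actually ran anywhere, which is exactly the unintentional-skip problem:

```java
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.condition.EnabledOnOs;
import org.junit.jupiter.api.condition.OS;

class MsvcLocatorTest {
    @Test
    @EnabledOnOs(OS.WINDOWS) // silently skipped on any other OS
    void locatesMsvcInstallation() { /* ... */ }

    @Test // OS-agnostic: runs everywhere
    void parsesCompilerVersionOutput() { /* ... */ }
}
```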

Multi-tool and multi-version Strategies

Computing every permutation of tool and version as test variants may be overkill. The Gradle codebase uses coverage contexts controlled by system properties. We aren't big fans of this approach simply because it requires a bit of gymnastics to configure and run locally. We feel there is a good middle ground between coverage contexts and exact tool/version variants. Just like the OS strategies, regardless of how we do it, the metadata should stay the same, removing any impact on the other pieces of the solution.
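
For illustration, a minimal sketch of that system-property style using JUnit 5 parameterized tests; the gccVersions property name is hypothetical, and the build system would be the one setting it:

```java
import java.util.Arrays;
import java.util.stream.Stream;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.MethodSource;

class GccDetectionTest {
    // e.g. the build passes -DgccVersions=11 for a quick run
    // or -DgccVersions=4,5,6,7,8,9,10,11 for a full run
    static Stream<String> coveredGccVersions() {
        return Arrays.stream(System.getProperty("gccVersions", "11").split(","));
    }

    @ParameterizedTest
    @MethodSource("coveredGccVersions")
    void detectsGcc(String version) { /* ... */ }
}
```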

lacasseio commented 2 years ago

An additional problem not mentioned is tests requiring multiple tools, especially if they need a wide range of them. The solution would be to model all the tools together as part of a pseudo-installation. As the number of permutations goes up, it's always going to get messy; we just need to manage the messiness by avoiding narrow-minded solutions.
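
A minimal sketch of the pseudo-installation idea (all types are illustrative):

```java
import java.util.List;

record Tool(String name, String version) {}

// A pseudo-installation groups the tools a test needs so requirements can be
// expressed (and checked) against a single composite unit.
record PseudoInstallation(List<Tool> tools) {
    boolean fulfills(List<Tool> required) {
        return tools.containsAll(required);
    }
}
```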

lacasseio commented 2 years ago

It seems there is a clear distinction between the tools/versions under test and the testing strategies. We should model both separately so we can reuse the tools/versions under test to assert coverage at the end of the CI pipeline. Also, tools/versions may map to overlapping testing strategies for different contexts/scenarios. One such example is the coverage context: for development we will want something easy like partial (the latest available versions), all (all available versions) or default (the first available version). For CI, we may want xcode12.4 (all tests that need exactly Xcode 12.4) or allXcode (all tests that need any Xcode version) or something similar.
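
As a sketch, the same model of versions under test could back both the development and CI contexts (the context names follow the examples above; the selection rules are illustrative):

```java
import java.util.List;
import java.util.Set;

class XcodeCoverage {
    // Xcode versions under test, modelled once and reused by every context
    static final List<String> UNDER_TEST = List.of("12.4", "13.2.1");

    static Set<String> context(String name) {
        return switch (name) {
            case "default" -> Set.of(UNDER_TEST.get(0));                     // first available
            case "partial" -> Set.of(UNDER_TEST.get(UNDER_TEST.size() - 1)); // latest available
            case "all", "allXcode" -> Set.copyOf(UNDER_TEST);                // every version
            case "xcode12.4" -> Set.of("12.4");                              // exact version, for CI
            default -> throw new IllegalArgumentException("unknown context: " + name);
        };
    }
}
```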

lacasseio commented 1 year ago

We did a bit more thinking here. The initial write-up is fantastic! Good job, past Daniel! The point that should be cleared up is the distinction between requirements and coverage. The requirements decide whether the test can execute in the current environment. Ex: a test requires a C-compatible toolchain, or requires any major OS. The coverage dictates which variants are "good enough" for executing this test. Ex: a major toolchain (any of GCC, Clang or MSVC) or every latest GA major toolchain (GCC 12.2, Clang 15.0.1, MSVC 2022).

It's important to note that requirements imply some coverage. Ex: requiring Gradle 7.5 and up would imply Gradle 7.5, 7.5.1 and nightly. The coverage declaration would make that coverage explicit.
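
A minimal sketch of that implication, assuming an illustrative list of known Gradle versions:

```java
import java.util.List;

class ImpliedCoverage {
    // Illustrative: the versions the build system knows about, oldest first
    static final List<String> KNOWN_GRADLE = List.of("7.4.2", "7.5", "7.5.1", "nightly");

    // "7.5 and up" expands to every known version from 7.5 onward;
    // assumes the minimum version appears in the known list.
    static List<String> impliedBy(String minimumVersion) {
        int from = KNOWN_GRADLE.indexOf(minimumVersion);
        return KNOWN_GRADLE.subList(from, KNOWN_GRADLE.size()); // 7.5, 7.5.1, nightly
    }
}
```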

Using coverage contexts, we can select a subgroup of the coverage selection: latest, latest-available, all, partial, default, etc. We could imagine an uber-all context that would select every possible coverage. In general, the coverage should be selected from the explicit coverage declaration.

Some coverage declarations could be more lenient in their selection. Ex: given a requirement for a C99-compatible toolchain and a coverage of "any one toolchain", GCC 9 could be the only test variant and would be just as valid as Clang 10 or MSVC 2019.

Some coverage/requirements may imply a default selection. For example, MSVC would imply OS coverage on Windows, while a strict GCC or Clang requirement would imply Linux, it being the most cost-effective machine to use.

When asserting the coverage, we would consider all this information and cross-reference test execution across all CI jobs. For coverage that specifically dictates execution on multiple OSes, if the test were skipped on Windows, we would deem the test coverage a failure. However, if we specified any OS, which implies Linux by default, then skipping on Windows or macOS would be correct as long as the test wasn't skipped on Linux.
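
A minimal sketch of that cross-referencing, covering both the "must run on all required OSes" and "any accepted OS is enough" rules (the record shape is illustrative):

```java
import java.util.List;
import java.util.Set;

// One row per test execution reported by a CI job.
record Execution(String testId, String os, boolean skipped) {}

class CoverageAssertion {
    // Coverage dictating multiple OSes: a skip on any required OS is a failure.
    static boolean coveredOnAll(String testId, Set<String> requiredOses, List<Execution> jobs) {
        return requiredOses.stream().allMatch(os -> ranOn(testId, os, jobs));
    }

    // "Any OS" coverage: one unskipped execution on an accepted OS is enough.
    static boolean coveredOnAny(String testId, Set<String> acceptedOses, List<Execution> jobs) {
        return acceptedOses.stream().anyMatch(os -> ranOn(testId, os, jobs));
    }

    private static boolean ranOn(String testId, String os, List<Execution> jobs) {
        return jobs.stream().anyMatch(e ->
                e.testId().equals(testId) && e.os().equals(os) && !e.skipped());
    }
}
```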

To recap, there is a distinction to be made between requirements and coverage. The coverage would also fuel the parameterized tests.