Love the data! How will test time compare against build times in AR after this change? I'm assuming the long pole will continue to be the non-unity Linux build?
Yeah, our long poles will still be the biggest drag on AR, but if we can, on average, a) reduce the time spent running tests and b) reduce the number of attempts to pass AR, then we should see some meaningful savings over time 👍
I have been looking forward to this for a while and have been getting info on it wherever I can. Test speed and efficiency are things I focus on a lot with automated tests (especially the slower python ones) and this will help enhance that.
@jonawals Couple questions:
1. Are there any code previews or examples of the system itself we can see as to how its implementation looks so far? I know it's mostly on the pipeline side, but I'm just curious as to what it looks like.
2. The explanation is thorough and I was able to get an idea of how it will work, but one thing I was confused about was the Indirect Source Coverage (ISC) functionality with components and Python tests. Since no code example exists, do you mean that if I had a Python test file registered via `CMakeLists.txt` like this:
```cmake
ly_add_pytest(
    NAME AutomatedTesting::Atom_Main_Null_Render_MaterialEditor
    TEST_SUITE main
    TEST_SERIAL
    TIMEOUT 700
    PATH ${CMAKE_CURRENT_LIST_DIR}/TestSuite_Main_Null_Render_MaterialEditor_01.py
    RUNTIME_DEPENDENCIES
        AssetProcessor
        AutomatedTesting.Assets
        MaterialEditor
    COMPONENT
        Atom
)
```
Will it run my test if the "Atom" component contains changes in the current PR?
Thanks again, can't wait for this feature to go in.
Regarding 1), the runtimes for the C++ and Python tests live in `Code\Tools\TestImpactFramework`, and the scripts to glue it all together in AR live in `scripts\build\TestImpactAnalysis`.
Regarding 2), what will happen is that the Python Coverage Gem (`AutomatedTesting\Gem\PythonCoverage`) will listen for any components that are added to any entities in the Editor and log them. Where possible, these components are then used to backtrack to the Gems (if any) they belong to, and we infer that any source changes to those Gems in turn mean that the Python tests those components were enumerated for are covering those component sources. So, in your example, should AutomatedTesting::Atom_Main_Null_Render_MaterialEditor add any components that can be backtracked to sources in this way, we will run your test if those sources change 👍
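To make that concrete, here is a minimal, hypothetical sketch of the backtracking step (the function and mapping names are purely illustrative, not the actual TIAF/PythonCoverage API, and it leaves out the rule, covered later in the RFC, that a change list containing unknown files forces all Python tests to run):

```python
# Hypothetical sketch of the ISC backtracking described above; all names are
# placeholders for data the PythonCoverage Gem would log at runtime.

def python_test_covers_change(enumerated_components, component_to_module,
                              module_to_sources, changed_files):
    """Return True if the test that enumerated these components covers any changed file."""
    covered_sources = set()
    for component in enumerated_components:
        module = component_to_module.get(component)
        if module is None:
            # Component could not be backtracked to a Gem; it contributes no coverage.
            continue
        # Sources that build the component's parent module are treated as covered.
        covered_sources.update(module_to_sources.get(module, ()))
    return any(path in covered_sources for path in changed_files)
```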
Definitely love the focus on reducing execution times! If we accept this test-skipping feature, it should still exist alongside other performance and stability improvement efforts to directly improve the long running tests.
* What is the ongoing infrastructural cost of adding this feature?
In terms of infrastructure, there shouldn't be any additional costs so long as we aren't running both CTest and TIAF. Right now, we have our own internal AR pipeline dedicated to TIAF that runs in addition to the public AR for O3DE. For us internally, this is adding additional costs only because we are duplicating work. If and when we choose to move the LF AR over to TIAF, we should aim to do so in such a way we minimize (or remove entirely) any transition periods that require the duplication of AR work.
* How much data is persisted, and how much does it cost to synchronize it on the fleet?
* Do different platforms have different coverage data/formats?
* Does the instrumentation increase build or test execution times? (separate from potential gains from test deselection)
The data for each AR run is stored in an S3 bucket (separated out by repo, platform, branch and build config) and is minuscule in size (kilobytes). The data need not persist for long, and there wouldn't be any problem with auto-culling it after so many days or weeks. The worst thing that can happen should the data be trashed is a re-seed of the coverage.

Right now, we only support Windows. This was originally due to the practicalities of prototyping for one platform only. Unfortunately, there is no cross-toolchain, cross-platform instrumentation tool, so this would need to be adapted for, say, Linux and Mac (the TIA framework can accommodate different instrumentation on a platform-by-platform basis; it just requires more work to write it).

As for the extra latency incurred by instrumentation: for Python tests the instrumentation comes for free, and for C++ tests, using a branch of OpenCppCoverage, we do under some circumstances add a bit more, but additional optimizations have brought this to near enough parity with uninstrumented test runs (the overall savings are still a net benefit). It would be worth looking into a "one size fits all" instrumentation solution for C++ tests, as OpenCppCoverage has its problems and is effectively a dead project (no PRs merged since May 2020).
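As a rough illustration of that storage layout (the key structure and file name below are hypothetical, not the actual keys used):

```python
# Hypothetical sketch of keying per-run coverage artifacts by repo, platform,
# branch and build config, as described above.
def coverage_artifact_key(repo, platform, branch, build_config, run_id):
    return f"{repo}/{platform}/{branch}/{build_config}/{run_id}/coverage.json"

# e.g. coverage_artifact_key("o3de", "Windows", "development", "profile", "1234")
```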
* Will there be a way to opt-in to enabling TIAF deselection for a test module?
* SIGs may want to onboard/offboard certain tests to be candidates for deselection
We have the ability right now to opt out of TIAF via a config file generated by CMake (tbh it's not the best approach, but the code pathways are there, at least). What happens when a test target is excluded is that it's simply not selected at all. What we could do is add an additional list for "test targets to exclude from selection but run anyway". That way, if stakeholders want to be sure their test targets are always run regardless of TIAF's test selection, they can do so.
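To illustrate, a minimal sketch of what such an opt-out plus "always run" list could look like (the key names and shape are hypothetical, not the actual CMake-generated config):

```python
# Hypothetical illustration of an exclusion list plus the proposed
# "exclude from selection but run anyway" list; all names are made up.
runtime_config = {
    "excluded_test_targets": ["SomeGem.Tests"],        # never run by TIAF
    "always_run_test_targets": ["CriticalGem.Tests"],  # run regardless of selection
}

def final_test_targets(selected_targets, config):
    targets = [t for t in selected_targets
               if t not in config["excluded_test_targets"]]
    # Stakeholder-pinned targets bypass TIAF's selection entirely.
    targets += [t for t in config["always_run_test_targets"] if t not in targets]
    return targets
```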
I have some concern about excluding tests based on the historical coverage of their call graph. While this is a good approximation of what dependencies should exist for unit-scope tests, it is a less stable approximation for tests with a call graph magnitudes larger, particularly because that graph is under near-constant change from seemingly unrelated features, including changes that can vary dynamically at runtime. The purpose of the broader tests is to detect the subtle issues where an incoming change unintentionally introduces new broken dependencies, or breaks implicit dependencies such as assets, configuration files and other environmental state. Most implicit dependencies appear not to be tracked, and adding coverage metrics for them seems nigh impossible. Due to this, deselecting the larger tests would create a tradeoff that primarily finds new regressions within one feature's code, instead of between features and dependencies. (However, only reducing the set of unit-scope tests would greatly reduce the overall throughput improvement, so there's definitely desire to reduce the broader set!)
We have the option to run TIAF in "safe mode". What this does is select the tests to run, but then run the discarded tests afterwards as a separate run. We send all of this data off to wherever it needs to be stored for reporting purposes, and then we can analyse any misfires where TIAF discarded a test that turned out to detect legitimate regressions for the given changes in the PR. I think as part of the onboarding process for TIAF we should definitely consider using this mode for a while to gather enough data for us to decide whether the tradeoff is worth letting, say, nightly builds pick up any misfires (if any) etc. 👍
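A minimal sketch of what safe mode amounts to, under the assumption that only the selected set gates AR (function names are illustrative, not the actual implementation):

```python
# Illustrative sketch of "safe mode": the selected tests gate AR, the discarded
# tests run afterwards as a separate, non-gating pass, and misfires (discarded
# tests that failed) are reported for later analysis.
def run_safe_mode(all_tests, selected_tests, run_test, report):
    discarded_tests = [t for t in all_tests if t not in selected_tests]

    gating_results = {t: run_test(t) for t in selected_tests}
    shadow_results = {t: run_test(t) for t in discarded_tests}

    misfires = [t for t, passed in shadow_results.items() if not passed]
    report({"gating": gating_results, "shadow": shadow_results, "misfires": misfires})

    return all(gating_results.values())  # only the gating set decides the AR outcome
```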
Looks great @jonawals, I wanted to verify a couple of points
it takes on average three attempts at AR
Do you have the data for this? Or is this based on anecdotal evidence? It'd be great if we could look back over the data and see this number
it takes on average 15 to 20 minutes
Same for this, can you point to the tracking data for this if possible (this will be super useful to help then compare the before/after for TIAF)
I'll reach out to my colleague who has this data handy and update here 👍
I'm also curious to learn more about the other platform support (most notably Linux). Is it possible to swap out OpenCppCoverage (only supported on Windows) with something like gcov or lcov on Linux to achieve the same effect?
We can, and that would make perfect sense for the Linux and Mac platforms. All that is required is to make the modifications to the build configs to instrument the binaries accordingly and write a small adapter for the generated coverage data 👍
@jonawals
I'll reach out to my colleague who has this data handy and update here 👍
Excellent thanks!
We can, and that would make perfect sense for the Linux and Mac platforms. All that is required is to make the modifications to the build configs to instrument the binaries accordingly and write a small adapter for the generated coverage data 👍
Great to hear, I think once things are settled on Windows this is a logical next step to improve Linux test times
Given the large scope, we likely should bring this in front of the TSC and not unilaterally accept from only SIG-Testing. After review and refinement with this group, this should seek wider approval.
As long as there's an easy opt-in, then I see little reason to prevent other SIGs from using such a tool. Opt-in seems initially preferable to opt-out while onboarding to a new tool, though switching to opt-out (default in) may eventually be preferable in the long term.
Then this would only require SIG-Testing to document why someone would want to opt to always have a given test run, and leave the power in other SIGs' hands. For example: the benefit of running tests less often vs. the potential pain of a deselected test breaking, tangentially-related broken code shipping, and then AR blocking the next new changes to the test's feature area.
Additional discussion between @FuzzyCarterAWS and @jonawals :
Jenkins' test runtimes would become more spiky in the future with this feature enabled. I don't directly see this as an issue, but it may be slightly surprising if the AR suites grow to be able to spike up to, say, 5 hrs for a core checkin. Should be less of an issue if all tests are configured to always run in the Periodic builds. (Frankly a separate issue if SIGs ever start allowing 5 total hours of tests into the smoke+main suites.)
Notable that OpenCPPCoverage is GPL 3.0 and thus any fork should also be. May be a snag for the TSC to buy in.
Is TIAF intended to be locally runnable? Would we be syncing its dependencies alongside O3DE's normal dependencies?
Thank you @Kadino for putting the discussion points of the SIG-Testing meeting on here!
Notable that OpenCPPCoverage is GPL 3.0 and thus any fork should also be. May be a snag for the TSC to buy in.
We do have a fork, but it inherits the GPL 3.0 license (we forked in order to open a PR to merge in the new coverage mode required by TIAF, but no PRs have been merged since summer 2020...). As we do not host nor distribute the source or binary, and as we only use the binary, both LF and our internal legal team have determined we're ok to use said binary 👍
A longer term solution would be to implement our own coverage tool as it would appear that OpenCppCoverage is a de facto dead project (despite new PRs and GHIs being opened). If TIAF is deemed a necessary feature of O3DE, this would actually solve quite a few problems (cross platform-ness, licensing, and the lack of maintenance for OpenCppCoverage).
Is TIAF intended to be locally runnable? Would we be syncing its dependencies alongside O3DE's normal dependencies?
Not intended, per se, but as we are not distributing the source or binary with O3DE, users themselves can make the call as to whether the licensing restrictions are prohibitive or not for their own purposes.
As long as there's an easy opt-in, then I see little reason to prevent other SIGs from using such a tool. Opt-in seems initially preferable to opt-out while onboarding to a new tool, though switching to opt-out (default in) may eventually be preferable in the long term.
Then this would only require SIG-Testing to document why someone would want to opt to always have a given test run, and leave the power in other SIGs' hands. For example: the benefit of running tests less often vs. the potential pain of a deselected test breaking, tangentially-related broken code shipping, and then AR blocking the next new changes to the test's feature area.
Opt-in sounds like a solid plan, although for the initial onboarding I would go as far as to say to use safe mode so that we can gather the data while still running the deselected tests. The data we generate from this can feed into stakeholder decisions as to whether or not they wish to opt-in when we take safe mode off.
I'm drifting off-topic a bit here but this is all useful stuff we can bring to the discussion with the wider audience 👍
Then this would only require SIG-Testing to document why someone would want to opt to always have a given test run, and leave the power in other SIGs' hands. For example: the benefit of running tests less often vs. the potential pain of a deselected test breaking, tangentially-related broken code shipping, and then AR blocking the next new changes to the test's feature area.
@jeremyong-az, the chair of the TSC, replied that they will defer to the expertise of sig-testing and sig-build unless the discussion needs to be escalated 👍
Looks good for us to move this into final review stage, tentative agenda item for 29/11/2022 SIG Testing meeting @Kadino
SIG-Build agrees with this implementation and there are no immediate issues we're seeing from our perspective (defer to SIG-Testing on methodology). We are still concerned with the use of OpenCPPCoverage (for all the reasons mentioned in this thread), but as we discussed in prior meetings, we will assume that there will be an eventual offramp plan. For future goals, we also may want to further improve on local and agnostic storage devices for the cache, as not all contributors may have the ability to use S3 in their use cases.
The data provided for test time improvements are impressive! Total end-to-end runs should be improved, but agree with other statements about certain build targets being the long tail. SIG-Build does have upcoming improvements that might help, but we need further efforts to make compilation more efficient.
I do have a question about caching, however: You mentioned that cached artifacts will be culled on a regular basis, but does that use any LRU/MRU replacement strategies? Just seeing if we're maximizing the retention of the artifacts for first/clean runs.
Based on feedback from the Testing SIG meeting, we are good to close this out and move forward. @jonawals, can you address @amzn-changml's question above outside of this?
Summary
This feature proposes a change-based testing system called the Test Impact Analysis Framework (TIAF) that utilizes test impact analysis (TIA) for C++ and Python tests to replace the existing CTest system for test running and reporting in automated review (AR). Rather than running all of the tests in the suite for each AR run, this system uses the source changes in the pull request (PR) to select only the tests deemed relevant to those changes. By running only the relevant subset of tests for a given set of source changes, this system will reduce the amount of time spent in AR running tests and reduce the frequency at which flaky tests unrelated to the source changes can block PRs from being merged.
What is the relevance of this feature?
As of the time of writing, it takes on average three attempts at AR for a given PR to successfully pass prior to merging. During each run, it takes on average 15 to 20 minutes for all of the C++ and Python tests to be run. When a given AR run fails, sometimes this will be due to pipeline and build issues (beyond the scope of this RFC) and other times due to flaky failing tests that are not relevant to the source changes. This system will reduce the amount of time spent in AR by reducing the number of tests being run for a given set of source changes, as well as reducing the likelihood of irrelevant, flaky tests causing false-negative AR run failures.
Feature design description
TIA works on the principle that for a given set of incoming source changes to be merged into the repository, the tests that do not cover the impacted source changes are wasted, as they are incapable of detecting new bugs and regressions in those source changes. Likewise, the opposite is true: the tests that do cover those changes are the relevant subset of tests we should select to run in order to detect new bugs and regressions in those incoming source changes.
In order to determine which tests are pertinent to a given set of incoming source changes we need to run our tests with instrumentation to determine the coverage of those tests so we can build up a map of tests to the production sources they cover (and vice versa). Each time we select a subset of tests we deem relevant to our incoming source changes we need to also run those tests with said instrumentation so we can update our coverage map with the latest coverage data. This continuous cycle of select tests → run tests with instrumentation → update coverage map for each set of source changes coming into the repository is the beating heart of TIA.
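A minimal sketch of that cycle, assuming a coverage map keyed by source file (the data structures and function names are illustrative, not the actual TIAF implementation):

```python
# coverage_map: source file path -> set of test names known to cover it.

def select_tests(changed_files, coverage_map, all_tests):
    selected = set()
    for path in changed_files:
        if path not in coverage_map:
            # No coverage data for this file yet (e.g. a brand-new source):
            # be conservative and select everything.
            return set(all_tests)
        selected |= coverage_map[path]
    return selected

def update_coverage(coverage_map, test, covered_files):
    # Replace the test's stale coverage with the freshly instrumented coverage.
    for tests in coverage_map.values():
        tests.discard(test)
    for path in covered_files:
        coverage_map.setdefault(path, set()).add(test)

def run_tia_cycle(changed_files, coverage_map, all_tests, run_instrumented):
    # Select relevant tests, run them with instrumentation, update the map.
    for test in select_tests(changed_files, coverage_map, all_tests):
        passed, covered_files = run_instrumented(test)
        update_coverage(coverage_map, test, covered_files)
```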
Test selection
The circle below represents the total set of tests in our repository:
The red circle below represents the relevant subset of tests selected for a given set of incoming source changes:
In this example, the red circle is notably smaller than the white circle so in theory we can speed up the test run time for our incoming source changes by only running the tests in the red circle that we have deemed relevant to those changes.
In practice
Unfortunately, OpenCppCoverage, the open-source, toolchain-independent coverage tool available on the Windows platform, is far too slow for use in the standard implementation of TIA. Furthermore, there is no tooling available to perform TIA for our Python tests. As such, two new algorithms have been developed to speed up TIA enough for real-world use in O3DE for both our C++ and Python tests.
Technical design description
C++ tests
Fast Test Impact Analysis (FTIA) is an algorithm developed as part of the TIAF that extends the concept of TIA to use cases where the time cost of instrumenting tests to gather their coverage data is prohibitively expensive.
FTIA works on the principle that if we were to generate an approximation of test coverage that is a superset of the actual coverage, then we can play as fast and loose with that approximation as necessary to speed up the instrumentation, so long as it remains a superset of the actual coverage.
Test selection
The yellow circle below represents the approximation of the relevant subset of C++ tests in a debug build for a given set of incoming source changes:
Notice that the yellow circle is notably larger than the red circle. However, notice that it is also notably smaller than the white circle.
Test instrumentation
We achieve this by skipping the breakpoint-placing stage of OpenCppCoverage. In effect, this acts as a means to enumerate all of the sources used to build the test target binary and any compile-time and runtime build targets it depends on. The end result is that it runs anywhere from 10 to 100 times faster than vanilla OpenCppCoverage, depending on the source file count. Not only that, but with some further optimizations (unrelated to instrumentation) we can bring the speed of running instrumented test targets to parity with those without instrumentation.
Is this not the same as static analysis of which sources comprise each build target?
Static analysis of the build system by walking build target dependency graphs to determine which test targets cover which production targets is another technique for test selection in the same family as TIA, but with one important detail: it is incapable of determining runtime dependencies (e.g. child processes launched and runtime-linked shared libraries loaded by the test target, which some build targets in O3DE do). Not only that, but with FTIA we get another optimization priced in for free: for profile builds, the number of sources used to build the targets in O3DE is typically considerably less than in a debug build (the latter being somewhat a reflection of the sources specified for the build targets in CMake). This is due to the compiler being able to optimize away dead code and irrelevant dependencies etc., something that would be functionally impossible to do with static analysis alone.
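For comparison, a sketch of what the static-analysis alternative looks like, assuming a build-time dependency graph (names are illustrative); note that runtime-only dependencies are simply absent from such a graph, which is the limitation described above:

```python
# static_deps: build target -> set of build targets it depends on at build time.
# Spawned processes and runtime-loaded modules never appear in this graph.

def transitive_deps(target, static_deps, seen=None):
    seen = set() if seen is None else seen
    for dep in static_deps.get(target, set()):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, static_deps, seen)
    return seen

def select_by_static_analysis(changed_production_targets, test_targets, static_deps):
    changed = set(changed_production_targets)
    return {test for test in test_targets
            if transitive_deps(test, static_deps) & changed}
```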
The blue circle below represents the approximation of the relevant subset of C++ tests in a profile build for a given set of incoming source changes:
Notice that the blue circle is notably smaller than the yellow circle and approaching a more accurate approximation of the red circle. In effect, we had to do nothing other than change our build configuration to leverage this performance optimization.
Python tests
Indirect Source Coverage (ISC) is the novel technique developed to integrate our Python end-to-end tests into TIAF. The goal of ISC is to infer which dependencies a given test has in order to facilitate the ability to map changes to these dependencies to their dependent tests. Such coverage is used to associate the likes of assets and Gems to Python tests, shaders and materials to Atom tests, and so on. Whilst the technique described in this RFC is general, the specific focus is on coverage for Python tests such that they can be integrated into TIAF.
Basic approach
The techniques for inferring source changes to C++ tests are not applicable to Python tests, as FTIA would produce far too much noise to allow for pinpointed test selection and prioritization. Instead, Python test coverage is gathered by enumerating, at edit time, the components on any activated entities as the test's non-source coverage; this is used to determine which Python tests can be selected when the source files of those components are created, modified or deleted.
An additional optimization step is performed: for each known component, the build system is queried for any other build targets (and their dependencies) for which any of the known component's parent build targets is the exclusive depender. These exclusively depended-upon targets are thus also deemed to be known dependencies for any Python tests that depend on said parent build target.
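A hedged sketch of that optimization, assuming simple forward and reverse dependency maps queried from the build model (names are illustrative, not the actual implementation):

```python
# dependencies_of: build target -> set of build targets it depends on.
# dependers_of:    build target -> set of build targets that depend on it.

def exclusive_dependencies(parent_target, dependencies_of, dependers_of):
    """Targets (and their own dependencies) that only parent_target depends on."""
    exclusive = set()
    for dep in dependencies_of.get(parent_target, set()):
        if dependers_of.get(dep, set()) == {parent_target}:
            exclusive.add(dep)
            exclusive |= dependencies_of.get(dep, set())
    return exclusive

# Any Python test that depends on parent_target is then also deemed to depend
# on these exclusively depended-upon targets.
```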
Differences Between C++ Coverage and Test Selection
Python test selection differs from C++ test selection on a fundamental level: C++ test selection is additive, whereupon the selector optimistically assumes NO C++ tests should be run unless a given test can conclusively be determined, through analysis of coverage data, to be relevant, whereas Python test selection is subtractive, whereupon the selector pessimistically assumes ALL Python tests should be run unless a given test can conclusively be determined, through analysis of non-source coverage, to be irrelevant. This subtractive process allows test selection to still occur with only tentative, less rigorous coverage data that can be expanded upon as more rules and non-source coverage options are explored, without the risk of erroneously eliminating Python tests due to incomplete assumptions and limited data.
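An illustrative contrast of the two policies (assuming, for brevity, a coverage map keyed by source file and a set of sources known from ISC; neither is the real implementation):

```python
def select_cpp_tests(changed_files, coverage_map):
    # Additive: start from nothing and add only tests whose recorded coverage
    # conclusively intersects the changed files.
    selected = set()
    for path in changed_files:
        selected |= coverage_map.get(path, set())
    return selected

def select_python_tests(changed_files, all_python_tests, known_sources, coverage_map):
    # Subtractive: start from everything; deselection only happens when every
    # changed file is a known, covered source.
    if not set(changed_files) <= known_sources:
        return set(all_python_tests)
    selected = set()
    for path in changed_files:
        selected |= coverage_map.get(path, set())
    return selected
```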
Limitations of ISC
The limitations of this proposed approach are as follows:
Technical Design
The PythonCoverage Gem is a system Gem, enabled for the AutomatedTesting project, that enumerates at run-time all components on all activated entities and writes the parent module of each component to a coverage file for each Python test case. Using this coverage data, the build target for each parent module binary is inferred, and thus the sources that compose that build target. Should a change list exclusively contain source files belonging to any of the enumerated components, it is eligible for Python test selection. Otherwise, it is not, and all Python tests must be run.
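A hedged sketch of that selection rule, assuming a per-test mapping from the coverage files to the inferred sources (the names and coverage-file format are illustrative, not the actual PythonCoverage output):

```python
def python_tests_to_run(changed_files, all_python_tests, test_to_covered_sources):
    # Union of all sources inferred from the parent modules of the components
    # each Python test activated.
    known_sources = set()
    for sources in test_to_covered_sources.values():
        known_sources |= sources

    # Eligible for selection only if the change list exclusively contains
    # sources belonging to enumerated components; otherwise run everything.
    if not set(changed_files) <= known_sources:
        return list(all_python_tests)

    return [test for test, sources in test_to_covered_sources.items()
            if sources & set(changed_files)]
```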
Coverage Generation
Dynamic Dependency Map Integration
Test Selection
What are the advantages of the feature?
For the most part, the benefits will come without any need for input from team members as it will be running as part of AR to reduce the time your PRs spend in AR, as well as reducing the likelihood of you being blocked by flaky tests. However, the framework itself is a complete model of the build system so there is a lot of scope for using this information to implement even more AR optimizations. For example, as the framework knows exactly which sources belong to which build targets, and which of those build targets are test targets and production targets, it knows that, say, a text file is not part of the build system (or covered by any tests) so changes to that file can be safely ignored by AR. Although it’s hard to speculate what the long term benefits of this framework will be, we are confident that other teams and customers will make good use of this framework to implement all manner of novel optimizations and useful repository insights.
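As a small example of the kind of optimization described above, a sketch of filtering a change list down to files the build model actually knows about (names are hypothetical):

```python
# source_to_build_targets: source file -> build targets it belongs to, as known
# by the framework's model of the build system.

def files_relevant_to_ar(changed_files, source_to_build_targets):
    return [path for path in changed_files if path in source_to_build_targets]

# A PR that only touches, say, a README would yield an empty list here, so the
# test stage could be skipped entirely for that change.
```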
C++ tests
On average, we are currently rejecting 85% of tests for incoming source changes, which has resulted in speedups of approximately 5x compared to running all of our tests (and there's still a lot of room for improvement; an end goal of >95% test rejection and a 15-25x overall speedup of test run times is realistic once all the other pieces are in place).
Above is the plot of test run duration for 130 builds in O3DE's AR servers using FTIA. The orange line represents the average duration of all of these builds (which currently stands at just over 30 seconds). Notice that many builds take significantly less than this, and it is only the long-pole test runs (those where the source changes touch root dependencies, e.g. AzCore) that are pushing the average higher.
Above is the selection efficiency (amount of tests rejected as being irrelevant for the source changes) for the same 130 builds. The average efficiency is about 85% (85% of tests being rejected) with many more being at or near 100%.
Python tests
We ran the Python test selector for the past 50 pull requests and measured the test selection efficiency, that is, the percentage of the 26 Python test targets in the Main suite that were rejected as being not relevant to the source changes in the PR. Our baseline efficiency to beat, as offered by our current CTest approach, is 0% (no test targets rejected), so anything above 0% would be considered a win.
In the graph above, the orange line represents the average test selection efficiency over all 50 PRs (currently sitting at 30%). That is to say, on average, 30% of test targets are rejected for a given PR (even if the data itself is very polarized). Although this falls far short of the 85%+ efficiency of the C++ test selection under Fast Test Impact Analysis, even if no further improvements were to be made we would still be cutting down the number of Python tests our customers need to run in AR by nearly a third. Over time, these numbers add up to significant, measurable savings in terms of server costs and the velocity at which PRs can be merged into the repository. Of course, we do have plans to improve the efficiency, but that's a challenge for another day.
What are the disadvantages of the feature?
This new system undoubtedly introduces far more complexity than the existing CTest system. This complexity in turn brings the responsibility to further develop and maintain the system, as well as more points of failure that could hinder AR should any breaking changes be introduced into the system.
How will this be implemented or integrated into the O3DE environment?
The average developer need do nothing but sit back and reap the benefits of time savings in AR when they wish to merge their PRs. For the teams responsible for maintaining the build system and AR pipeline, this system will require the CTest pipeline stages to be replaced with the TIAF pipeline stages. We have been testing this system with our own shadow pipeline to weed out any teething issues, so we will be able to provide a strategy for seamless replacement into O3DE's AR pipeline.
Currently, only Windows is supported. Should this system be rolled out with success, other platforms will be supported.
Are there any alternatives to this feature?
There are no free, open-source, toolchain-independent tools for providing TIA on the Windows platform.
How will users learn this feature?
Documentation will be provided but, for the most part, this feature will be transparent to the user. There is nothing they need to do in order to use this system other than open up a PR.
Are there any open questions?