o3de / sig-testing

Documentation and materials for the Open 3D Engine Test Special Interest Group

Proposed RFC Feature: Batched tests with no results are recollected and run in a new batch rather than failed #41

Open smurly opened 2 years ago

smurly commented 2 years ago

Summary:

Currently in editor_test.py, when a batched list of tests hits a timeout or a test exits the editor prematurely, the tests that did not execute have no result and are reported as failed. This confuses many readers, who see a list of failed tests and assume they actually ran and failed.

What is the relevance of this feature?

Actually running these tests, rather than marking them failed, will provide more useful information for debugging the underlying timeout or unexpected exit.

Feature design description:

Technical design description:

Recollection and execution occurs for tests which report no results. Recollected tests should not be run a third time if they again fail to report results. Results will clearly call out "no result" rather than failure: "Test did not run or had no results!"
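For illustration, a minimal sketch of the intended flow (hypothetical helper names, not the actual editor_test.py API; `run_batch` is assumed to return a mapping of test name to result, with `None` for tests that never produced a result):

```python
# Hypothetical sketch only; names and data shapes do not match editor_test.py.
NO_RESULT = "Test did not run or had no results!"

def run_batch_with_recollection(batch, run_batch):
    """Run a batch, then re-run once any tests that produced no result."""
    results = run_batch(batch)  # assumed shape: {test_name: result_or_None}

    # Recollect tests that never produced a result (editor timed out or exited early).
    no_result = [test for test in batch if results.get(test) is None]
    if no_result:
        retry_results = run_batch(no_result)  # single retry, never a third attempt
        for test in no_result:
            retry = retry_results.get(test)
            # Still no result after the retry: report that explicitly, not as a failure.
            results[test] = retry if retry is not None else NO_RESULT
    return results
```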

What are the advantages of the feature?

Removes failure reports for tests that simply did not run. This reduces initial confusion among developers troubleshooting AR results.

What are the disadvantages of the feature?

How will this be implemented or integrated into the O3DE environment?

Are there any alternatives to this feature?

How will users learn this feature?

Are there any open questions?

AlexOteiza commented 2 years ago

The reporter should already cover this. Tests that didn't run are set to the UNKNOWN or TIMEOUT state: TIMEOUT for the test that timed out, and UNKNOWN for the rest of the tests that didn't run.
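For illustration, the described behavior amounts to something like the following (a simplified sketch, not the actual reporter code; the state names are taken from the comment above):

```python
# Illustrative sketch of the described reporter behavior; not the real implementation.
def assign_states_after_timeout(batch, completed_results, timed_out_test):
    """Fill in states for tests that never ran once a batch hits a timeout."""
    results = dict(completed_results)  # tests that did run keep their real results
    for test in batch:
        if test not in results:
            # The test that hit the timeout gets TIMEOUT; the rest never ran at all.
            results[test] = "TIMEOUT" if test == timed_out_test else "UNKNOWN"
    return results
```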

I didn't implement any re-running of tests because, if there is a fundamental issue with the editor that times out every single test, the run can take more than 10 hours. At that point Jenkins will kill the test run, making it harder to know what the issue was. This has happened in the past and was a pain to diagnose.

I advise against re-running the tests, as this would also add a lot of complexity to the already complex editor batching code.

If you want, we can set up a meeting to go a bit deeper into this.

Kadino commented 2 years ago

The benefit of attempting to run tests which never executed would be a more complete failure report about what internal functionality still works. However, the presence of a failure should remain the most important signal.

I don't think reruns should be attempted if a batch of tests times out, especially if this is done recursively. I suggest it is of limited value to (re)run additional tests when a crash is detected, but I would be interested in more information about why this is painful.

AlexOteiza commented 2 years ago

If we limit the number of re-runs to a sensible number (e.g. 3), that should cover the issue of overly long test runs.
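As a sketch of that idea (hypothetical names, assuming the same kind of batch runner as the earlier sketch), a bounded re-run loop might look like:

```python
# Hypothetical sketch: cap the total number of re-run attempts for a batch.
MAX_RERUNS = 3  # the "sensible number" suggested above

def run_with_bounded_reruns(batch, run_batch, max_reruns=MAX_RERUNS):
    """Re-run tests with no result until they report, up to max_reruns attempts."""
    results = run_batch(batch)
    remaining = [t for t in batch if results.get(t) is None]
    for _ in range(max_reruns):
        if not remaining:
            break
        retry = run_batch(remaining)
        # Keep only retries that actually produced a result.
        results.update({t: r for t, r in retry.items() if r is not None})
        remaining = [t for t in remaining if results.get(t) is None]
    return results
```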

In any case, if the root issue is that the tests are shown as failed, why don't we just mark them in Jenkins with a warning or something similar? I am worried about the complexity that re-running tests would add to already complex code, and about the difficult-to-diagnose failures this could introduce into the pipeline.

smurly commented 2 years ago

In any case, if the root issue is that the tests are shown as failed, why don't we just mark them in Jenkins with a warning or something similar?

I think this is the root of the issue. Tests which have no result are currently marked as failed even though they were never executed. A cursory examination of the output then suggests whole lists of tests have failed when in fact only one test may have failed. This contributes to confusion over which test actually failed and to an overall perception that many tests are failing without reason (leading to lowered confidence in results). Mark them as DNR (did not run) or inconclusive rather than Failed. The counterargument is that tests should have a Boolean Pass/Fail outcome, but this is the null state: they do not have an outcome at all.

The report summary may count them as an unknown outcome, but CTest and other layers above this count them as failed and list them among the failed tests. Only a deeper examination of the log output would show they are tests that were not run.
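As a sketch of what surfacing this distinctly could look like (hypothetical outcome names, not the existing Report.summary format):

```python
# Hypothetical sketch: count no-result tests separately instead of folding them into FAILED.
from collections import Counter

def summarize(results):
    """Summarize outcomes, treating missing results as DID_NOT_RUN rather than FAILED."""
    outcomes = Counter()
    for test, result in results.items():
        if result is None:
            outcomes["DID_NOT_RUN"] += 1  # null state: the test never produced an outcome
        elif result == "passed":
            outcomes["PASSED"] += 1
        else:
            outcomes["FAILED"] += 1
    return dict(outcomes)
```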

Kadino commented 2 years ago

The report summary may count them as an unknown outcome, but CTest and other layers above this count them as failed and list them among the failed tests. Only a deeper examination of the log output would show they are tests that were not run.

From CTest's standpoint the module has failed, and the report is accurate. I think this highlights that the PyTest-level reporting is what is misleading, including the Test XML summary that appears in Jenkins.

smurly commented 2 years ago

Right, the pytest level of reporting currently lists them as failed, complete with command lines for rerunning the failed tests. The discussion typically starts with a copy-paste of a list of tests that did not actually run but say Failed, followed by claims that the PR is unrelated to these failures. Only deeper in the log output, in the JSON snippet of Report.summary, might you see that they didn't actually run. We should surface actual failure indications more prominently and not mark individual tests which have no result as failed.