o3de / o3de

Open 3D Engine (O3DE) is an Apache 2.0-licensed multi-platform 3D engine that enables developers and content creators to build AAA games, cinema-quality 3D worlds, and high-fidelity simulations without any fees or commercial obligations.
https://o3de.org
Other
7.76k stars 2.19k forks source link

AR Bug Report: Intermittent test failure, SmokeTest #11794

Open jcurtisk opened 2 years ago

jcurtisk commented 2 years ago

Describe the bug Test failure on an AR run with unrelated changes. AutomatedTesting::SmokeTest.smoke::TEST_RUN failed on the test case test_CLITool_SerializeContextTools_Works.py::TestCLIToolSerializeContextToolsWorks::test_CLITool_SerializeContextTools_Works Failed Jenkins Job Information: [Failed pipeline] https://jenkins.build.o3de.org/job/O3DE/job/PR-11793/1/) Windows/test_cpu_profile failed.

Attachments log.txt

Additional context This error has been reported before here

Kadino commented 2 years ago

This test only verifies that executing the --help command does not cause the application to crash or hang.

Assigned to @o3de/sig-core per https://github.com/o3de/o3de/blob/development/.github/CODEOWNERS#L35

jckand commented 1 year ago

Same issue occurs with test_CLITool_PythonBindingsExample_Works. There does not appear to have been any movement in investigating why these tests are intermittently failing. Both of these tests should likely be disabled as it doesn't appear they're adding much value.

hultonha commented 1 year ago

I just hit this issue again today 😖

[2022-11-25T11:14:48.732Z] ..\..\..\..\..\..\AutomatedTesting\Gem\PythonTests\smoke\test_CLITool_SerializeContextTools_Works.py::TestCLIToolSerializeContextToolsWorks::test_CLITool_SerializeContextTools_Works FAILED [ 75%]

See https://jenkins.build.o3de.org/blue/organizations/jenkins/O3DE/detail/PR-13383/1/pipeline/827

Log: log(2).txt

Any update on this issue @Kadino and/or @jckand-amzn? Can we please move this to Sandbox while it's investigated otherwise please. Thanks!

Kadino commented 1 year ago

I think @o3de/sig-core should maintain any tests of SerializeContextTools. I support off-boarding this test on their behalf.

hultonha commented 1 year ago

Anyone from @o3de/sig-core (@lumberyard-employee-dm?) able to remove this for now then if possible please?

lemonade-dm commented 1 year ago

Well I'll look into why it is failing. All these test to is launch an application with the --help option, which only uses AZ_Printf to output some usage test. There is no reason it should fail

lemonade-dm commented 1 year ago

Running the SerializeContextTools.exe -help command 10000 times through powershell succeeded without failure. I ran with the help argument in both the debug and profile configuration.

PS C:\code\o3de\build\windows\bin\profile> for ($i=0; $i -lt 10000; ++$i) { ./SerializeContextTools.exe --help }

Based on my investigation, the issue with this test is within the Python and not the actual SerialzieContextTools

lemonade-dm commented 1 year ago

Also launching the application in the profile takes at most 230 milliseconds which is well below the 10 seconds timeout in the python script.

The failure that occurred on Jenkins is due to timeout, which points to the machine being overloaded with resource utilization that it took more than 10 seconds to launch the application.

[2022-11-25T11:14:48.732Z] ================================== FAILURES ===================================
[2022-11-25T11:14:48.732Z] _ TestCLIToolSerializeContextToolsWorks.test_CLITool_SerializeContextTools_Works _
[2022-11-25T11:14:48.732Z] D:\workspace\o3de\AutomatedTesting\Gem\PythonTests\smoke\test_CLITool_SerializeContextTools_Works.py:25: in test_CLITool_SerializeContextTools_Works
[2022-11-25T11:14:48.732Z]     output = subprocess.run([file_path, "--help"], capture_output=True, timeout=10)
[2022-11-25T11:14:48.732Z] D:\workspace\o3de\python\runtime\python-3.10.5-rev1-windows\python\lib\subprocess.py:503: in run
[2022-11-25T11:14:48.732Z]     stdout, stderr = process.communicate(input, timeout=timeout)
[2022-11-25T11:14:48.732Z] D:\workspace\o3de\python\runtime\python-3.10.5-rev1-windows\python\lib\subprocess.py:1152: in communicate
[2022-11-25T11:14:48.732Z]     stdout, stderr = self._communicate(input, endtime, timeout)
[2022-11-25T11:14:48.732Z] D:\workspace\o3de\python\runtime\python-3.10.5-rev1-windows\python\lib\subprocess.py:1528: in _communicate
[2022-11-25T11:14:48.732Z]     raise TimeoutExpired(self.args, orig_timeout)
[2022-11-25T11:14:48.732Z] E   subprocess.TimeoutExpired: Command '['D:/workspace/o3de/build/windows/bin/profile\\SerializeContextTools', '--help']' timed out after 10 seconds
[2022-11-25T11:14:48.732Z] ============================== warnings summary ===============================
nick-l-o3de commented 1 year ago

If a machine is loaded it can take like 100x longer to start a process. The record i've seen is 8 minutes from starting a thread to seeing the first instruction in it. It could be a deadlock though

It can also take especially long if the tool has to load gem dlls.

For test situations where there is a timeout but the test is expected to finish very quickly and rarely, if ever, time out unless something is wrong, making a short timeout can make the test flaky. So usually you want to grossly overestimate the timeout required (as long as the test finishes instantly when its normally working, ie, never actually wait for that full timeout unless somethings wrong)

lemonade-dm commented 1 year ago

Well in this case the test isn't loading any dlls. It is launching a main function that prints out the help usage text

hultonha commented 1 year ago

FYI this test failed again recently here -

Log output can be found here - log.txt