launch_testing `@post_shutdown_test()` checking output and return codes is flaky

sloretz commented 3 years ago

Bug report

Required Info:

Operating System:
- Ubuntu Focal
Installation type:
- Source
Version or commit hash:
- 026ab0e6c5a84c5d6fed2b643a83999b6e873dea
DDS implementation:
- Default
Client library (if applicable):
- N/A

Steps to reproduce issue

Save the following as mre.py

import unittest

from launch import LaunchDescription
from launch.actions import ExecuteProcess
from launch_testing.actions import ReadyToTest
from launch_testing import post_shutdown_test
from launch_testing.asserts import assertExitCodes
from launch_testing.asserts import assertInStdout
import pytest

@pytest.mark.rostest
def generate_test_description():
    put = ExecuteProcess(cmd=('echo', 'foobar'))
    return LaunchDescription([put, ReadyToTest()]), {'put': put}

@post_shutdown_test()
class TestAfterShutdown(unittest.TestCase):

    def test_error_message(self, proc_output, put):
        assertInStdout(proc_output, 'foobar', put)

    def test_exit(self, proc_info):
        assertExitCodes(proc_info, allowable_exit_codes=[0])

Use launch_test command to run it

$ launch_test mre.py

Expected behavior

The test would always succeeds

Actual behavior

The test fails most, but not all of the time (fails about 90% of the time on my machine)

Additional information

When it succeeds the output is:

$ launch_test mre.py 
[INFO] [launch]: All log files can be found below /home/osrf/.ros/log/2021-09-23-16-34-26-384599-shane-1662924
[INFO] [launch]: Default logging verbosity is set to INFO

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK
[INFO] [echo-1]: process started with pid [1662927]
[INFO] [echo-1]: sending signal 'SIGINT' to process[echo-1]
[INFO] [echo-1]: process has finished cleanly [pid 1662927]
test_error_message (mre.TestAfterShutdown) ... ok
test_exit (mre.TestAfterShutdown) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.000s

OK

When it fails the output is:

$ launch_test mre.py 
[INFO] [launch]: All log files can be found below /home/osrf/.ros/log/2021-09-23-16-34-25-739192-shane-1662919
[INFO] [launch]: Default logging verbosity is set to INFO

----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK
[INFO] [echo-1]: process started with pid [1662922]
[INFO] [echo-1]: sending signal 'SIGINT' to process[echo-1]
[ERROR] [echo-1]: process has died [pid 1662922, exit code -2, cmd 'echo foobar'].
test_error_message (mre.TestAfterShutdown) ... ERROR
test_exit (mre.TestAfterShutdown) ... FAIL

======================================================================
ERROR: test_error_message (mre.TestAfterShutdown)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/osrf/ws/third_party/src/mre.py", line 22, in test_error_message
    assertInStdout(proc_output, 'foobar', put)
  File "/home/osrf/ws/ros2/install/launch_testing/lib/python3.8/site-packages/launch_testing/asserts/assert_output.py", line 121, in assertInStdout
    assertInStream(proc_output, expected_output, process, cmd_args, output_filter=output_filter,
  File "/home/osrf/ws/ros2/install/launch_testing/lib/python3.8/site-packages/launch_testing/asserts/assert_output.py", line 72, in assertInStream
    resolved_procs = resolveProcesses(
  File "/home/osrf/ws/ros2/install/launch_testing/lib/python3.8/site-packages/launch_testing/util/proc_lookup.py", line 112, in resolveProcesses
    raise NoMatchingProcessException(
launch_testing.util.proc_lookup.NoMatchingProcessException: No data recorded for proc echo-1 foobar

======================================================================
FAIL: test_exit (mre.TestAfterShutdown)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/osrf/ws/third_party/src/mre.py", line 25, in test_exit
    assertExitCodes(proc_info, allowable_exit_codes=[0])
  File "/home/osrf/ws/ros2/install/launch_testing/lib/python3.8/site-packages/launch_testing/asserts/assert_exit_codes.py", line 62, in assertExitCodes
    assert info.returncode in allowable_exit_codes, 'Proc {} exited with code {}'.format(
AssertionError: Proc echo-1 exited with code -2

----------------------------------------------------------------------
Ran 2 tests in 0.000s

FAILED (failures=1, errors=1)

Is the @post_shutdown_test() letting the test run before the process shuts down?

sloretz commented 3 years ago

I think this is the cause of the test failures in CI for https://github.com/ros/joint_state_publisher/pull/67

hidmic commented 3 years ago

@sloretz I suspect the test is racing with asyncio.subprocess IO (at best, assuming foobar actually made it back through pipes). I'll fiddle with this a bit to confirm.

hidmic commented 3 years ago

Alright, I can only reproduce it by putting the machine under stress. It looks as if launch_testing is shutting down the fixture (by firing SIGINTs) before echo is even scheduled. There's no pre-shutdown test, so this MRE is inherently racy.

@sloretz consider waiting for and asserting the exit code in a pre-shutdown test.

hidmic commented 3 years ago

@clalancette @sloretz let me know what you want to do about this ticket. Perhaps some documentation somewhere?

sloretz commented 3 years ago

@sloretz consider waiting for and asserting the exit code in a pre-shutdown test. @clalancette @sloretz let me know what you want to do about this ticket. Perhaps some documentation somewhere?

I would expect any post-shutdown information to be ready to retrieve in a @post_shutdown_test(). Can launch_test wait until the exit code is ready before calling the post-shutdown test?

hidmic commented 3 years ago

I would expect any post-shutdown information to be ready to retrieve in a @post_shutdown_test().

It is available -- not the one you expect though, because of early signaling.

Can launch_test wait until the exit code is ready before calling the post-shutdown test?

We can force launch_testing to wait for the launch service to go idle. #528 supports that. We could do something similar here.

clalancette commented 3 years ago

For me, there are two sides to this.

One is the "legacy" launch_testing framework. While I think we can make improvements there, I think we should probably keep changes minimal. That said, because we still have to support launch_testing in Foxy until 2023, we should document how to do some of this stuff more nicely.

The other side is the "new" pytest-based testing framework being done in #528. There I would want to make sure that this is much more intuitive for the end-user (which it sounds like it is).

dpiet commented 2 years ago

I also came across this issue. Is there a workaround for Foxy?

ros2 / launch