Parameterized `launch_pytest` tests don't close node log files

asymingt commented 1 month ago

Bug report

When @pytest.mark.parametrize is used in a function scope with launch_pytest the node log files from each test parameterization remain open even when execution of that specific parameterized test is complete. For large projects with dozens or hundreds of nodes this eventually causes a OSError: [Errno 24] Too many open files error.

Required Info:

Operating System: Ubuntu Linux 20.04 amd64
Installation type: source
Version or commit hash: 9e517a3abd016367244fff9fa96ac8eae371a00b
DDS implementation: RTI Connext
Client library (if applicable): rclpy

Steps to reproduce issue

The code below cannot be run directly, but illustrates the issue. The TestContext object represents a specific test setup we want to run, and is constructed by name from a list of acceptable names in ALL_TEST_NAMES. This allows us to cherry-pick a subset of tests to run from our entire suite, with pytest -k or pytest --regex. The idea is each tests is run in sequence, against a new execution context and saved the results to a TEST_RESULTS data structure in the global context. When the test session ends, the AnalyzeResults function is called to analyze all results from all tests.

import pytest

import launch
import launch_pytest

from my_project.test_framework import TestContext, ALL_TEST_NAMES, AnalyzeResults

TEST_RESULTS = []

@pytest.fixture(scope='function')
def test_context(request):
    test_context = TestContext(launch_test=request.param)
    test_context.do_some_preprocessing_steps()
    return test_context

@launch_pytest.fixture(scope='function')
def launch_description(test_context):
    return launch.LaunchDescription(
        test_context.get_all_actions() + [launch_pytest.actions.ReadyToTest()])

@pytest.mark.parametrize('test_context', ALL_TEST_NAMES, indirect=True)
@pytest.mark.launch(fixture=launch_description)
def test_execution(test_context):
    results = test_context.do_some_post_postprocessing_steps()
    TEST_RESULTS.append(results)

@pytest.fixture(scope='session', autouse=True)
def run_assertions():
    yield
    AnalyzeResults(TEST_RESULTS)

Expected behavior

Test completes without throwing an error.

Actual behavior

Python throws an internal error OSError: [Errno 24] Too many open files.

This only happens several tests into the sequence, pointing to the fact that something internally is not cleaning up after itself. I have painfully tracked lsof -p <pid_of_pytest_invocation> over time and realized that the ROS2 node log files created by each parameterized test are not being closed, even though the stack has shut down. So, let's say we have k nodes and therefore k log files created by ROS per launch test that is run. If I run n parameterized tests sequentially the number of log files open at test i, where i is in {1 ... n}, will be i * k.

Additional information

You can always increase the number of files that may be opened at once with ulimit but this is a bandaid and not a solution.

MichaelOrlov commented 1 month ago

@wjwwood Could you please help to triage this issue?

asymingt commented 2 weeks ago

@wjwwood, here's a standalone example that spawns 100 TF2 static broadcasters with their own log files, and then repeats this test 100 times to stress the logging system. Set ulimit -n 1024 and you should be able to easily open all the ROS shared libraries and log files for one test. Then, if you run the following test with pytest -vs ... you will see the first couple tests run until the unclosed logs hit the barrier and you 'll get a OSError: [Errno 24] Too many open files.

import pytest
import launch
import launch_pytest
import launch_ros
import random
import string

ALL_TEST_NAMES = [f"test_{k}" for k in range(0, 100)]
TEST_DURATION = 5.0

def gen_random_frame_name(N : int) -> string:
    """Generate a random frame name of N characters"""
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))

@pytest.fixture(scope='function')
def test_context(request):
    """Indirect param access to get the test name"""
    return request.param

@launch_pytest.fixture(scope='function')
def launch_description(test_context):
    # Launch a bunch of nodes, each with their own log file.
    nodes = [
        launch_ros.actions.Node(
            name=f"{test_context}_node_{k}",
            package="tf2_ros",
            executable="static_transform_publisher",
            arguments=[
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                gen_random_frame_name(8),
                gen_random_frame_name(8),
            ],
            output='own_log')
        for k in range(0, 100)
    ]
    # Shutdown this launch after TEST_DURATION seconds.
    shutdown_timer = launch.actions.TimerAction(
        period=TEST_DURATION,
        actions=[
            launch.actions.Shutdown(reason='launch is shutting down')
        ],
    )
    return launch.LaunchDescription(nodes + [
        shutdown_timer,
        launch_pytest.actions.ReadyToTest()
    ])

@pytest.mark.parametrize('test_context', ALL_TEST_NAMES, indirect=True)
@pytest.mark.launch(fixture=launch_description)
def test_execution(test_context):
    yield

asymingt commented 1 week ago

Any progress on this? Is there anything I can do on my side to help?

clalancette commented 1 week ago

Any progress on this? Is there anything I can do on my side to help?

Just to give you some kind of answer; we don't really use this functionality very much in the core. And the person who initially developed it is no longer working on ROS 2. So we don't have a lot of great insight into what's going on here.

My best suggestion for now is to dig into it and see what is happening. We'll happily review pull requests to improve it.

asymingt commented 1 week ago

I did some work on this and confirm that this bug still affects Jazzy, and I suspect it's because launch.logging.get_output_loggers(...) calls made on these lines results in a file handler being opened and stored to the global launch.logging.launch_config here. I think that the ExecuteLocal class should probably close its own two specific loggers on cleanup here.

This workaround resets the logging system after each test completes and fixes the exception in my example:

@pytest.mark.parametrize('test_context', ALL_TEST_NAMES, indirect=True)
@pytest.mark.launch(fixture=launch_description)
def test_execution(test_context):
    yield
    launch.logging.reset()    # <------ THIS LINE

As a consequence, the above line forces each parameterized test now to create a new directory for its log files, which in a sense is actually more desirable, so I might just go with this solution.

ros2 / launch