Open asymingt opened 1 month ago
@wjwwood Could you please help to triage this issue?
@wjwwood, here's a standalone example that spawns 100 TF2 static broadcasters with their own log files, and then repeats this test 100 times to stress the logging system. Set ulimit -n 1024
and you should be able to easily open all the ROS shared libraries and log files for one test. Then, if you run the following test with pytest -vs ...
you will see the first couple tests run until the unclosed logs hit the barrier and you 'll get a OSError: [Errno 24] Too many open files
.
import pytest
import launch
import launch_pytest
import launch_ros
import random
import string
ALL_TEST_NAMES = [f"test_{k}" for k in range(0, 100)]
TEST_DURATION = 5.0
def gen_random_frame_name(N : int) -> string:
"""Generate a random frame name of N characters"""
return ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))
@pytest.fixture(scope='function')
def test_context(request):
"""Indirect param access to get the test name"""
return request.param
@launch_pytest.fixture(scope='function')
def launch_description(test_context):
# Launch a bunch of nodes, each with their own log file.
nodes = [
launch_ros.actions.Node(
name=f"{test_context}_node_{k}",
package="tf2_ros",
executable="static_transform_publisher",
arguments=[
"0",
"0",
"0",
"0",
"0",
"0",
gen_random_frame_name(8),
gen_random_frame_name(8),
],
output='own_log')
for k in range(0, 100)
]
# Shutdown this launch after TEST_DURATION seconds.
shutdown_timer = launch.actions.TimerAction(
period=TEST_DURATION,
actions=[
launch.actions.Shutdown(reason='launch is shutting down')
],
)
return launch.LaunchDescription(nodes + [
shutdown_timer,
launch_pytest.actions.ReadyToTest()
])
@pytest.mark.parametrize('test_context', ALL_TEST_NAMES, indirect=True)
@pytest.mark.launch(fixture=launch_description)
def test_execution(test_context):
yield
Any progress on this? Is there anything I can do on my side to help?
Any progress on this? Is there anything I can do on my side to help?
Just to give you some kind of answer; we don't really use this functionality very much in the core. And the person who initially developed it is no longer working on ROS 2. So we don't have a lot of great insight into what's going on here.
My best suggestion for now is to dig into it and see what is happening. We'll happily review pull requests to improve it.
I did some work on this and confirm that this bug still affects Jazzy, and I suspect it's because launch.logging.get_output_loggers(...)
calls made on these lines results in a file handler being opened and stored to the global launch.logging.launch_config
here. I think that the ExecuteLocal
class should probably close its own two specific loggers on cleanup here.
This workaround resets the logging system after each test completes and fixes the exception in my example:
@pytest.mark.parametrize('test_context', ALL_TEST_NAMES, indirect=True)
@pytest.mark.launch(fixture=launch_description)
def test_execution(test_context):
yield
launch.logging.reset() # <------ THIS LINE
As a consequence, the above line forces each parameterized test now to create a new directory for its log files, which in a sense is actually more desirable, so I might just go with this solution.
Bug report
When
@pytest.mark.parametrize
is used in a function scope withlaunch_pytest
the node log files from each test parameterization remain open even when execution of that specific parameterized test is complete. For large projects with dozens or hundreds of nodes this eventually causes aOSError: [Errno 24] Too many open files
error.Required Info:
Steps to reproduce issue
The code below cannot be run directly, but illustrates the issue. The
TestContext
object represents a specific test setup we want to run, and is constructed by name from a list of acceptable names inALL_TEST_NAMES
. This allows us to cherry-pick a subset of tests to run from our entire suite, withpytest -k
orpytest --regex
. The idea is each tests is run in sequence, against a new execution context and saved the results to aTEST_RESULTS
data structure in the global context. When the test session ends, theAnalyzeResults
function is called to analyze all results from all tests.Expected behavior
Test completes without throwing an error.
Actual behavior
Python throws an internal error
OSError: [Errno 24] Too many open files
.This only happens several tests into the sequence, pointing to the fact that something internally is not cleaning up after itself. I have painfully tracked
lsof -p <pid_of_pytest_invocation>
over time and realized that the ROS2 node log files created by each parameterized test are not being closed, even though the stack has shut down. So, let's say we havek
nodes and thereforek
log files created by ROS per launch test that is run. If I runn
parameterized tests sequentially the number of log files open at testi
, wherei
is in{1 ... n}
, will bei * k
.Additional information
You can always increase the number of files that may be opened at once with
ulimit
but this is a bandaid and not a solution.