SIGABRT: Too many open files

yashi commented 1 year ago

Describe the bug: This is more of a question than a bug report. It is a real problem for me, but I may be doing something wrong. Anyway, here goes.

I have a micro-ROS setup in my lab: one Agent and some nodes running on my PC, and a few micro-ROS nodes running on an MCU. Since I'm debugging my micro-ROS nodes, I frequently restart the MCU. While doing so, the Agent started to die with the following message:

terminate called after throwing an instance of 'std::system_error'
  what():  eventfd_select_interrupter: Too many open files

Thread 8 "micro_ros_agent" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff23f96c0 (LWP 386018)]

Every time I restart the MCU (that is, power-cycle it), I see a set of new threads created on the Agent.

This is just after start:

$ ros2 run --prefix 'gdb -ex run --args' micro_ros_agent micro_ros_agent udp4 --port 8888
GNU gdb (Debian 13.2-1) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/yashi/work/yoshida/agent-ws/install/lib/micro_ros_agent/micro_ros_agent...
Starting program: /home/yashi/work/yoshida/agent-ws/install/lib/micro_ros_agent/micro_ros_agent udp4 --port 8888
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff53ff6c0 (LWP 448395)]
[New Thread 0x7ffff4bfe6c0 (LWP 448396)]
[New Thread 0x7ffff43fd6c0 (LWP 448397)]
[1695638338.755919] info     | UDPv4AgentLinux.cpp | init                     | running...             | port: 8888
[New Thread 0x7ffff3bfc6c0 (LWP 448398)]
[New Thread 0x7ffff33fb6c0 (LWP 448399)]
[New Thread 0x7ffff2bfa6c0 (LWP 448400)]
[New Thread 0x7ffff23f96c0 (LWP 448401)]
[New Thread 0x7ffff1bf86c0 (LWP 448402)]
[1695638338.756566] info     | Root.cpp           | set_verbose_level        | logger setup           | verbose_level: 4

Then, when I start the MCU, additional messages are displayed:

[1695638522.802447] info     | Root.cpp           | create_client            | create                 | client_key: 0x0542EA63, session_id: 0x81
[1695638522.802642] info     | SessionManager.hpp | establish_session        | session established    | client_key: 0x0542EA63, address: 10.30.1.102:45528
[New Thread 0x7ffff13f76c0 (LWP 450585)]
[New Thread 0x7ffff0bf66c0 (LWP 450586)]
[New Thread 0x7fffe3fff6c0 (LWP 450587)]
[New Thread 0x7fffe37fe6c0 (LWP 450588)]
[New Thread 0x7fffe2ffd6c0 (LWP 450589)]
[New Thread 0x7fffe27fc6c0 (LWP 450590)]
[New Thread 0x7fffe1ffb6c0 (LWP 450591)]
[New Thread 0x7fffe17fa6c0 (LWP 450592)]
[New Thread 0x7fffe0ff96c0 (LWP 450593)]
[New Thread 0x7fffd7fff6c0 (LWP 450594)]
[New Thread 0x7fffd77fe6c0 (LWP 450595)]
[New Thread 0x7fffd6ffd6c0 (LWP 450596)]
[New Thread 0x7fffd67fc6c0 (LWP 450597)]
[New Thread 0x7fffd5ffb6c0 (LWP 450598)]
[New Thread 0x7fffd57fa6c0 (LWP 450599)]
[1695638522.829744] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x000(1)
[1695638522.831119] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x000(2), participant_id: 0x000(1)
[1695638522.831973] info     | ProxyClient.cpp    | create_subscriber        | subscriber created     | client_key: 0x0542EA63, subscriber_id: 0x000(4), participant_id: 0x000(1)
[1695638523.164441] info     | ProxyClient.cpp    | create_datareader        | datareader created     | client_key: 0x0542EA63, datareader_id: 0x000(6), subscriber_id: 0x000(4)
[New Thread 0x7fffd4ff96c0 (LWP 450600)]
[New Thread 0x7fffc7fff6c0 (LWP 450601)]
[New Thread 0x7fffc77fe6c0 (LWP 450602)]
[New Thread 0x7fffc6ffd6c0 (LWP 450603)]
[New Thread 0x7fffc67fc6c0 (LWP 450604)]
[New Thread 0x7fffc5ffb6c0 (LWP 450605)]
[New Thread 0x7fffc57fa6c0 (LWP 450606)]
[1695638523.173110] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x001(1)
[1695638523.174367] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x001(2), participant_id: 0x001(1)
[1695638523.175187] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x000(3), participant_id: 0x001(1)
[1695638523.176740] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x000(5), publisher_id: 0x000(3)
[New Thread 0x7fffc4ff96c0 (LWP 450607)]
[New Thread 0x7fff97fff6c0 (LWP 450608)]
[New Thread 0x7fff977fe6c0 (LWP 450609)]
[New Thread 0x7fff96ffd6c0 (LWP 450610)]
[New Thread 0x7fff967fc6c0 (LWP 450611)]
[New Thread 0x7fff95ffb6c0 (LWP 450612)]
[1695638523.181932] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x002(1)
[1695638523.183732] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x002(2), participant_id: 0x002(1)
[1695638523.184433] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x001(3), participant_id: 0x002(1)
[1695638523.185774] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x001(5), publisher_id: 0x001(3)
[New Thread 0x7fff957fa6c0 (LWP 450613)]
[New Thread 0x7fff94ff96c0 (LWP 450614)]
[New Thread 0x7fff87fff6c0 (LWP 450615)]
[New Thread 0x7fff877fe6c0 (LWP 450616)]
[New Thread 0x7fff86ffd6c0 (LWP 450617)]
[New Thread 0x7fff867fc6c0 (LWP 450618)]
[1695638523.190855] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x003(1)
[1695638523.192624] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x003(2), participant_id: 0x003(1)
[1695638523.193318] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x002(3), participant_id: 0x003(1)
[1695638523.194676] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x002(5), publisher_id: 0x002(3)
[New Thread 0x7fff85ffb6c0 (LWP 450619)]
[New Thread 0x7fff857fa6c0 (LWP 450620)]
[New Thread 0x7fff84ff96c0 (LWP 450621)]
[New Thread 0x7fff73fff6c0 (LWP 450622)]
[New Thread 0x7fff737fe6c0 (LWP 450623)]
[New Thread 0x7fff72ffd6c0 (LWP 450624)]
[1695638523.247237] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x004(1)
[1695638523.334084] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x004(2), participant_id: 0x004(1)
[1695638523.334912] info     | ProxyClient.cpp    | create_subscriber        | subscriber created     | client_key: 0x0542EA63, subscriber_id: 0x001(4), participant_id: 0x004(1)
[1695638523.337061] info     | ProxyClient.cpp    | create_datareader        | datareader created     | client_key: 0x0542EA63, datareader_id: 0x001(6), subscriber_id: 0x001(4)
[New Thread 0x7fff727fc6c0 (LWP 450625)]
[1695638523.338470] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x005(2), participant_id: 0x004(1)
[1695638523.339317] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x003(3), participant_id: 0x004(1)
[1695638523.466150] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x003(5), publisher_id: 0x003(3)
[New Thread 0x7fff71ffb6c0 (LWP 450626)]
[New Thread 0x7fff717fa6c0 (LWP 450627)]
[New Thread 0x7fff70ff96c0 (LWP 450628)]
[New Thread 0x7fff57fff6c0 (LWP 450629)]
[New Thread 0x7fff577fe6c0 (LWP 450630)]
[New Thread 0x7fff56ffd6c0 (LWP 450631)]
[1695638523.479161] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x005(1)
[1695638523.481011] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x006(2), participant_id: 0x005(1)
[1695638523.482010] info     | ProxyClient.cpp    | create_subscriber        | subscriber created     | client_key: 0x0542EA63, subscriber_id: 0x002(4), participant_id: 0x005(1)
[1695638523.483753] info     | ProxyClient.cpp    | create_datareader        | datareader created     | client_key: 0x0542EA63, datareader_id: 0x002(6), subscriber_id: 0x002(4)
[New Thread 0x7fff567fc6c0 (LWP 450632)]
[1695638523.484837] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x007(2), participant_id: 0x005(1)
[1695638523.485596] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x004(3), participant_id: 0x005(1)
[1695638523.599551] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x004(5), publisher_id: 0x004(3)
[New Thread 0x7fff55ffb6c0 (LWP 450634)]
[New Thread 0x7fff557fa6c0 (LWP 450635)]
[New Thread 0x7fff54ff96c0 (LWP 450636)]
[New Thread 0x7fff33fff6c0 (LWP 450637)]
[New Thread 0x7fff337fe6c0 (LWP 450638)]
[New Thread 0x7fff32ffd6c0 (LWP 450639)]
[1695638523.608012] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x006(1)
[1695638523.624584] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x008(2), participant_id: 0x006(1)
[1695638523.625307] info     | ProxyClient.cpp    | create_subscriber        | subscriber created     | client_key: 0x0542EA63, subscriber_id: 0x003(4), participant_id: 0x006(1)
[1695638523.626816] info     | ProxyClient.cpp    | create_datareader        | datareader created     | client_key: 0x0542EA63, datareader_id: 0x003(6), subscriber_id: 0x003(4)
[New Thread 0x7fff327fc6c0 (LWP 450640)]
[1695638523.627937] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x009(2), participant_id: 0x006(1)
[1695638523.628667] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x005(3), participant_id: 0x006(1)
[1695638523.739558] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x005(5), publisher_id: 0x005(3)
[New Thread 0x7fff31ffb6c0 (LWP 450641)]
[New Thread 0x7fff317fa6c0 (LWP 450642)]
[New Thread 0x7fff30ff96c0 (LWP 450643)]
[New Thread 0x7fff1bfff6c0 (LWP 450644)]
[New Thread 0x7fff1b7fe6c0 (LWP 450645)]
[New Thread 0x7fff1affd6c0 (LWP 450646)]
[1695638523.748872] info     | ProxyClient.cpp    | create_participant       | participant created    | client_key: 0x0542EA63, participant_id: 0x007(1)
[1695638523.751435] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x00A(2), participant_id: 0x007(1)
[1695638523.752189] info     | ProxyClient.cpp    | create_subscriber        | subscriber created     | client_key: 0x0542EA63, subscriber_id: 0x004(4), participant_id: 0x007(1)
[1695638523.754146] info     | ProxyClient.cpp    | create_datareader        | datareader created     | client_key: 0x0542EA63, datareader_id: 0x004(6), subscriber_id: 0x004(4)
[New Thread 0x7fff1a7fc6c0 (LWP 450647)]
[1695638523.755267] info     | ProxyClient.cpp    | create_topic             | topic created          | client_key: 0x0542EA63, topic_id: 0x00B(2), participant_id: 0x007(1)
[1695638523.755995] info     | ProxyClient.cpp    | create_publisher         | publisher created      | client_key: 0x0542EA63, publisher_id: 0x006(3), participant_id: 0x007(1)
[1695638523.757866] info     | ProxyClient.cpp    | create_datawriter        | datawriter created     | client_key: 0x0542EA63, datawriter_id: 0x006(5), publisher_id: 0x006(3)

It's fine to create new threads, but I don't see where in the source code these threads are reclaimed or freed. The thread count just keeps growing during my debugging session, and the Agent finally dies with "Too many open files".
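One way to confirm this kind of growth without attaching gdb is to watch the Agent's thread count and open file descriptor count from the outside. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names are hypothetical helpers for illustration and are not part of micro-ROS or the Agent.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Count the entries of /proc/<pid>/<subdir>, skipping "." and "..".
 * Returns -1 if the directory cannot be opened. */
static long count_proc_entries(pid_t pid, const char *subdir)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/%s", (long)pid, subdir);

    DIR *dir = opendir(path);
    if (dir == NULL) {
        return -1;
    }

    long count = 0;
    const struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0) {
            count++;
        }
    }
    closedir(dir);
    return count;
}

/* /proc/<pid>/task holds one entry per thread of the process. */
long count_threads(pid_t pid)
{
    return count_proc_entries(pid, "task");
}

/* /proc/<pid>/fd holds one entry per open file descriptor.
 * When a process inspects itself, the scan's own directory stream
 * is included in the count. */
long count_open_fds(pid_t pid)
{
    return count_proc_entries(pid, "fd");
}

/* Soft "Max open files" limit read from /proc/<pid>/limits.
 * Returns -1 if the file or the line cannot be read, or if the
 * limit is not numeric. */
long fd_soft_limit(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/limits", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }

    char line[256];
    long soft = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (sscanf(line, "Max open files %ld", &soft) == 1) {
            break;
        }
    }
    fclose(f);
    return soft;
}
```

Calling these with the Agent's PID after each MCU power cycle would show whether both counts climb toward the soft limit, which is the condition under which the kernel reports "Too many open files".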

To Reproduce: I don't have exact steps since I'm using my own code. I'd like to ask anyone using the micro-ROS Agent whether they see the same thing. Is this just me?

Expected behaviour: I want the Agent to reclaim unused threads. I assume that's quite difficult, since we never know when an old node will send a new message; it could be a long time before it sends a second one.

OTOH, what was the design choice behind this? How can I run the Agent for a long time without leaking resources?

System information (please complete the following information):

OS: Debian Sid
ROS 2: Humble
Agent Version: 3.0.5 (1b815304e9432bb843d7258d8e38594954f79bab)

gavanderhoorn commented 1 year ago

This seems like something that "Reconnections and liveliness" documents/discusses.

We've been using the hard liveliness check with great success here.

pablogs9 commented 1 year ago

Hello, the underlying Fast DDS implementation creates threads when a new DDS Domain Participant is created; this is outside the scope of the micro-ROS Agent codebase. The threads are destroyed when the DDS Domain Participants are destroyed.

As @gavanderhoorn mentioned, if you reset the MCU, there is no automated DDS Domain Participant destruction procedure.

Normally, if your MCUs are halting without a controlled destruction, it is recommended to do one of the following:

Use the hard liveliness check: this will destroy the entities created by an MCU if the MCU has not been alive for more than N seconds.
Reuse entities by reusing a known XRCE-DDS client key for each MCU: this option requires some modifications in the micro-ROS RMW, because for now the key is initialized randomly: https://github.com/micro-ROS/rmw_microxrcedds/blob/bc4eb312ac4601a4137c35f4a56b9b83b4b18339/rmw_microxrcedds_c/src/rmw_init.c#L115
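For the second option, the key has to be the same after every reset of a given MCU. Below is a minimal sketch of one way to get such a key, assuming the MCU exposes some unique ID bytes (for example a serial number). The function name is hypothetical, FNV-1a is just one convenient hash, and how the resulting key is handed to the RMW is deliberately not shown, since that is the modification mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: derive a 32-bit client key from a device's
 * unique ID bytes with the 32-bit FNV-1a hash. The same ID always
 * yields the same key, so the key survives a power cycle. */
uint32_t stable_client_key(const uint8_t *device_id, size_t len)
{
    uint32_t hash = 2166136261u;      /* FNV-1a 32-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        hash ^= device_id[i];
        hash *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return hash;
}
```

Because this is a hash, two different device IDs could in principle map to the same key; with a handful of MCUs in a lab that is unlikely, but a fixed per-board constant would avoid the question entirely.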

yashi commented 1 year ago

Thank you guys! I'll try them and re-open if I still have the problem.

yashi commented 1 year ago

With UCLIENT_HARD_LIVELINESS_CHECK set to ON in microxrcedds-client, I see a bunch of thread exit messages! Thank you again!

[Thread 0x7fff327fc6c0 (LWP 468141) exited]
[Thread 0x7fff98ff96c0 (LWP 468116) exited]
[Thread 0x7fff8bfff6c0 (LWP 468117) exited]
[Thread 0x7fff8b7fe6c0 (LWP 468118) exited]
[Thread 0x7fff8affd6c0 (LWP 468119) exited]
[Thread 0x7fff997fa6c0 (LWP 468115) exited]
  :
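The effect seen here, where resources tied to a silent client are released after a timeout, can be pictured as timeout-based eviction. The following is only a conceptual sketch under that assumption; it is not how the Agent or Micro XRCE-DDS implements the hard liveliness check, and all names and the table size are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CLIENTS 8 /* arbitrary size for this sketch */

typedef struct {
    bool     in_use;
    uint32_t client_key;
    uint64_t last_seen_ms; /* time of the last message from this client */
} client_slot_t;

typedef struct {
    client_slot_t slots[MAX_CLIENTS];
} client_table_t;

/* Record activity from a client, registering it if it is new.
 * Returns false only if the client is new and the table is full. */
bool client_touch(client_table_t *t, uint32_t key, uint64_t now_ms)
{
    client_slot_t *free_slot = NULL;
    for (size_t i = 0; i < MAX_CLIENTS; i++) {
        client_slot_t *s = &t->slots[i];
        if (s->in_use && s->client_key == key) {
            s->last_seen_ms = now_ms;
            return true;
        }
        if (!s->in_use && free_slot == NULL) {
            free_slot = s;
        }
    }
    if (free_slot == NULL) {
        return false;
    }
    free_slot->in_use = true;
    free_slot->client_key = key;
    free_slot->last_seen_ms = now_ms;
    return true;
}

/* Drop every client that has been silent for longer than timeout_ms.
 * Assumes now_ms comes from a monotonic clock, so it never goes
 * backwards. Returns the number of clients evicted; a real system
 * would free the client's entities at this point. */
size_t client_evict_stale(client_table_t *t, uint64_t now_ms, uint64_t timeout_ms)
{
    size_t evicted = 0;
    for (size_t i = 0; i < MAX_CLIENTS; i++) {
        client_slot_t *s = &t->slots[i];
        if (s->in_use && now_ms - s->last_seen_ms > timeout_ms) {
            s->in_use = false;
            evicted++;
        }
    }
    return evicted;
}
```

In this picture, an MCU that is power-cycled and comes back with a new random key simply stops refreshing its old slot, so the old entry ages out instead of accumulating.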

micro-ROS / micro-ROS-Agent

SIGABRT: Too many open files #207