ros2 / ros2cli

ROS 2 command line interface tools
Apache License 2.0

Ros2 node list stops working after unrelated node crashes #672

Open Kaju-Bubanja opened 2 years ago

Kaju-Bubanja commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

Unfortunately I don't have detailed steps to reproduce, which makes this so hard to debug. The error might not happen for days and then turn up again constantly. I have various nodes doing different things, and somehow one of them must brick the whole ROS system. I'm not sure how this is even possible, but once it happens only a complete reboot of the computer helps. If I try to use commands like ros2 node list or ros2 service list I get the following stack trace:

Traceback (most recent call last):
  File "/home/username/ros2_rolling/install/ros2cli/bin/ros2", line 11, in <module>
    load_entry_point('ros2cli', 'console_scripts', 'ros2')()
  File "/home/username/ros2_rolling/build/ros2cli/ros2cli/cli.py", line 89, in main
    rc = extension.main(parser=parser, args=args)
  File "/home/username/ros2_rolling/build/ros2node/ros2node/command/node.py", line 37, in main
    return extension.main(args=args)
  File "/home/username/ros2_rolling/build/ros2node/ros2node/verb/list.py", line 38, in main
    node_names = get_node_names(node=node, include_hidden_nodes=args.all)
  File "/home/username/ros2_rolling/build/ros2node/ros2node/api/__init__.py", line 60, in get_node_names
    node_names_and_namespaces = node.get_node_names_and_namespaces()
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1109, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1450, in __request
    response = self.__transport.request(
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1153, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1169, in single_request
    return self.parse_response(resp)
  File "/usr/lib/python3.8/xmlrpc/client.py", line 1341, in parse_response
    return u.close()
  File "/usr/lib/python3.8/xmlrpc/client.py", line 655, in close
    raise Fault(**self._stack[0])
xmlrpc.client.Fault: <Fault 1: "<class 'rclpy._rclpy_pybind11.InvalidHandle'>:cannot use Destroyable because destruction was requested">

I have a node which seems to be the cause of this issue, but I'm not totally sure. The node uses zeromq to interface with another system: it receives parameters and a state, configures another node with those parameters and that state, and additionally monitors whether that node is online using ros2 node list. The node has the following structure; some details like the zeromq stuff are left out, but I hope I left enough detail in to be useful:

import signal
import subprocess
import threading
import time

import rclpy
from rclpy.node import Node
from ros2param.api import load_parameter_dict
from ros2lifecycle.api import call_change_states

# The zeromq helpers (create_zeromq_server, create_zeromq_msg, create_zeromq_pub,
# parse_msg) and the coordinator/_logger objects are application-specific and omitted.

class Server(Node):
    def __init__(self):
        self.name = "my_name"
        super().__init__(self.name + "_client")

    def __enter__(self):
        self.server = create_zeromq_server(callback=self.request_received)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.server.terminate()

    def request_received(self, msg):
        params, state = parse_msg(msg)
        # I have another lifecycle node which this server controls which is named self.name
        # thus the name of this node is self.name + "_client"
        load_parameter_dict(node=self, node_name=self.name, parameter_dict=params)
        call_change_states(node=self, transitions=state)

def monitor(pub, msg):
    while coordinator.running:
        nodes = subprocess.check_output("ros2 node list", shell=True).decode().split()
        if "/name" in nodes:
            _logger.info("Name is online")
            msg.field = "online"
            pub.publish(msg)
        else:
            _logger.info("Name is offline")
            msg.field = "offline"
            pub.publish(msg)

def stop(signum, frame):
    rclpy.shutdown()
    _logger.info("Signal received")
    coordinator.running = False

def main():
    signal.signal(signal.SIGTERM, stop)
    rclpy.init()
    msg = create_zeromq_msg()
    pub = create_zeromq_pub()
    pub.start()
    threading.Thread(target=monitor, args=(pub, msg)).start()
    with Server() as server:
        _logger.info("Started server")
        while coordinator.running:
            _logger.debug("Running")
            time.sleep(2)
    _logger.info("Closed server")

if __name__ == '__main__':
    main()

Expected behavior

I would expect that even if a node somehow fails, it does not take down the whole system. The fact that diagnostics like ros2 node list don't work anymore also makes this hard to debug.

Actual behavior

ros2 node list no longer works and fails with the stack trace shown above; only a full reboot of the computer restores it.

Additional information

I tried the following MRE, but it doesn't cause the issue. I guess that's as intended, because that issue says it's fixed.

fujitatomoya commented 2 years ago

not really sure about the root cause of this, but could any of the following be a work-around?

ros2 daemon stop
ros2 daemon start
ros2 daemon status
ros2 node list --no-daemon
ros2 topic list --no-daemon
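
If it helps, a quick way to script the daemon reset is something like the following (just an illustration, assuming the ros2 CLI from ros2cli is on PATH):

import subprocess

# restart the CLI daemon, then ask it for the node list again
subprocess.run(['ros2', 'daemon', 'stop'], check=False)
subprocess.run(['ros2', 'daemon', 'start'], check=True)
nodes = subprocess.check_output(['ros2', 'node', 'list']).decode().split()
print(nodes)
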
ivanpauno commented 2 years ago

It seems that the daemon node that the ros2 CLI uses crashed, and then all the following calls fail. As @fujitatomoya suggests, resetting the daemon should solve the issue.

I'm not sure what the root cause is, though; if you find a way to consistently reproduce the issue, please share the steps.

Kaju-Bubanja commented 2 years ago

Interesting information. I also had the idea of a ros2 daemon; nice to see that it's already implemented. I will try the --no-daemon option next time it happens, and also try restarting the daemon to see if that helps.

Kaju-Bubanja commented 2 years ago

Ok, interestingly, ros2 daemon status shows "The daemon is running". Running ros2 node list --no-daemon a few times returns nothing on the first call, then 2/6 nodes, then 6/6 nodes, then nothing again. Restarting the daemon has fixed the issue mentioned above. So it does seem like the daemon somehow crashes, but without this being reported by daemon status. Definitely a bug that should be fixed imo. Unfortunately I still don't know what triggered it. It might be sudden power loss: the computer on which it happened is a mobile device and I always just unplug it instead of shutting it down.
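
The flaky --no-daemon results would be consistent with discovery simply needing time: without the daemon's long-lived node, every invocation starts discovery from scratch. As a rough illustration (my own sketch, assuming a sourced rclpy environment, not anything from ros2cli), repeated queries from one freshly created node should report more and more peers:

import time

import rclpy

rclpy.init()
node = rclpy.create_node('discovery_probe')
try:
    for delay in (0.0, 0.5, 2.0):
        time.sleep(delay)
        pairs = node.get_node_names_and_namespaces()  # [(name, namespace), ...]
        names = sorted(ns.rstrip('/') + '/' + n for n, ns in pairs)
        print(f'after waiting {delay}s more: {names}')
finally:
    node.destroy_node()
    rclpy.shutdown()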

fujitatomoya commented 2 years ago

Start 2 nodes on each machine, remapping the node name and topic name as follows.

### Intel Machine
root@edgemaster:~# ros2 run demo_nodes_cpp talker --ros-args -r __node:=afoo -r chatter:=abar >& /dev/null &
root@edgemaster:~# ros2 run demo_nodes_cpp talker --ros-args -r __node:=bfoo -r chatter:=bbar >& /dev/null &
### Raspi4-1
root@raspi4-1:/# ros2 run demo_nodes_cpp talker --ros-args -r __node:=cfoo -r chatter:=cbar >& /dev/null &
root@raspi4-1:/# ros2 run demo_nodes_cpp talker --ros-args -r __node:=dfoo -r chatter:=dbar >& /dev/null &
### Raspi4-2
root@raspi4-2:/# ros2 run demo_nodes_cpp talker --ros-args -r __node:=efoo -r chatter:=ebar >& /dev/null &
root@raspi4-2:/# ros2 run demo_nodes_cpp talker --ros-args -r __node:=ffoo -r chatter:=fbar >& /dev/null &

Check whether ros2 node list and ros2 topic list can list all instances in the ROS 2 network, with and without --no-daemon, on each machine.

root@edgemaster:~# ros2 node list
/afoo
/bfoo
/cfoo
/dfoo
/efoo
/ffoo
root@edgemaster:~# ros2 topic list
/abar
/bbar
/cbar
/dbar
/ebar
/fbar
/parameter_events
/rosout
root@edgemaster:~# ros2 node list --no-daemon
/afoo
/bfoo
/cfoo
/dfoo
/efoo
/ffoo
root@edgemaster:~# ros2 topic list --no-daemon
/abar
/bbar
/cbar
/dbar
/ebar
/fbar
/parameter_events
/rosout

The reported problem cannot be observed in my local environment.

Kaju-Bubanja commented 2 years ago

Maybe the issue is connected to how I start the server. I start it using supervisor with the following config:

[program:server]
directory=/home/username/app
command=/bin/bash -c 'source "$0" && exec "$@"' /home/username/ros2_rolling/install/setup.bash /home/username/app/bin/python server.py
environment=HOME="/home/username",FASTRTPS_DEFAULT_PROFILES_FILE="/home/username/ros2_rolling/src/project/config/default.xml",RMW_IMPLEMENTATION="rmw_fastrtps_cpp"
user=username
autostart=true
autorestart=false
startretries=5
stdout_logfile=/var/log/project/server.log
stdout_logfile_maxbytes=10MB
stdout_logfile_backups=10
redirect_stderr=true

Kaju-Bubanja commented 2 years ago

A new kind of error started appearing:

2021-11-30 09:27:34.850 ERROR __main__:server.py:111 failed to create client: rcl node's context is invalid, at /home/username/ros2_rolling/src/ros2/rcl/rcl/src/rcl/node.c:425
Traceback (most recent call last):
  File "server.py", line 99, in configure
    max_tracking_distance = self.set_location_independent_parameters()
  File "server.py", line 80, in set_location_independent_parameters
    load_parameter_dict(node=self, node_name=self.name, parameter_dict=config_dict)
  File "/home/username/ros2_rolling/build/ros2param/ros2param/api/__init__.py", line 120, in load_parameter_dict
    response = call_set_parameters(
  File "/home/username/ros2_rolling/build/ros2param/ros2param/api/__init__.py", line 206, in call_set_parameters
    client = node.create_client(SetParameters, f'{node_name}/set_parameters')
  File "/home/username/ros2_rolling/install/rclpy/lib/python3.8/site-packages/rclpy/node.py", line 1408, in create_client
    client_impl = _rclpy.Client(
rclpy._rclpy_pybind11.RCLError: failed to create client: rcl node's context is invalid, at /home/username/ros2_rolling/src/ros2/rcl/rcl/src/rcl/node.c:425

with this traceback popping up shortly after in the log:

Exception ignored in: <function tail_f_producer.__del__ at 0x7fcea196f550>
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/supervisor/http.py", line 649, in __del__
    self._close()
  File "/usr/lib/python3/dist-packages/supervisor/http.py", line 675, in _close
    self.file.close()
OSError: [Errno 9] Bad file descriptor
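
For what it's worth, the "rcl node's context is invalid" error above looks like what rclpy raises when you try to create something on a node after its context has been shut down. A minimal, hypothetical sketch of that failure mode (my assumption about the mechanism, not a confirmed reproduction of this bug):

import rclpy
from rcl_interfaces.srv import SetParameters

rclpy.init()
node = rclpy.create_node('context_demo')
rclpy.shutdown()  # invalidates the context the node was created with

# expected to fail with something like:
# RCLError: failed to create client: rcl node's context is invalid
client = node.create_client(SetParameters, '/some_node/set_parameters')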

While I was writing this bug report my computer just completely froze up, which seems very odd. No mouse movement and no keyboard shortcuts were possible; I couldn't even drop to a terminal. Something in ROS's lower levels seems to have caused a complete system halt.

I would really like to get to the bottom of what is causing these sporadic crashes of this node, but I'm not sure where to start.