Closed: rty813 closed this issue 2 months ago.
How about memory usage on the subscription process? I would check the statistics for that process's address space. Besides, if the publisher and subscription are on the same host, it uses the shared memory transport and data sharing, so it would be nice to check the file system /dev/shm.
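For example, something along these lines could be used (the process name and the PID lookup via pidof are assumptions, adjust to how the node is actually started):

# resident and virtual memory of the subscription process
grep -E 'VmRSS|VmSize' /proc/$(pidof -s listener)/status
# shared memory segments created by the Fast DDS SHM transport / data sharing
ls -l /dev/shm/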
root@cn12 /h/orca# ll /dev/shm
total 0
They are on the same host, but I can capture the packets with tcpdump. Does this imply that they are not transferring data through shared memory?
I will further investigate the memory condition before the talker crashes.
@fujitatomoya When the program crashes, there is not much change in memory, with available memory always above 5GB. Besides, I checked /var/log/syslog and found that every time there is an issue, the following messages are logged. What does this indicate?
Mar 28 09:36:29 cn12 systemd[1]: Stopping User Manager for UID 1000...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Main User Target.
Mar 28 09:36:29 cn12 gvfsd[2708479]: A connection to the bus can't be made
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping D-Bus User Message Bus...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - Apple File Conduit monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - GNOME Online Accounts monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - digital camera monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - Media Transfer Protocol monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - disk device monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Tracker file system data miner...
Mar 28 09:36:29 cn12 tracker-miner-fs[2708449]: Received signal:15->'Terminated'
Mar 28 09:36:29 cn12 tracker-miner-f[2708449]: Error while sending AddMatch() message: The connection is closed
Mar 28 09:36:29 cn12 tracker-miner-f[2708449]: message repeated 2 times: [ Error while sending AddMatch() message: The connection is closed]
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-udisks2-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - disk device monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: dbus.service: Killing process 2708602 (gdbus) with signal SIGKILL.
Mar 28 09:36:29 cn12 systemd[1]: run-user-1000-gvfs.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[3730]: run-user-1000-gvfs.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: run-user-1000-gvfs.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-mtp-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - Media Transfer Protocol monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-afc-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - Apple File Conduit monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-gphoto2-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - digital camera monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-goa-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - GNOME Online Accounts monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-daemon.service: Succeeded.
Mar 28 09:36:29 cn12 tracker-miner-fs[2708449]: OK
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service.
Mar 28 09:36:29 cn12 systemd[2708381]: dbus.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped D-Bus User Message Bus.
Mar 28 09:36:29 cn12 systemd[2708381]: tracker-miner-fs.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Tracker file system data miner.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Basic System.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Paths.
Mar 28 09:36:29 cn12 systemd[2708381]: ubuntu-report.path: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Pending report trigger for Ubuntu Report.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Sockets.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Timers.
Mar 28 09:36:29 cn12 systemd[2708381]: dbus.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed D-Bus User Message Bus Socket.
Mar 28 09:36:29 cn12 systemd[2708381]: dirmngr.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG network certificate management daemon.
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent-browser.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent-extra.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent-ssh.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent and passphrase cache.
Mar 28 09:36:29 cn12 systemd[2708381]: pk-debconf-helper.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed debconf communication socket.
Mar 28 09:36:29 cn12 systemd[2708381]: pulseaudio.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed Sound System.
Mar 28 09:36:29 cn12 systemd[2708381]: snapd.session-agent.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed REST API socket for snapd user session agent.
Mar 28 09:36:29 cn12 systemd[2708381]: Reached target Shutdown.
Mar 28 09:36:29 cn12 systemd[2708381]: systemd-exit.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Finished Exit the Session.
Mar 28 09:36:29 cn12 systemd[2708381]: Reached target Exit the Session.
Mar 28 09:36:29 cn12 systemd[1]: user@1000.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: Stopped User Manager for UID 1000.
Mar 28 09:36:29 cn12 systemd[1]: Stopping User Runtime Directory /run/user/1000...
Mar 28 09:36:29 cn12 systemd[3730]: run-user-1000.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: run-user-1000.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: user-runtime-dir@1000.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: Stopped User Runtime Directory /run/user/1000.
Mar 28 09:36:29 cn12 systemd[1]: Removed slice User Slice of UID 1000.
> When the program crashes
That is new. So the listener crashes and you have a core file? It would be easier to see from the stack trace where exactly it crashes. (I thought the listener process was still present in /proc, but no callback is called...)
> every time there is an issue, the following messages are logged
I am not sure. Do you have a systemd service running as that user? Did the user log out (timeout?), which stops the user services and leads to this situation? Just guessing, I am not even sure, sorry.
I am expecting that you use 2 containers, one for the publisher and the other for the subscription? They would bind different IP addresses from the docker bridge and be assigned different namespaces, which does not allow them to use shared memory. I would try whether the problem happens on the host system and in a single container; that could tell where the problem sits.
Sorry, I do not have a direct solution, just some ideas I would try.
> That is new. So the listener crashes and you have a core file?
Sorry, my wording was not accurate. The program did not crash; there is no core file.
> I am expecting that you use 2 containers
I deployed both the listener and the talker inside the same container.
I guess that after the user logged out, some kind of timeout occurred after a while, causing a certain service to shut down, which then resulted in some sort of exception being thrown. But I didn't see any special service in the list.
> I deployed both the listener and the talker inside the same container.
That brings me to another question: why is the shared memory transport used at all... do you apply a Fast DDS configuration file?
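For reference, a custom Fast DDS profile is normally applied by exporting the path to an XML profiles file before launching the nodes; the path below is only illustrative:

# point Fast DDS at a custom XML profiles file (illustrative path)
export FASTRTPS_DEFAULT_PROFILES_FILE=/path/to/fastdds_profiles.xml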
I did not use a custom configuration. However, I'm not too concerned about this issue. I still want to know why the talker isn't working after a user logout. Are there any tools or methods for troubleshooting?
To be honest, I really do not know. Something I would try is gcore, and then checking the stack traces and threads.
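For example, a rough sketch (it assumes the nodes run under python3, that gdb/gcore are available in the container, and that the PID can be found with pidof):

# dump a core of the running listener without stopping it
gcore -o /tmp/listener.core $(pidof -s listener)
# print the backtrace of every thread from that core
gdb -batch -ex 'thread apply all bt' /usr/bin/python3 /tmp/listener.core.$(pidof -s listener)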
I have found a reproducible path: executing systemctl stop user-1000.slice as the root account triggers this issue after several seconds.
Additionally, the talker and listener are managed through the supervisor tool installed within the container.
I tried a different approach by running it with nohup and then disowning the process, but executing systemctl stop user-1000.slice still causes the same issue. Hence, supervisor should not be the cause.
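For reference, the detached run looked roughly like this (the script name and log path are assumed):

# start the node outside of supervisor, detached from the shell
nohup python3 talker.py > /tmp/talker.log 2>&1 &
disown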
From the debug logs, it appears that after the issue arises, the listener stops producing any log output, and there are no error messages. The talker continues to log output, and there are no error messages either.
...... # -----------------------listener log
[DEBUG] [1711705704.364687047] [rcl]: Subscription taking message
[DEBUG] [1711705704.364782793] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.365295022] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.365721395] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.464645330] [rcl]: Subscription taking message
[DEBUG] [1711705704.464741395] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.465181944] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.465583805] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.564610941] [rcl]: Subscription taking message
[DEBUG] [1711705704.564781983] [rcl]: Subscription take succeeded: true
[INFO] [1711705704.566888695] [minimal_subscriber]: I heard: " 199"
[DEBUG] [1711705704.567356892] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.567741281] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.664690761] [rcl]: Subscription taking message
[DEBUG] [1711705704.664784426] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.665222575] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.665615156] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.764817174] [rcl]: Subscription taking message
[DEBUG] [1711705704.764918967] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.765479166] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.765905922] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.864635999] [rcl]: Subscription taking message
[DEBUG] [1711705704.864724928] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.865188102] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.865633739] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.964703020] [rcl]: Subscription taking message
[DEBUG] [1711705704.964800749] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.965327187] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.965733815] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.064697176] [rcl]: Subscription taking message
[DEBUG] [1711705705.064806490] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705705.065299231] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.065718276] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.164746375] [rcl]: Subscription taking message
[DEBUG] [1711705705.164846696] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705705.165333038] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.165756115] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.264637588] [rcl]: Subscription taking message
[DEBUG] [1711705705.264741685] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705705.265220155] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.265621727] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
# --------------------- log stopped here ------------------
I created a new user with ID 1001 inside the container, then used this user to start the ROS script and put it in the background. After that, when I executed systemctl stop user-1000.slice, everything worked fine. It seems to be related to mechanisms such as cgroups and systemd.
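One way to confirm which slice the nodes end up in (the PID lookup via pidof is an assumption):

# shows the cgroup path, e.g. .../user.slice/user-1001.slice/... when started by the new user
cat /proc/$(pidof -s listener)/cgroup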
> but executing systemctl stop user-1000.slice still causes the same issue.
That stops the user slice, so it affects everything under this user ID, and the user's runtime directory /run/user/UID is removed when the unit is stopped.
Maybe it is related to systemd units; a vendor SoC system could have customized user and system settings. That could be a reason, I am not so sure.
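Before stopping the slice, it might help to list what is actually grouped under it and the state of the session, for example:

# everything running under the user's slice
systemd-cgls -u user-1000.slice
# session and linger state of the user
loginctl user-status 1000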
I don't know why, but today during testing I found that it started transmitting data through SHM; the shared memory files exist under /dev/shm. In this situation, after executing systemctl stop user-1000.slice, the SHM files under /dev/shm were all deleted. Through the auditd audit logs, I found that the talker and listener themselves deleted the shared memory files.
root@cn12 /h/orca [0|1]# ausearch -k shm_monitor | aureport -s
Syscall Report
=======================================
# date time syscall pid comm auid event
=======================================
1. 03/30/24 10:00:47 56 3701507 touch -1 22
2. 03/30/24 10:01:00 56 3705456 listener -1 33
3. 03/30/24 10:01:00 52 3705456 listener -1 34
4. 03/30/24 10:01:00 35 3705456 listener -1 35
5. 03/30/24 10:01:00 35 3705456 listener -1 36
6. 03/30/24 10:01:00 56 3705456 listener -1 37
7. 03/30/24 10:01:00 52 3705456 listener -1 38
8. 03/30/24 10:01:00 56 3705456 listener -1 39
9. 03/30/24 10:01:00 37 3705456 listener -1 40
10. 03/30/24 10:01:00 35 3705456 listener -1 41
11. 03/30/24 10:01:00 35 3705456 listener -1 42
12. 03/30/24 10:01:00 35 3705456 listener -1 43
13. 03/30/24 10:01:00 56 3705456 listener -1 44
14. 03/30/24 10:01:00 52 3705456 listener -1 45
15. 03/30/24 10:01:01 56 3705821 talker -1 58
16. 03/30/24 10:01:01 52 3705821 talker -1 59
17. 03/30/24 10:01:01 35 3705821 talker -1 60
18. 03/30/24 10:01:01 35 3705821 talker -1 61
19. 03/30/24 10:01:01 56 3705821 talker -1 62
20. 03/30/24 10:01:01 52 3705821 talker -1 63
21. 03/30/24 10:01:01 56 3705821 talker -1 64
22. 03/30/24 10:01:01 37 3705821 talker -1 65
23. 03/30/24 10:01:01 35 3705821 talker -1 66
24. 03/30/24 10:01:01 56 3705821 talker -1 67
25. 03/30/24 10:01:01 35 3705821 talker -1 68
26. 03/30/24 10:01:01 56 3705821 talker -1 69
27. 03/30/24 10:01:01 56 3705821 talker -1 70
28. 03/30/24 10:01:01 37 3705821 talker -1 71
29. 03/30/24 10:01:01 35 3705821 talker -1 72
30. 03/30/24 10:01:01 35 3705821 talker -1 73
31. 03/30/24 10:01:01 35 3705821 talker -1 74
32. 03/30/24 10:01:01 56 3705821 talker -1 75
33. 03/30/24 10:01:01 52 3705821 talker -1 76
34. 03/30/24 10:01:01 56 3705821 talker -1 77
35. 03/30/24 10:01:01 37 3705821 talker -1 78
36. 03/30/24 10:01:01 35 3705821 talker -1 79
37. 03/30/24 10:01:01 56 3705821 talker -1 80
38. 03/30/24 10:01:01 35 3705821 talker -1 81
39. 03/30/24 10:01:01 56 3705821 talker -1 82
40. 03/30/24 10:01:01 56 3705456 listener -1 83
41. 03/30/24 10:01:01 56 3705456 listener -1 84
42. 03/30/24 10:01:01 37 3705456 listener -1 85
43. 03/30/24 10:01:01 35 3705456 listener -1 86
44. 03/30/24 10:01:01 56 3705456 listener -1 87
45. 03/30/24 10:01:01 35 3705456 listener -1 88
46. 03/30/24 10:01:01 56 3705456 listener -1 89
47. 03/30/24 10:01:01 56 3705821 talker -1 90
48. 03/30/24 10:01:57 35 3705456 listener -1 107 # <--------------------- syscall 35 is unlinkat (file deletion) on aarch64
49. 03/30/24 10:01:57 35 3705456 listener -1 108
50. 03/30/24 10:01:57 35 3705821 talker -1 109
51. 03/30/24 10:01:57 35 3705821 talker -1 110
This is one of the detailed logs of ausearch.
time->Sat Mar 30 10:01:57 2024
type=PROCTITLE msg=audit(1711764117.836:109): proctitle=2F7573722F62696E2F707974686F6E33002F686F6D652F6F7263612F77732F696E7374616C6C2F70795F7075627375622F6C69622F70795F7075627375622F74616C6B6572
type=PATH msg=audit(1711764117.836:109): item=1 name="/dev/shm/fastrtps_261c7b3421ee9950_el" inode=26424 dev=00:19 mode=0100644 ouid=1000 ogid=1000 rdev=00:00 obj=unlabeled nametype=DELETE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1711764117.836:109): item=0 name="/dev/shm/" inode=1 dev=00:19 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=unlabeled nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1711764117.836:109): cwd="/home/orca"
type=SYSCALL msg=audit(1711764117.836:109): arch=c00000b7 syscall=35 success=yes exit=0 a0=ffffffffffffff9c a1=ffff88000c40 a2=0 a3=25 items=2 ppid=3705454 pid=3705821 auid=4294967295 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=(none) ses=4294967295 comm="talker" exe="/usr/bin/python3.10" subj=kernel key="shm_monitor"
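For reference, the records above came from a watch on /dev/shm; the exact rule that was configured is not shown here, but a watch along these lines would tag such deletions with the shm_monitor key:

# audit writes and attribute changes (including deletions) under /dev/shm
auditctl -w /dev/shm/ -p wa -k shm_monitor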
So, is it possible that after stopping user-1000.slice, the processes received some kind of signal, resulting in the shared memory being closed? However, I didn't see any logs regarding this.
I conducted multiple experiments again and deconstructed the systemctl stop user-1000.slice command.
First, I changed the StopWhenUnneeded setting in the user-1000.slice configuration to no, to ensure that the slice would not stop automatically. Then I executed systemctl stop user@1000.service and umount /run/user/1000 in sequence, and the program continued to run normally.
Then I logged out of SSH and logged back in immediately, and the program was still running normally.
However, if I stay logged out of SSH for a certain period of time, the program stops running, even though user-1000.slice is still in an active state. I have no idea how the SSH session is related to the ROS processes.
I solved it with the command loginctl enable-linger orca.
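Enabling lingering keeps the user's systemd user manager, and therefore the user's processes, alive after the last session closes. It can be verified like this:

loginctl enable-linger orca
loginctl show-user orca --property=Linger   # should report Linger=yes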
Bug report
Required Info:
ros:humble-ros-base image
Steps to reproduce issue
Expected behavior
The program keeps running normally.
Actual behavior
Four hours later, the listener node no longer prints logs, and messages cannot be received through ros2 topic echo topic either. After restarting the talker, everything returns to normal. If I use CycloneDDS, there are no problems either.
Additional information
When running normally, data packets can be captured with tcpdump. cmdline:
tcpdump -i any -X udp portrange 7401-7500
After the talker crashes, tcpdump cannot capture any packets.
talker.py:
Listener.py: