Closed: rty813 closed this issue 2 months ago.
How about memory usage on the subscription process? I would check the statistics for that process's address space. Besides, if the publisher and subscription are on the same host, it uses the shared memory transport and data sharing, so it would be nice to check the file system /dev/shm.
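For example, something along these lines could be used (the process name and the PID lookup via pidof are assumptions, adjust to how the node is actually started):

# resident and virtual memory of the subscription process
grep -E 'VmRSS|VmSize' /proc/$(pidof -s listener)/status
# shared memory segments created by the Fast DDS SHM transport / data sharing
ls -l /dev/shm/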
root@cn12 /h/orca# ll /dev/shm
total 0
They are on the same host, but I can capture the packets with tcpdump. Does this imply that they are not transferring data through shared memory?
I will further investigate the memory condition before the talker crashes.
@fujitatomoya When the program crashes, there is not much change in memory, with available memory always above 5GB. Besides, I checked /var/log/syslog and found that every time there is an issue, the following messages are logged. What does this indicate?
Mar 28 09:36:29 cn12 systemd[1]: Stopping User Manager for UID 1000...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Main User Target.
Mar 28 09:36:29 cn12 gvfsd[2708479]: A connection to the bus can't be made
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping D-Bus User Message Bus...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - Apple File Conduit monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - GNOME Online Accounts monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - digital camera monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - Media Transfer Protocol monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Virtual filesystem service - disk device monitor...
Mar 28 09:36:29 cn12 systemd[2708381]: Stopping Tracker file system data miner...
Mar 28 09:36:29 cn12 tracker-miner-fs[2708449]: Received signal:15->'Terminated'
Mar 28 09:36:29 cn12 tracker-miner-f[2708449]: Error while sending AddMatch() message: The connection is closed
Mar 28 09:36:29 cn12 tracker-miner-f[2708449]: message repeated 2 times: [ Error while sending AddMatch() message: The connection is closed]
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-udisks2-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - disk device monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: dbus.service: Killing process 2708602 (gdbus) with signal SIGKILL.
Mar 28 09:36:29 cn12 systemd[1]: run-user-1000-gvfs.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[3730]: run-user-1000-gvfs.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: run-user-1000-gvfs.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-mtp-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - Media Transfer Protocol monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-afc-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - Apple File Conduit monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-gphoto2-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - digital camera monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-goa-volume-monitor.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service - GNOME Online Accounts monitor.
Mar 28 09:36:29 cn12 systemd[2708381]: gvfs-daemon.service: Succeeded.
Mar 28 09:36:29 cn12 tracker-miner-fs[2708449]: OK
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Virtual filesystem service.
Mar 28 09:36:29 cn12 systemd[2708381]: dbus.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped D-Bus User Message Bus.
Mar 28 09:36:29 cn12 systemd[2708381]: tracker-miner-fs.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Tracker file system data miner.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Basic System.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Paths.
Mar 28 09:36:29 cn12 systemd[2708381]: ubuntu-report.path: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped Pending report trigger for Ubuntu Report.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Sockets.
Mar 28 09:36:29 cn12 systemd[2708381]: Stopped target Timers.
Mar 28 09:36:29 cn12 systemd[2708381]: dbus.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed D-Bus User Message Bus Socket.
Mar 28 09:36:29 cn12 systemd[2708381]: dirmngr.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG network certificate management daemon.
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent-browser.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent-extra.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent-ssh.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent (ssh-agent emulation).
Mar 28 09:36:29 cn12 systemd[2708381]: gpg-agent.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed GnuPG cryptographic agent and passphrase cache.
Mar 28 09:36:29 cn12 systemd[2708381]: pk-debconf-helper.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed debconf communication socket.
Mar 28 09:36:29 cn12 systemd[2708381]: pulseaudio.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed Sound System.
Mar 28 09:36:29 cn12 systemd[2708381]: snapd.session-agent.socket: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Closed REST API socket for snapd user session agent.
Mar 28 09:36:29 cn12 systemd[2708381]: Reached target Shutdown.
Mar 28 09:36:29 cn12 systemd[2708381]: systemd-exit.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[2708381]: Finished Exit the Session.
Mar 28 09:36:29 cn12 systemd[2708381]: Reached target Exit the Session.
Mar 28 09:36:29 cn12 systemd[1]: user@1000.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: Stopped User Manager for UID 1000.
Mar 28 09:36:29 cn12 systemd[1]: Stopping User Runtime Directory /run/user/1000...
Mar 28 09:36:29 cn12 systemd[3730]: run-user-1000.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: run-user-1000.mount: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: user-runtime-dir@1000.service: Succeeded.
Mar 28 09:36:29 cn12 systemd[1]: Stopped User Runtime Directory /run/user/1000.
Mar 28 09:36:29 cn12 systemd[1]: Removed slice User Slice of UID 1000.
> When the program crashes
That is new. So the listener crashes and you have a core file? It would be easier to see from the stack trace where exactly it crashes. (I thought the listener process was still present in /proc, but no callback is called...)
> every time there is an issue, the following messages are logged
I am not sure. Do you have a systemd service running as that user? Did the user log out (timeout?), which stops the user services and leads to this situation? Just guessing, I am not even sure, sorry.
I am expecting that you use 2 containers, one for the publisher and the other for the subscription? They would bind different IP addresses from the docker bridge and be assigned different namespaces, which does not allow them to use shared memory. I would try whether the problem happens on the host system and in a single container; that could tell where the problem sits.
Sorry, I do not have a direct solution, just some ideas I would try.
> That is new. So the listener crashes and you have a core file?
Sorry, my wording was not accurate. The program did not crash; there is no core file.
> I am expecting that you use 2 containers
I deployed both the listener and the talker inside the same container.
I guess that after the user logged out, some kind of timeout occurred after a while, causing a certain service to shut down, which then resulted in some sort of exception being thrown. But I didn't see any special service in the list.
> I deployed both the listener and the talker inside the same container.
That brings me to another question: why is the shared memory transport used at all... do you apply a Fast DDS configuration file?
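For reference, a custom Fast DDS profile is normally applied by exporting the path to an XML profiles file before launching the nodes; the path below is only illustrative:

# point Fast DDS at a custom XML profiles file (illustrative path)
export FASTRTPS_DEFAULT_PROFILES_FILE=/path/to/fastdds_profiles.xml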
I did not use a custom configuration. However, I'm not too concerned about this issue. I still want to know why the talker isn't working after a user logout. Are there any tools or methods for troubleshooting?
To be honest, I really do not know. Something I would try is gcore, and then checking the stack traces and threads.
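For example, a rough sketch (it assumes the nodes run under python3, that gdb/gcore are available in the container, and that the PID can be found with pidof):

# dump a core of the running listener without stopping it
gcore -o /tmp/listener.core $(pidof -s listener)
# print the backtrace of every thread from that core
gdb -batch -ex 'thread apply all bt' /usr/bin/python3 /tmp/listener.core.$(pidof -s listener)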
I have found a reproducible path: executing systemctl stop user-1000.slice as the root account triggers this issue after several seconds.
Additionally, the talker and listener are managed through the supervisor tool installed within the container.
I tried a different approach by running it with nohup and then disowning the process, but executing systemctl stop user-1000.slice still causes the same issue. Hence, supervisor should not be the cause.
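For reference, the detached run looked roughly like this (the script name and log path are assumed):

# start the node outside of supervisor, detached from the shell
nohup python3 talker.py > /tmp/talker.log 2>&1 &
disown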
From the debug logs, it appears that after the issue arises, the listener stops producing any log output, and there are no error messages. The talker continues to log output, and there are no error messages either.
...... # -----------------------listener log
[DEBUG] [1711705704.364687047] [rcl]: Subscription taking message
[DEBUG] [1711705704.364782793] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.365295022] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.365721395] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.464645330] [rcl]: Subscription taking message
[DEBUG] [1711705704.464741395] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.465181944] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.465583805] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.564610941] [rcl]: Subscription taking message
[DEBUG] [1711705704.564781983] [rcl]: Subscription take succeeded: true
[INFO] [1711705704.566888695] [minimal_subscriber]: I heard: " 199"
[DEBUG] [1711705704.567356892] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.567741281] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.664690761] [rcl]: Subscription taking message
[DEBUG] [1711705704.664784426] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.665222575] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.665615156] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.764817174] [rcl]: Subscription taking message
[DEBUG] [1711705704.764918967] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.765479166] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.765905922] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.864635999] [rcl]: Subscription taking message
[DEBUG] [1711705704.864724928] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.865188102] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.865633739] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.964703020] [rcl]: Subscription taking message
[DEBUG] [1711705704.964800749] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705704.965327187] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705704.965733815] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.064697176] [rcl]: Subscription taking message
[DEBUG] [1711705705.064806490] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705705.065299231] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.065718276] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.164746375] [rcl]: Subscription taking message
[DEBUG] [1711705705.164846696] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705705.165333038] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.165756115] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.264637588] [rcl]: Subscription taking message
[DEBUG] [1711705705.264741685] [rcl]: Subscription take succeeded: true
[DEBUG] [1711705705.265220155] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
[DEBUG] [1711705705.265621727] [rcl]: Initializing wait set with '1' subscriptions, '2' guard conditions, '0' timers, '0' clients, '6' services
# --------------------- log stopped here ------------------
I created a new user with ID 1001 inside the container, then used this user to start the ROS script and put it in the background. After that, when I executed systemctl stop user-1000.slice, everything worked fine. It seems to be related to mechanisms such as cgroups and systemd.
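One way to confirm which slice the nodes end up in (the PID lookup via pidof is an assumption):

# shows the cgroup path, e.g. .../user.slice/user-1001.slice/... when started by the new user
cat /proc/$(pidof -s listener)/cgroup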
> but executing systemctl stop user-1000.slice still causes the same issue.
That stops the user slice, so it affects everything under this user ID, and the user's runtime directory /run/user/UID is removed when the unit is stopped.
Maybe it is related to systemd units; a vendor SoC system could have customized user and system settings. That could be a reason, I am not so sure.
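Before stopping the slice, it might help to list what is actually grouped under it and the state of the session, for example:

# everything running under the user's slice
systemd-cgls -u user-1000.slice
# session and linger state of the user
loginctl user-status 1000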
I don't know why, but today during testing I found that it started transmitting data through SHM; the shared memory files exist under /dev/shm. In this situation, after executing systemctl stop user-1000.slice, the SHM files under /dev/shm were all deleted. Through the auditd audit logs, I found that the talker and listener themselves deleted the shared memory files.
root@cn12 /h/orca [0|1]# ausearch -k shm_monitor | aureport -s
Syscall Report
=======================================
# date time syscall pid comm auid event
=======================================
1. 03/30/24 10:00:47 56 3701507 touch -1 22
2. 03/30/24 10:01:00 56 3705456 listener -1 33
3. 03/30/24 10:01:00 52 3705456 listener -1 34
4. 03/30/24 10:01:00 35 3705456 listener -1 35
5. 03/30/24 10:01:00 35 3705456 listener -1 36
6. 03/30/24 10:01:00 56 3705456 listener -1 37
7. 03/30/24 10:01:00 52 3705456 listener -1 38
8. 03/30/24 10:01:00 56 3705456 listener -1 39
9. 03/30/24 10:01:00 37 3705456 listener -1 40
10. 03/30/24 10:01:00 35 3705456 listener -1 41
11. 03/30/24 10:01:00 35 3705456 listener -1 42
12. 03/30/24 10:01:00 35 3705456 listener -1 43
13. 03/30/24 10:01:00 56 3705456 listener -1 44
14. 03/30/24 10:01:00 52 3705456 listener -1 45
15. 03/30/24 10:01:01 56 3705821 talker -1 58
16. 03/30/24 10:01:01 52 3705821 talker -1 59
17. 03/30/24 10:01:01 35 3705821 talker -1 60
18. 03/30/24 10:01:01 35 3705821 talker -1 61
19. 03/30/24 10:01:01 56 3705821 talker -1 62
20. 03/30/24 10:01:01 52 3705821 talker -1 63
21. 03/30/24 10:01:01 56 3705821 talker -1 64
22. 03/30/24 10:01:01 37 3705821 talker -1 65
23. 03/30/24 10:01:01 35 3705821 talker -1 66
24. 03/30/24 10:01:01 56 3705821 talker -1 67
25. 03/30/24 10:01:01 35 3705821 talker -1 68
26. 03/30/24 10:01:01 56 3705821 talker -1 69
27. 03/30/24 10:01:01 56 3705821 talker -1 70
28. 03/30/24 10:01:01 37 3705821 talker -1 71
29. 03/30/24 10:01:01 35 3705821 talker -1 72
30. 03/30/24 10:01:01 35 3705821 talker -1 73
31. 03/30/24 10:01:01 35 3705821 talker -1 74
32. 03/30/24 10:01:01 56 3705821 talker -1 75
33. 03/30/24 10:01:01 52 3705821 talker -1 76
34. 03/30/24 10:01:01 56 3705821 talker -1 77
35. 03/30/24 10:01:01 37 3705821 talker -1 78
36. 03/30/24 10:01:01 35 3705821 talker -1 79
37. 03/30/24 10:01:01 56 3705821 talker -1 80
38. 03/30/24 10:01:01 35 3705821 talker -1 81
39. 03/30/24 10:01:01 56 3705821 talker -1 82
40. 03/30/24 10:01:01 56 3705456 listener -1 83
41. 03/30/24 10:01:01 56 3705456 listener -1 84
42. 03/30/24 10:01:01 37 3705456 listener -1 85
43. 03/30/24 10:01:01 35 3705456 listener -1 86
44. 03/30/24 10:01:01 56 3705456 listener -1 87
45. 03/30/24 10:01:01 35 3705456 listener -1 88
46. 03/30/24 10:01:01 56 3705456 listener -1 89
47. 03/30/24 10:01:01 56 3705821 talker -1 90
48. 03/30/24 10:01:57 35 3705456 listener -1 107 # <--------------------- syscall 35 is unlinkat (file deletion) on aarch64
49. 03/30/24 10:01:57 35 3705456 listener -1 108
50. 03/30/24 10:01:57 35 3705821 talker -1 109
51. 03/30/24 10:01:57 35 3705821 talker -1 110
This is one of the detailed logs of ausearch.
time->Sat Mar 30 10:01:57 2024
type=PROCTITLE msg=audit(1711764117.836:109): proctitle=2F7573722F62696E2F707974686F6E33002F686F6D652F6F7263612F77732F696E7374616C6C2F70795F7075627375622F6C69622F70795F7075627375622F74616C6B6572
type=PATH msg=audit(1711764117.836:109): item=1 name="/dev/shm/fastrtps_261c7b3421ee9950_el" inode=26424 dev=00:19 mode=0100644 ouid=1000 ogid=1000 rdev=00:00 obj=unlabeled nametype=DELETE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1711764117.836:109): item=0 name="/dev/shm/" inode=1 dev=00:19 mode=041777 ouid=0 ogid=0 rdev=00:00 obj=unlabeled nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1711764117.836:109): cwd="/home/orca"
type=SYSCALL msg=audit(1711764117.836:109): arch=c00000b7 syscall=35 success=yes exit=0 a0=ffffffffffffff9c a1=ffff88000c40 a2=0 a3=25 items=2 ppid=3705454 pid=3705821 auid=4294967295 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=(none) ses=4294967295 comm="talker" exe="/usr/bin/python3.10" subj=kernel key="shm_monitor"
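For reference, the records above came from a watch on /dev/shm; the exact rule that was configured is not shown here, but a watch along these lines would tag such deletions with the shm_monitor key:

# audit writes and attribute changes (including deletions) under /dev/shm
auditctl -w /dev/shm/ -p wa -k shm_monitor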
So, is it possible that after stopping user-1000.slice, the processes received some kind of signal, resulting in the shared memory being closed? However, I didn't see any logs regarding this.
I conducted multiple experiments again and deconstructed the systemctl stop user-1000.slice command.
First, I changed the StopWhenUnneeded setting in the user-1000.slice configuration to no, to ensure that the slice would not stop automatically. Then I executed systemctl stop user@1000.service and umount /run/user/1000 in sequence, and the program continued to run normally.
Then I logged out of SSH and logged back in immediately, and the program was still running normally.
However, if I stay logged out of SSH for a certain period of time, the program stops running, even though user-1000.slice is still in an active state. I have no idea how the SSH session is related to the ROS processes.
I solved it with the command loginctl enable-linger orca.
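Enabling lingering keeps the user's systemd user manager, and therefore the user's processes, alive after the last session closes. It can be verified like this:

loginctl enable-linger orca
loginctl show-user orca --property=Linger   # should report Linger=yes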
Bug report
Required Info:
ros:humble-ros-base image
Steps to reproduce issue
Expected behavior
The program keeps running normally.
Actual behavior
Four hours later, the listener node no longer prints logs, and messages cannot be received through ros2 topic echo topic either. After restarting the talker, everything returns to normal. If I use CycloneDDS, there are no problems either.
Additional information
When running normally, data packets can be captured with tcpdump. cmdline:
tcpdump -i any -X udp portrange 7401-7500
After the talker crashes, tcpdump cannot capture any packets.
talker.py:
Listener.py: