Closed fujitatomoya closed 1 year ago
CC: @iuhilnehc-ynos @Barry-Xu-2018 @MiguelCompany
@fujitatomoya We have recently merged some changes to Fast DDS dealing with this exact problem. Could you verify if this is fixed with the current head of 2.6.x (i.e. https://github.com/eProsima/Fast-DDS/commit/3fffea5ab04fbf2241a32d51f1513886edf44c72) ?
@MiguelCompany
The version I am using is d0cc5a6d4cf124ed42b8df061cd857144e372888, which is two commits behind the latest version.
And these 2 commits aren't related to SHM codes.
At this version, this problem also can reproduce in my environment.
@Barry-Xu-2018 As the report stated Humble
as the version, I understood it was related to Fast DDS 2.6.x. Your comment seems to suggest this is also reproduced in Rolling
?
@MiguelCompany Yes. I also speculated whether subsequent updates have fixed this issue, so I tried using the rolling version.
@MiguelCompany CC: @fujitatomoya
While issue occurs,
There is one thread in the subscriber_member_function
that consumes almost 100% of CPU usage.
```shell top - 15:16:41 up 4:33, 2 users, load average: 14.46, 9.73, 8.67 Threads: 12 total, 1 running, 11 sleeping, 0 stopped, 0 zombie %Cpu(s): 6.2 us, 0.8 sy, 0.0 ni, 92.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st MiB Mem : 31804.4 total, 5983.9 free, 18137.0 used, 7683.6 buff/cache MiB Swap: 65536.0 total, 65475.2 free, 60.8 used. 13057.1 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 581173 chenlh 20 0 698376 105084 68464 R 99.9 0.3 0:26.56 subscriber_memb # consume 100% CPU usage 581140 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:09.58 subscriber_memb 581167 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.00 subscriber_memb 581168 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.00 subscriber_memb 581169 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.00 subscriber_memb 581170 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.85 subscriber_memb 581171 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.48 subscriber_memb 581172 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.00 subscriber_memb 581174 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.00 subscriber_memb 581175 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.19 subscriber_memb 581176 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.60 subscriber_memb 581177 chenlh 20 0 698376 105084 68464 S 0.0 0.3 0:00.03 subscriber_memb ```
Check the backtrace of the thread 581173
```cpp
(gdb) info thread
Id Target Id Frame
* 1 Thread 0x7f713f38e1c0 (LWP 581140) "subscriber_memb" __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55adaa858260) at ./nptl/futex-internal.c:57
2 Thread 0x7f713cc6e640 (LWP 581167) "subscriber_memb" __futex_abstimed_wait_common64 (private=
after debugging, we found that find_segement
open a segment named fastrtps_fed84726591b50b4
all the time.
Thread 8 "subscriber_memb" hit Breakpoint 3, eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55adaa065b10, id=...) at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1296
1296 segment = std::make_shared<SharedMemSegment>(boost::interprocess::open_only, segment_name);
(gdb) p segment_name
$3 = "fastrtps_fed84726591b50b4"
(gdb) c
Continuing.
Thread 8 "subscriber_memb" hit Breakpoint 3, eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55adaa065b10, id=...) at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1296
1296 segment = std::make_shared<SharedMemSegment>(boost::interprocess::open_only, segment_name);
(gdb) p segment_name
$4 = "fastrtps_fed84726591b50b4"
(gdb) c
Continuing.
Thread 8 "subscriber_memb" hit Breakpoint 3, eprosima::fastdds::rtps::SharedMemManager::find_segment (this=0x55adaa065b10, id=...) at /home/chenlh/Projects/ROS2/ros2-master/src/eProsima/Fast-DDS/src/cpp/rtps/transport/shared_mem/SharedMemManager.hpp:1296
1296 segment = std::make_shared<SharedMemSegment>(boost::interprocess::open_only, segment_name);
(gdb) p segment_name
$5 = "fastrtps_fed84726591b50b4"
but the file fastrtps_fed84726591b50b4
does not exist in /dev/shm
.
$ ls /dev/shm/fastrtps_fed84726591b50b4
ls: cannot access '/dev/shm/fastrtps_fed84726591b50b4': No such file or directory
At this scenario,subscriber always try to open deleted shared memeory file .
BTW, we think this problem is relevant to another issue https://github.com/ros2/ros2cli/issues/582#issuecomment-1321799828.
@Barry-Xu-2018 Thank you very much for the investigation. I was getting almost to the same place on my debugging session.
I've been testing some changes that solve the 100% CPU usage, but they do not always solve the main issue (i.e. step 4 sometimes succeeds, and sometimes fails).
We did a workaround on https://github.com/eProsima/Fast-DDS/commit/687104a269eb58491d7d9498390dd99543ff86e9 which fixes this issue. We are considering whether to add that commit into the supported branches, but we first need to evaluate the impact of incorporating those changes.
@MiguelCompany Thank you for your quick response.
The workaround https://github.com/eProsima/Fast-DDS/commit/687104a269eb58491d7d9498390dd99543ff86e9 is to avoid using SHM for metatraffic if there is other transport (such as UDP).
Could you briefly introduce what information belongs to metatraffic? If this information is transmitted through UDP, would it affect the performance of SHM transmission?
I found the definition on metatraffic in DDS spec.
So this information doesn't directly affect the performance of SHM transmission.
@Barry-Xu-2018 Correct. Metatraffic is basically discovery traffic.
@Barry-Xu-2018 I made some bugfixes on eProsima/Fast-DDS#3759, and made the meta-traffic be transmitted on UDP by default on eProsima/Fast-DDS#3753.
With those two on a ROS 2 rolling workspace, I could not reproduce this issue anymore.
@MiguelCompany
Thank you for the further correction on this issue.
In my environment, I also cannot reproduce this issue with these 2 patches.
BTW, those 2 patches will backport to the version of FastDDS used by Humbel. Right ?
In my environment, I also cannot reproduce this issue with these 2 patches.
neither can I. thanks for the patches.
BTW, those 2 patches will backport to the version of FastDDS used by Humbel. Right?
https://github.com/eProsima/Fast-DDS/pull/3753 includes ABI break change, but I am not sure Fast-DDS policy to backport. @MiguelCompany friendly ping?
@fujitatomoya There's no ABI break. We will backport both changes to the iron and humble branches
@fujitatomoya Changes have been backported to all the supported Fast DDS branches, so building ROS 2 from sources will have this issue fixed.
We have already released Fast DDS v2.6.6 (for humble), and will release v2.11.2 (rolling), and v2.10.2 (iron) this week or next.
We have already released Fast DDS v2.6.6 (for humble), and will release v2.11.2 (rolling), and v2.10.2 (iron) this week or next.
@MiguelCompany Any possibility of backportting this to v2.0.x(foxy) which is the last LTS ros2 version?
Thanks.
@MiguelCompany Any possibility of backportting this to v2.0.x(foxy) which is the last LTS ros2 version?
Unfortunately, no. Foxy is now End-of-Life, so we have no plans to do updates to it.
The good news is that Humble is the next LTS, which should have these fixes in it.
@fujitatomoya Patches to Fast DDS were merged, backported, and released. Do you think we can close this issue?
@MiguelCompany yeah, right. i will go ahead to close this one.
Bug report
Required Info:
Steps to reproduce issue
The following can be reproducible only if it is in the localhost. (shared memory transport is enabled)
Expected behavior
subscription can receive the message with procedure-4.
Actual behavior
subscription cannot receive the message with procedure-4.
Additional information
ros2 node list --no-daemon
does not show subscriber running in first terminal.