ros2 / rmw_fastrtps

Implementation of the ROS Middleware (rmw) Interface using eProsima's Fast RTPS.
Apache License 2.0
154 stars 117 forks source link

Windows 10 shared memory related crash/hang #713

Open headlee opened 12 months ago

headlee commented 12 months ago

Bug report

Required Info:

Steps to reproduce issue

Sometimes when we run our launchfile one or more of our nodes will immediately die or seemingly hang. It seems to be caused by using Fast-RTPS shared memory on Windows...we are able to get around some of these issues by doing fastdds.bat shm clean but today we got the error pasted below and aren't sure how to proceed.

If anyone has any advice to make this reproducible and/or knows how to fix it that would be great. We're hoping that a new Windows Humble patch release with Fast-RTPS/Fast-DDS version bumps (which should solve this) could also help us out, but any advice in the meantime is appreciated!

Additional information

Error message:

[robot_controller.EXE-2] A subdirectory or file C:\ProgramData\eprosima\fastrtps_interprocess already exists.
[robot_controller.EXE-2] 
[robot_controller.EXE-2] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[robot_controller.EXE-2] This error state is being overwritten:
[robot_controller.EXE-2] 
[robot_controller.EXE-2]   '__create_participant failed to create participant, at C:\ci\ws\src\ros2\rmw_fastrtps\rmw_fastrtps_shared_cpp\src\participant.cpp:105, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:264'
[robot_controller.EXE-2] 
[robot_controller.EXE-2] with this new error message:
[robot_controller.EXE-2] 
[robot_controller.EXE-2]   'rcl node's rmw handle is invalid, at C:\ci\ws\src\ros2\rcl\rcl\src\rcl\node.c:416'
[robot_controller.EXE-2] 
[robot_controller.EXE-2] rcutils_reset_error() should be called after error handling to avoid this.
[robot_controller.EXE-2] <<<
[robot_controller.EXE-2] [ERROR] [1695302197.677919300] [rcl]: Failed to fini publisher for node: 1
[ERROR] [robot_controller.EXE-2]: process has died [pid 2704, exit code 3221226505, cmd 'C:\ws\resources\ros2_backend\ros_nodes\install\lib\robodk_yaskawa\robot_controller.EXE --ros-args -r __node:=robot_controller --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_a68uie0v --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_c4v5b50e --params-file C:\Users\daddy\AppData\Local\Temp\launch_params_4v7q3bh_'].

Fast-RTPS profile:

<?xml version="1.0" encoding="UTF-8"?>
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <profiles>

    <transport_descriptors>

      <!-- We still want the default UDP transport to be present -->
      <transport_descriptor>
        <transport_id>udp_transport</transport_id>
        <type>UDPv4</type>
      </transport_descriptor>

      <!-- We want a customized version of the SHM transport -->
      <transport_descriptor>
        <transport_id>shm_transport</transport_id>
        <type>SHM</type>
        <!--
            Configure the size in bytes of the shared memory segment.
        -->
        <segment_size>100000000</segment_size>
        <!--
            The following value should be lower than or equal to segment_size.
            The default value is shown here.
        -->
        <maxMessageSize>100000000</maxMessageSize>
      </transport_descriptor>

    </transport_descriptors>

    <!-- Default publisher profile -->
    <publisher profile_name="default publisher profile" is_default_profile="true">
      <qos>
        <publishMode>
          <kind>ASYNCHRONOUS</kind>
        </publishMode>
      </qos>

      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </publisher>
    <participant profile_name="default publisher profile" is_default_profile="true">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>

    <!-- Default subscriber profile -->
    <subscriber profile_name="default subscriber profile" is_default_profile="true">
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </subscriber>
    <participant profile_name="default subscriber profile" is_default_profile="true">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>

    <!-- Request subscriber profile for services -->
    <subscriber profile_name="service">
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </subscriber>
    <participant profile_name="service">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>

    <!-- Request publisher profile for clients -->
    <publisher profile_name="client">
      <qos>
        <publishMode>
          <kind>ASYNCHRONOUS</kind>
        </publishMode>
      </qos>
      <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </publisher>
    <participant profile_name="client">
      <rtps>
        <!--
            As customized versions of the default transports are used, the
            built-in ones should be disabled
        -->
        <useBuiltinTransports>false</useBuiltinTransports>

        <!-- Link the Transport Layer to the Participant -->
        <userTransports>
          <transport_id>shm_transport</transport_id>
        </userTransports>
      </rtps>
    </participant>
  </profiles>
</dds>
MiguelCompany commented 12 months ago

There is a humble patch release pending to update Fast DDS to v2.6.6, which has several improvements related to SHM and resilience to crashing nodes.

clalancette commented 11 months ago

@headlee We've now released Humble Patch release 5: https://github.com/ros2/ros2/releases/tag/release-humble-20230925 . Can you check to see if the problem you reported is fixed? Thanks.

headlee commented 10 months ago

The new patch release has definitely helped. We still see some intermittent issues but it's unclear at this point if they are due to fastrtps, ros daemon on Windows, or something else.

Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.

MiguelCompany commented 10 months ago

Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.

@headlee No downsides. You could still do that.

headlee commented 10 months ago

Is there any downside to us deleting the contents of the fastrtps_interprocess folder (C:\ProgramData\eprosima\fastrtps_interprocess) before every launch of our app? That has fixed issues in the past for us.

@headlee No downsides. You could still do that.

Great, thanks!

One more sort of related question: do you know if it's possible to change the directory that eprosima uses to store the shm files on Windows? I found datasharingqospolicy in the docs but I don't know if that's for normal shm or just the DataReader/DataWriter delivery. Haven't had a chance to test it out yet.

MiguelCompany commented 10 months ago

do you know if it's possible to change the directory that eprosima uses to store the shm files on Windows?

@headlee I'm afraid it cannot be changed. It is fixed here