ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0
112 stars 91 forks source link

Publisher GID in message info of taken message does not match GID of source publisher #377

Open christophebedard opened 2 years ago

christophebedard commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

Have a publisher that publishes a message, and have a subscription that receives that message.

Normal inter-process setup, no shared memory or anything.

Expected behavior

These two are equal:

  1. Publisher GID in the rmw message info struct when the subscription takes a message
  2. GID of the rmw publisher (and underlying DDS writer GID/GUID; the conversion is trivial) that sent the message that gets taken by the subscription

when it's the same publisher.

Actual behavior

Publisher GIDs do not match.

Additional information

The GID of a publisher is recorded here in rmw_create_publisher(): https://github.com/ros2/rmw_cyclonedds/blob/35bdd1ea0805fc16dccdcf6d6f07e587135e9149/rmw_cyclonedds_cpp/src/rmw_node.cpp#L2457

I'm recording the source publisher GID here in rmw_take(): https://github.com/ros2/rmw_cyclonedds/blob/35bdd1ea0805fc16dccdcf6d6f07e587135e9149/rmw_cyclonedds_cpp/src/rmw_node.cpp#L3132-L3137 using this addition:

diff --git a/rmw_cyclonedds_cpp/src/rmw_node.cpp b/rmw_cyclonedds_cpp/src/rmw_node.cpp
index 681bb03..c81b08c 100644
--- a/rmw_cyclonedds_cpp/src/rmw_node.cpp
+++ b/rmw_cyclonedds_cpp/src/rmw_node.cpp
@@ -3134,6 +3134,7 @@ take_done:
     static_cast<const void *>(subscription),
     static_cast<const void *>(ros_message),
     (message_info ? message_info->source_timestamp : 0LL),
+    (message_info ? message_info->publisher_gid.data : nullptr),
     *taken);
   return RMW_RET_OK;
 }

Less than 10 lines above that, it's copying publication_handle from the dds_sample_info_t struct into into message_info->publisher_gid. I'm not sure whether that's actually a GID: https://github.com/ros2/rmw_cyclonedds/blob/35bdd1ea0805fc16dccdcf6d6f07e587135e9149/rmw_cyclonedds_cpp/src/rmw_node.cpp#L3116

I tried to do get_entity_gid(info.publication_handle, message_info->publisher_gid); instead of simply copying the values, but the GIDs don't seem to match either.

Note that I don't think this behaviour is tested anywhere in ROS 2. However, publisher GIDs seem to match correctly with rmw_fastrtps_cpp.

christophebedard commented 2 years ago

Note that I don't think this behaviour is tested anywhere in ROS 2.

I added a test to test_rmw_implementation to cover this: https://github.com/christophebedard/rmw_implementation/commit/76054846fa88a82da711d043322d15219f0f1311. However, the test passed on all RMW implementations in my workspace (except rmw_email_cpp :laughing:). To properly test and compare interprocess vs intraprocess, I've modified/created simple ping-pong test cases and was able to reproduce the issue:

  1. Simple node that publishes a message which is received by another node in another process.
  2. Same thing but the publisher and subscription are in the same node in the same process.

The publisher's GID is printed after it's created. The publisher GID from the message info data is printed in the subscription callback. These two should match. We run each test-case with rmw_fastrtps_cpp and rmw_cyclonedds_cpp.

Steps:

  1. Use the test-tracetools-pingpong-print-publisher-gid branch from ros2_tracing: https://gitlab.com/ros-tracing/ros2_tracing/-/commits/test-tracetools-pingpong-print-publisher-gid
  2. Build up-to test_tracetools
  3. Open two terminals and source workspace
  4. Test-case 1 (interprocess)
    1. rmw_fastrtps_cpp: GIDs match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/test_tracetools/test_ping
      original: pub gid: 1 15 138 60 134 102 14 96 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
      # Terminal 2
      $ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/test_tracetools/test_pong
      sub msg : pub gid: 1 15 138 60 134 102 14 96 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
    2. rmw_cyclonedds_cpp: GIDs do not match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/test_tracetools/test_ping
      original: pub gid: 255 131 30 82 99 164 26 64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      # Terminal 2
      $ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/test_tracetools/test_pong
      sub msg : pub gid: 239 202 37 164 203 102 85 109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  5. Test-case 2 (same process)
    1. rmw_fastrtps_cpp: GIDs match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_fastrtps_cpp ./build/test_tracetools/test_pingpong 
      original: pub gid: 1 15 138 60 54 105 11 201 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
      sub msg : pub gid: 1 15 138 60 54 105 11 201 1 0 0 0 0 0 19 3 0 0 0 0 0 0 0 0
    2. rmw_cyclonedds_cpp: GIDs match
      # Terminal 1
      $ RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ./build/test_tracetools/test_pingpong 
      original: pub gid: 146 105 131 154 211 166 10 147 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
      sub msg : pub gid: 146 105 131 154 211 166 10 147 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Conclusion: it's fine when pub/sub is within the same process (or node), but the publisher GIDs do not match in normal interprocess communications.

eboasson commented 2 years ago

I agree it is fair to call this discrepancy a bug, though prior to Feb 10 2020 you wouldn't have been able to observe this because originally there was nothing in the ROS 2 interface that required a globally unique identifier and everything in the Cyclone DDS RMW layer used the same identifier that you see here.

That identifier is what is called the "instance handle" for the DDS DataWriter as it is visible in the DDS discovery topics by taking a subscription to to the DCPSPublication topic, and what is made available to the data reader in the "publication_handle" of the SampleInfo (2.2.2.5.5 "SampleInfo Class" in the DDS spec).

The GUID is not in the standardised fields of this SampleInfo, and it is also not in Cyclone's. Adding it would make the sample info even more obese (IMHO) and assuming I find the time to restructure a few things in the innards of Cyclone, it won't be easily available internally for inclusion in the sample info either. The standards-conforming way to get it is to convert the "publication_handle" into the publisher's GUID and to that efficiently you'll need to maintain a mapping. Either in the Cyclone DDS RMW layer itself, or, more elegantly, in the dds_common package.

christophebedard commented 2 years ago

Thanks for the explanation!

That identifier is what is called the "instance handle" for the DDS DataWriter as it is visible in the DDS discovery topics by taking a subscription to to the DCPSPublication topic, and what is made available to the data reader in the "publication_handle" of the SampleInfo (2.2.2.5.5 "SampleInfo Class" in the DDS spec).

The standards-conforming way to get it is to convert the "publication_handle" into the publisher's GUID and to that efficiently you'll need to maintain a mapping. Either in the Cyclone DDS RMW layer itself, or, more elegantly, in the dds_common package.

Looks like most of this exists already: https://github.com/ros2/rmw_cyclonedds/blob/e1a1c62fbb14eda24cd5269ac789217a8d325d3d/rmw_cyclonedds_cpp/src/rmw_node.cpp#L876

Implementation-wise, is the mapping from dds_builtintopic_endpoint_t's participant_instance_handle to the GID obtained with convert_guid_to_gid() using dds_builtintopic_endpoint_t's participant_key? Like this here: https://github.com/ros2/rmw_cyclonedds/blob/e1a1c62fbb14eda24cd5269ac789217a8d325d3d/rmw_cyclonedds_cpp/src/rmw_node.cpp#L892