micro-ROS / micro-ROS-Agent

ROS 2 package using Micro XRCE-DDS Agent.
Apache License 2.0
97 stars 51 forks source link

Potential endianness issue with big-endian client, little endian Agent #197

Closed gavanderhoorn closed 11 months ago

gavanderhoorn commented 1 year ago

Issue template

Steps to reproduce the issue

This isn't going to be an sscce, but at a high-level:

  1. build Micro-CDR with -DCONFIG_BIG_ENDIANNESS=ON
  2. (for good measure) build the XRCE-DDS Client also with -DUCLIENT_BIG_ENDIANNESS=ON (although super build is not used)
  3. implement minimal micro-ROS Client / node
  4. start regular Docker Humble Agent (amd64)
  5. let Client attempt to connect to Agent

Expected behavior

Successful Client<->Agent connection.

Actual behavior

Agent prints:

[1687530290.460606] info     | UDPv4AgentWindows.cpp | eprosima::uxr::UDPv4Agent::init | running...                     | port: 8888
[1687530290.461309] info     | Root.cpp           | eprosima::uxr::Root::set_verbose_level | logger setup                   | verbose_level: 4
[1687530319.367687] error    | InputMessage.cpp   | eprosima::uxr::InputMessage::log_error | deserialization error          | buffer:
0000: 80 00 00 00 00 00 10 00 58 52 43 45 01 00 01 0F AA BB CC DD 81 01 00 01 00 00 00 01 00 00 00 07
0020: 75 78 72 5F 68 6C 00 10 00 00 00 07 30 31 30 30 30 30 00 00 01 FC
[1687530320.367278] error    | InputMessage.cpp   | eprosima::uxr::InputMessage::log_error | deserialization error          | buffer:
0000: 80 00 00 00 00 00 10 00 58 52 43 45 01 00 01 0F AA BB CC DD 81 01 00 01 00 00 00 01 00 00 00 07
0020: 75 78 72 5F 68 6C 00 10 00 00 00 07 30 31 30 30 30 30 00 00 01 FC
...

and so on.

Client fails to connect (rclc_support_init_with_options (..) returns 1).

Additional information

Comparing the second packet exchanged with the Agent with a working little-endian Client (different system), it looks like some of the fields are swapped (from a Wireshark capture, be: big endian, le: little endian):

   0        1        2        3        4        5        6        7        8        9        10       11       12       13
be 80000000 00001000 58524345 0100010f AABBCCDD 81010001 00000001 00000007 7578725f 686c0000 00000007 30313030 30300000 01fc
le 80000000 00011000 58524345 0100010f 12345678 81013d32 01000000 07000000 7578725f 686c002d 07000000 30313030 30300000 fc01

Note how the 6th, 7th, 10th and 13th groups are mirror images.

The first packet also seems to have the last 4 bytes swapped (0x00000002 vs 0x02000000).

Perhaps I misunderstood, but I expected CONFIG_BIG_ENDIANNESS=ON at the Micro-CDR level to take care of byte-swapping.

Is more configuration needed? Should the client be setup differently perhaps?


Edit: I've checked and Micro-CDR appears to be correctly configured/built by CMake. Relevant part of the generated config.h:

// ucdrEndianness defines
#define UCDR_MACHINE_ENDIANNESS UCDR_BIG_ENDIANNESS
gavanderhoorn commented 1 year ago

Friendly ping?

gavanderhoorn commented 1 year ago

Friendly ping.

@pablogs9 @Acuadros95 would you perhaps have any suggestions on a possible direction for debugging?

pablogs9 commented 1 year ago

Hello @gavanderhoorn, we should fit a time slot to take a look at that, but currently, we are quite busy. I'm going to add this to our internal roadmap we will take a look at some point.

Probably this error is an issue at Micro XRCE-DDS Client level not serializing/copying some header/payload properly when BE is configured, let us know if you find something. If not, we will check it later.

BTW if this issue is critical for your use case, feel free to ask for commercial support at info@eProsima.com or just ask me and we can schedule a meeting with our sales team.

gavanderhoorn commented 1 year ago

Ok, I'll take that into consideration.

But you can confirm that setting UCDR_MACHINE_ENDIANNESS should be sufficient in this case?

pablogs9 commented 1 year ago

Yes, it should be enought

Acuadros95 commented 11 months ago

@gavanderhoorn We found some issues on the Agent deserialization process with big-endian clients.

Please try out this PR with the fixes https://github.com/eProsima/Micro-XRCE-DDS-Agent/pull/336, will merge when your feedback is OK.

gavanderhoorn commented 11 months ago

@Acuadros95: absolutely fantastic. I had not yet found time to look into this myself -- and I also didn't expect you guys to be so quick to work on this.

Purely for effort already :100: :+1:

I'll test these changes and update you here -- or would you prefer I comment on the PR?

Acuadros95 commented 11 months ago

No problem!

I'll test these changes and update you here -- or would you prefer I comment on the PR?

Comment on the PR and we will close the issue when merged.

Acuadros95 commented 11 months ago

Closing as https://github.com/eProsima/Micro-XRCE-DDS-Agent/pull/336 is merged. Reopen if other issues are found

gavanderhoorn commented 10 months ago

@Acuadros95 @pablogs9: with more time to test, I'm getting to the point of actually sending data between the Client and the Agent and I believe I'm running into another endianness issue.

For context: same systems as described earlier in this issue (ie: BE Client, LE Agent, all Fast-DDS on the ROS 2 side), Humble on both sides. To rule out issues with our use of the micro-ROS API, I've ported the addtwoints_server and int32_publisher demos from micro-ROS-demos to run on the client hw. Main difference: we don't use Agent auto-detection, so some code from the configured_publisher was copied.

Registration of Client to Agent works immediately.

The int32_publisher starts publishing data, and Wireshark shows the data is as I would expect it. Example:

Click to expand ``` DDS-XRCE Protocol, sessionId: 0x81, streamId: 0x80 sessionId: 0x81 streamId: 0x80 (STREAMID_BUILTIN_RELIABLE) sequenceNr: 15 WRITE_DATA submessageId: 7 flags: 0x00, FORMAT_DATA .... 000. = DataFormat: FORMAT_DATA (0x0) .... ...0 = Endianness bit: Not set (0x0) submessageLength: 8 payload (WRITE_DATA_Payload_Data) request_id: 0019 object_id: 0005 (DATAWRITER) [topic: rt/std_msgs_msg_Int32] [type: std_msgs::msg::dds_::Int32_] data serialized_data: 00 00 00 0b ```

(this is when the publisher publishes the value 11 -- hence the Not set for Endianness bit, which is correct IIUC the DDS-XRCE standard)

On the ROS 2 side, ros2 topic echo /std_msgs_msg_Int32 however shows:

Click to expand ``` root@host:/# ros2 topic echo /std_msgs_msg_Int32 data: 134217728 --- data: 150994944 --- data: 167772160 --- data: 184549376 --- data: 201326592 --- data: 218103808 --- data: 234881024 --- data: 251658240 --- data: 268435456 --- data: 285212672 --- data: 301989888 --- data: 318767104 --- data: 335544320 --- data: 352321536 --- data: 369098752 --- data: 385875968 --- ```

Initially I thought this was garbage data (ie: unitialised memory, misaligned reads, etc), but actually:

Click to expand ``` 134217728 == 0x08000000 150994944 == 0x09000000 167772160 == 0x0A000000 184549376 == 0x0B000000 201326592 == 0x0C000000 218103808 == 0x0D000000 234881024 == 0x0E000000 251658240 == 0x0F000000 268435456 == 0x10000000 285212672 == 0x11000000 301989888 == 0x12000000 318767104 == 0x13000000 335544320 == 0x14000000 352321536 == 0x15000000 369098752 == 0x16000000 385875968 == 0x17000000 ```

Unfortunately I didn't capture the DDS side of the traffic (ie: between the Agent and the ros2 topic echo node).

Technically I believe this should be OK (the values do represent little-endian integers in some way). Not sure what's making ros2 topic echo print those 'strange' values.

Is this still a problem in our/the application code, or should payload data have been swapped automatically?


I can provide a Wireshark capture, but it would basically show what I explained above. The application code is int32_publisher from the micro-ROS-demos, but I could make it available if it'd help.


Edit: I've searched for mentions of endian, ENDIAN, swap, etc in rcl, rclc, rosidl_typesupport_microxrcedds, rcutils, rmw_microxrcedds and similar packages, but don't seem to find any. My assumption so far is that configuring the Client and Micro-CDR should suffice (as discussed earlier in this issue).


Edit 2: could it be the endianess information is lost when the Agent forwards the DDS-XRCE message? If ROS 2 'sees' a LE flag while the payload is BE, that would result in what ros2 topic echo shows me.

gavanderhoorn commented 10 months ago

Just captured the ROS 2 (Fast-)RTPS traffic for the same test (repeated it).

I can't be sure, but I believe something is not entirely correct wrt the endianness encapsulation in the messages sent by the Agent.

From the capture, an RTPS packet containing a DDS-XRCE payload forwarded by the Agent:

submessageId: DATA (0x15)
    Flags: 0x05, Data present, Endianness bit
        [...]
    [...]
    [Topic Information (from Discovery)]
        [topic: rt/std_msgs_msg_Int32]
        [typeName: std_msgs::msg::dds_::Int32_]
        [DCPSPublicationData In: 185]
    writerSeqNumber: 19
    serializedData
        encapsulation kind: CDR_LE (0x0001)
        encapsulation options: 0x0000
        serializedData: 00000012

Notice how encapculation kind is set to CDR_LE, while serializedData seems to contain a big-endian 0x12. Interpreting that as little-endian would result in 301989888, not 18.


Edit: looks like services have a similar problem. Payloads coming from an LE client are not swapped. The addtwoints_server will still add a and b, but it won't see swapped BE ints.

gavanderhoorn commented 10 months ago

I've spent some time trying to find where the encapsulation of specific (sub)messages is set, but I must admit that without some (architecture) documentation, there are a few too many 'moving parts' to quickly & easily determine what's going on exactly.

pablogs9 commented 10 months ago

Hello @gavanderhoorn, I'm back from holidays and I have this as TODO for today, let me check and I will report here probably today.

pablogs9 commented 10 months ago

Hello @gavanderhoorn I have drafted two approaches to solve this issue.

The main point is that the Micro XRCE-DDS "Generic" Typesupport always writes the Representation Header (inside the RTPS Data Payload) as Little Endian:

https://github.com/eProsima/Micro-XRCE-DDS-Agent/blob/40954c2379dfcb2bfa9f6ba22146c57c6742b5b7/src/cpp/types/TopicPubSubType.cpp#L31-L34

This line makes your Wireshark packet to decode Representation Header as:

        encapsulation kind: CDR_LE (0x0001)
        encapsulation options: 0x0000

Note that this "Generic" Typesupport is a Typesupport that serializes a byte array in order to allow the Micro XRCE-DDS Agent type agnostic. Using it, the Agent just forwards the Client's messages to the DDS Dataspace, with no ser/des procedure.

So we have two options here, they are independent.

Option 1: Make micro-ROS RMW to use always Little Endian

Branch:

This branch modifies the micro-ROS RMW to prepare the writing buffers and then ensure that the buffer endianness is LE. By default, the buffer preparation method returns a buffer that has the machine's endianness.

Note that after this endianness enforcement, the cdr_serialize is called, this call will use the Micro-CDR API. And the default calls to the Micro-CDR API will use the endianness configured in the ucdrBuffer.

Note that:

Option 2: Assume the payload to have the same endianness as the XRCE protocol

Branch:

As you know, the Micro XRCE-DDS Client library can be configured to use UCLIENT_BIG_ENDIANNESS. In general, this flag will make the XRCE sub-messages have the endianness flag enabled, indicating the XRCE communication is BE.

This does not mean that the payload (that is completely up to the application user) is not in BE nor in LE. But as I have mentioned in the beginning, the Micro XRCE-DDS Agent assumes LE.

This approach, adds a configuration parameter in the XRCE Session Creation procedure named uxr_be_pl. When a Micro XRCE-DDS Client starts a session with this parameter, it indicates that the client intends to serialize its payloads in BE.

In this draft approach, I assume that if the library is configured to use BE wire protocol (by means of UCLIENT_BIG_ENDIANNESS) it will also serialize the data in BE. This assumption may not be true.

So, when the Client sends the session's property uxr_be_pl, the Agent, will configure this client session as BE and will configure the Representation Header correctly.

Note that:


In general, those are the approaches I would take, what is your opinion? The first option can be merged directly, the second one will have more test/CI work.

Let me know your thoughts and your test results, so we can choose an option and go ahead.

gavanderhoorn commented 10 months ago

Thanks @pablogs9. You've confirmed my analysis, and also the two options I saw.

If I understand you correctly, approach two would basically come down to:

  1. XRCE Client sets BE session property
  2. XRCE Client sends BE data
  3. XRCE Agent receives BE data
  4. XRCE Agent sets encapsulation kind to CDR_BE
  5. ROS/regular DDS subscriber receives BE data, deserialises to desired endianness

action 4 would inform receiving entities data is in BE format, so they would use an appropriate deserialisation strategy to get data into the endianness they'd prefer/require (action 5).

Correct?

If yes, then my initial reaction would be that I'd find this approach 'cleaner'.

It would also seem to put most of the work (ie: swapping) on the consumer's side, which in the case of Micro-XRCE DDS seems to make sense. Consumers are most likely more powerful systems for which swapping operations might be less costly than for producers.

(and minor, but traditionally all network traffic was always BE)

In this draft approach, I assume that if the library is configured to use BE wire protocol (by means of UCLIENT_BIG_ENDIANNESS) it will also serialize the data in BE. This assumption may not be true.

Are there any platforms you support today which would require UCLIENT_BIG_ENDIANNESS to be ON, but then serialise data with a different endianness?

IIRC, there are systems that are BE for binaries (ie: opcodes are encoded BE) and LE for data, but again IIRC those are rather rare.

Just so IIUC: with wire protocol, you mean the DDS XRCE binary representation et al. defined by the DDS XRCE standard, correct? Not a custom one?

gavanderhoorn commented 10 months ago

Btw: the Client I can easily incorporate in the application. So approach 1 is easily tested.

Approach 2 also requires changes to the Agent. Would you have a workflow for building that from a different branch quickly/easily?

pablogs9 commented 10 months ago

Thanks @pablogs9. You've confirmed my analysis, and also the two options I saw.

If I understand you correctly, approach two would basically come down to:

  1. XRCE Client sets BE session property
  2. XRCE Client sends BE data
  3. XRCE Agent receives BE data
  4. XRCE Agent sets encapsulation kind to CDR_BE
  5. ROS/regular DDS subscriber receives BE data, deserialises to desired endianness

action 4 would inform receiving entities data is in BE format, so they would use an appropriate deserialisation strategy to get data into the endianness they'd prefer/require (action 5).

Correct?

That's it.

If yes, then my initial reaction would be that I'd find this approach 'cleaner'.

It would also seem to put most of the work (ie: swapping) on the consumer's side, which in the case of Micro-XRCE DDS seems to make sense. Consumers are most likely more powerful systems for which swapping operations might be less costly than for producers.

Agree with this.

(and minor, but traditionally all network traffic was always BE)

In this draft approach, I assume that if the library is configured to use BE wire protocol (by means of UCLIENT_BIG_ENDIANNESS) it will also serialize the data in BE. This assumption may not be true.

Are there any platforms you support today which would require UCLIENT_BIG_ENDIANNESS to be ON, but then serialise data with a different endianness?

I have never seen this case, but also it is true that AFAIK you are the only ones that use XRCE/micro-ROS in BE.

Just so IIUC: with wire protocol, you mean the DDS XRCE binary representation et al. defined by the DDS XRCE standard, correct? Not a custom one?

Yes, when UCLIENT_BIG_ENDIANNESS is set means that the DDS XRCE Wire Protocol (implemented in Micro XRCE-DDS). This makes no assumption on the transmitted payload as far as our implementation is type agnostic.

Btw: the Client I can easily incorporate in the application. So approach 1 is easily tested.

Let me know if it works.

Approach 2 also requires changes to the Agent. Would you have a workflow for building that from a different branch quickly/easily?

I would just checkout the branch and follow the classic CMake procedure of mkdir build; cd build; cmake ..; make. It shall be enough for testing this fix, you won't have micro-ROS Agent features (ROS graph handling, etc) but it should be enought for testing if your remote data consumers works well with the BE issue fix.

As far as you provide positive feedback on the solution, I will proceed with creating and merging a PR.

gavanderhoorn commented 10 months ago

Btw: the Client I can easily incorporate in the application. So approach 1 is easily tested.

Let me know if it works.

Just performed a quick test (with approach 1): I've cherry-picked https://github.com/micro-ROS/rmw_microxrcedds/commit/6c6962e2e8f77fc985c19bd50e47f134d57190a6 onto the humble branch (we're using Humble for now), rebuilt our libmicroros and linked that against int32_publisher and addtwoints_server.

As expected, all payloads coming from the client are now marked CDR_LE.

The int32_publisher worked: publishes LE integers now and ros2 topic echo also prints them correctly.

addtwoints_server didn't want to cooperate though: invoking the service causes the client to hang, the server just printing Service request value: 0 + 0., irrespective of what values I actually ask it to add.

Wireshark shows two LE uint64_ts being serialised by the Agent and sent to the client.

Client then stays 'silent' and does not send back any traffic any more.

I'll see if I can make the capture available tomorrow.

pablogs9 commented 10 months ago

Regarding the services you mention, we have the very same problem here: the transparent forwarding made by the Agent.

As you can see here, the Agent deserializes the received payload (DDS Dataspace -> XRCE path) skipping the representation header. This way, when the raw serialized data (the byte array) is forwarded to the XRCE Client, there is no endianness information.

This drives me to another solution

Option 3: XRCE Client incoming and outcoming payloads shall include Representation Header

Micro XRCE-DDS Agent shall not handle the representation header here and here and shall perfect forward what XRCE Client or remote DDS Writers sends.

This way the XRCE Client uses who is in charge of writing or reading the Representation Header and handling the payload correctly. This implies the following changes:

  1. Micro-XRCE-DDS-Gen shall add code for ser/des the Representation Header according to configured machine endianness in XRCE Client.
  2. Micro XRCE-DDS Agent shall avoid adding or removing the Representation header when handling incoming or outcoming payloads.
  3. rmw_microxrcedds shall
    • Add a Representation Header before calling micro-ROS type support
    • Take into account the received Representation Header when receiving data and configure the deserialization buffer accordingly.
  4. Micro-XRCE-DDS-Client examples needs to be updated

Considerations:

  1. Those changes will break compatibility in XRCE with previous versions, so this can be implemented in eProsima XRCE-DDS v3.
  2. This will break compatibility with micro-ROS currently active versions, so it can only be implemented in Rolling.
gavanderhoorn commented 10 months ago

I'm sure I haven't had sufficient :coffee: but would letting the Agent set the encapsulation based on the value set by the Client on the XRCE DDS traffic not be an option?

The current implementation may skip the bytes, but just checking whether the Client has set endianness bit to 0 and then settings the encapsulation kind to CDR_BE -- and vice versa -- doesn't seem like it would be as invasive as what you describe.

That would keep the Agent transparent I'd say, as it doesn't touch the payload, it just peeks at meta-data.

Of course this would mean that BE Clients might receive LE data, which would mean they'd have to swap, but there's no way around that. Similar for BE Clients sending BE data to LE Agents -- but we've discussed that earlier.

I'm sure I'm overlooking something important though.


Edit: and if all else fails, standardising everything on LE (or BE) and incurring the swap somewhere would also seem to solve this.

Perhaps support this as a CMake option, next to UCLIENT_BIG_ENDIANNESS. Then the choice is up to your clients (as in: customers, users), who could probably weigh the costs of swaps on their platform against performance loss better than we can (as in: an ad-hoc group of developers that's trying to make a best-effort guess for the unknown future).