ros-drivers / rosserial

A ROS client library for small, embedded devices, such as Arduino. See: http://wiki.ros.org/rosserial
523 stars 527 forks source link

UDP socket intermittent read errors #255

Open chuck-h opened 8 years ago

chuck-h commented 8 years ago

I am trying out rosserial_server over UDP (ref #245), mostly successful but getting intermittent errors on messages coming from the remote system.

The error appears when AsyncReadBuffer::read() is called for N bytes and does not have enough buffered data to satisfy the call. It then calls boost::asio::async_read() on the UDP stream, as it should. While this usually (99.9%+ of the time) works correctly, occasionally a chunk of data is missing (e.g. there are 45 characters that appeared in the UDP stream but were dropped on the way to boost::asio::async_read()). I can verify that the UDP data stream is intact using Wireshark. The missing data chunks do not line up with UDP packet boundaries.

Environment is Ubuntu 14.04 LTS, ROS jade distro. Hardware:

[Ubuntu laptop] <-- 100baseT --> [Netburner SBL2e] <-- logic-level 115.2kBaud --> [embedded ARM]

Any suggestions on how to instrument/debug this further? Any configuration parameters I should be tweaking? Can anyone tell me what programs & modules make up the stack between where wireshark sniffs it and where boost::asio shows it to the application?

mikepurvis commented 8 years ago

Hi @chuck-h— thanks for giving the UDP support a try! As you can see, it's totally undocumented and experimental, but we're using it internally at Clearpath, so it's great to have other users trying it out as well and finding these kinds of bugs.

In our use case, the embedded chip is speaking UDP natively, rather than through a bridge, so we have a guarantee that it's one datagram per message, though you say that the missing data is arbitrary and not on packet boundaries.

The main thing I can think of for this is set up a series of tests with a bunch of topics ramming through lots of data on the loopback interface, and see if it's possible to reproduce it more quickly/reliably? That would give us a big leg up on diagnosing, particularly if it's a threading issue, either in rosserial_server or roscpp.

I don't know tons about the internal workings of boost::asio, to be honest, but my impression from reading the source is that it's a relatively thin layer over the native POSIX APIs. It doesn't maintain any of its own internal buffers or anything like that, so I wouldn't expect that to be a source of loss.

Are your findings so far just based on lining up a rosbag of sequence-numbered messages with a pcap, or something like that?

gavanderhoorn commented 7 years ago

Also interested in this to try and deal with bridging across a lossy wireless link. Has anyone already worked on a reusable implementation of the 'client' side for UDP?

gavanderhoorn commented 7 years ago

Just spend some time hacking a UDP 'transport' into embedded_linux_comms.c. It seems to work quite ok .. in one direction: from server->client things seem ok. But no matter what I try, I can't get a simple Int32 publisher to reliably, and with a sane period, publish a counter to the ROS pc.

@mikepurvis: have you guys ever experienced anything like this? I realise that without code I can't expect too insightful answers, but at this stage I'm just looking for "o yeah, you need to make sure you X" kind of suggestions.


Edit: two seconds after I posted this I noticed I forgot to O_NONBLOCK the UDP socket. That solved all the issues I was having.

mikepurvis commented 7 years ago

@gavanderhoorn Glad you got it sorted. We've kicked around the rosserial-as-multimaster-transport idea a few times, but it's never made it off the drawing board. Certainly the combination of UDP-everything + namespacing + single port all make it a pretty compelling story, so please keep me posted on how this works out for you.

gavanderhoorn commented 7 years ago

It's been working very nice so far, especially when the robot is in areas with bad wireless. As expected it starts to lose msgs, but for things like teleop that doesn't matter too much (timeout on last cmd).

I should really open a new issue, but what is annoying is the lack of support for parameters, services, etc in rosserial_server. But I guess I shouldn't be complaining, I should be fixing it :).

mikepurvis commented 7 years ago

There's some service support, and params wouldn't be hard to add; some work has been done on it already here: https://github.com/ros-drivers/rosserial/issues/229

The fundamental issue is that the service logic (and by extension parameters, since they're a "magic service") is extremely fragile in the face of unreliable comms. The client code isn't really set up to be asynchronous at all, so the service client is basically just a blocking loop that waits for a response (or a disconnection, which it does not report to the calling code at all):

https://github.com/ros-drivers/rosserial/blob/jade-devel/rosserial_client/src/ros_lib/ros/service_client.h#L61-L62

Not having some kind of explicit timeout/retry policy here has always been extremely unacceptable to me, hence the lack of proper support for it in the C++ server implementation.