tocinc / rmw_coredx

CoreDX DDS integration layer for ROS2
Apache License 2.0

Messages Being Dropped Without Sleeping In Between Sends #11

Closed by nate-jackson 6 years ago

nate-jackson commented 7 years ago

I'm seeing an issue with a simple ROS node that sends two string messages, each on a separate topic, to a simple listener that just prints them out. The only somewhat reliable way I have found to get the listener to receive both messages is to sleep before sending the first, then sleep again before sending the second. That seems like a brittle, unreliable way to have to work. Is there a better way? I'm using the default QoS, which is supposed to give reliable delivery, so my concern is that packets are being dropped.

Sender node:

#include "rclcpp/rclcpp.hpp"
#include <iostream>
#include "std_msgs/msg/string.hpp"

using std_msgs::msg::String;

int main(int argc, char** argv)
{
    rclcpp::init(argc, argv);
    auto node = rclcpp::Node::make_shared("string_sender");

    auto pub1 = node->create_publisher<String>("topic1");
    auto pub2 = node->create_publisher<String>("topic2");

    String msg;
    msg.data = "my data";

    rclcpp::sleep_for(std::chrono::milliseconds(500)); //shouldn't have to do this
    std::cout << "sending 1..." << std::endl;
    pub1->publish(msg);

    rclcpp::sleep_for(std::chrono::milliseconds(500)); //shouldn't have to do this
    std::cout << "sending 2..." << std::endl;
    pub2->publish(msg);

    std::cout << "done sending" << std::endl;

    return 0;
}

Listener node:

#include <iostream>
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

using std_msgs::msg::String;
using std::cout;
using std::endl;

int main(int argc, char** argv)
{
    rclcpp::init(argc, argv);
    auto node = rclcpp::Node::make_shared("string_listener");

    auto sub1 = node->create_subscription<String>("topic1",
                                                  [](std::shared_ptr<String> msg)
                                                  {
                                                      cout << "got msg on topic1" << endl;
                                                  });
    auto sub2 = node->create_subscription<String>("topic2",
                                                  [](std::shared_ptr<String> msg)
                                                  {
                                                      cout << "got msg on topic2" << endl;
                                                  });

    std::cout << "Listening..." << std::endl;
    rclcpp::spin(node);
    std::cout << "Done listening" << std::endl;

    return 0;
}
nate-jackson commented 7 years ago

I understand why the first sleep might be needed, as perhaps the connection is not yet established. Hopefully there is a good way in ROS to determine when anyone is listening to a publisher.

But for the second problem, where the second message gets dropped if published immediately after the first, I'm starting to wonder if it's not the publisher but the subscribers. I'm going to create two listener nodes, each listening to one of the above topics, to see if messages still get dropped.

nate-jackson commented 7 years ago

I created the two separate nodes, one listening to each of the above topics. Without sleeping in between the publish calls, I am unable to get the second message. So the bug seems to be at least on the publish side; it could be partly both, not sure.

nate-jackson commented 7 years ago

I just tried sleeping much longer before publishing the messages, and back-to-back publishing works fine in that case with the separately listening nodes. So it looks like it was simply taking a long time to set up the connections.

We then found a way to determine from ROS 2 when we have subscribers: node->count_subscribers(topic)

We give it the topic, wait for that count to be greater than 0, then publish. That still wasn't enough on its own, but after also adding a spin_some at the end, both messages get published and the listening nodes receive them, without any unnecessary sleeps.

New sender:

int main(int argc, char** argv)
{
    rclcpp::init(argc, argv);
    auto node = rclcpp::Node::make_shared("string_sender");

    auto pub1 = node->create_publisher<String>("topic1");
    auto pub2 = node->create_publisher<String>("topic2");

    String msg1;
    msg1.data = "my data 1";

    String msg2;
    msg2.data = "my data 2";

    while (node->count_subscribers("topic1") <= 0 ||
            node->count_subscribers("topic2") <= 0)
    {
        std::cout << "Not enough subscribers yet (topic1 - "
                  << node->count_subscribers("topic1")
                  << " topic2 - "
                  << node->count_subscribers("topic2")
                  << ")\n";
        rclcpp::sleep_for(std::chrono::milliseconds(100));
    }

    std::cout << "subscribers are good (topic1 - "
              << node->count_subscribers("topic1")
              << " topic2 - "
              << node->count_subscribers("topic2")
              << ")\n";

    std::cout << "sending 1..." << std::endl;
    pub1->publish(msg1);

    std::cout << "sending 2..." << std::endl;
    pub2->publish(msg2);

    std::cout << "done sending, spinning" << std::endl;
    rclcpp::spin_some(node);
    return 0;
}
nate-jackson commented 7 years ago

When we have lots of nodes running, we don't see 100% reliable delivery (with the default QoS), so I'm reopening this.

BrannonKing commented 7 years ago

Does CoreDX set the size of the socket buffer or go with the system default?

nate-jackson commented 7 years ago

Socket buffers are dynamically resized unless you hardcode a size with an env var.

BrannonKing commented 6 years ago

I fixed this with this changeset: https://github.com/asirobots/rmw_coredx/commit/cb79fcc0d82c9cec7f59b3f68f2ead49eccb5067

nate-jackson commented 6 years ago

We are still seeing this in the newest rmw_coredx. We added in the changes from the changeset Brannon cited, but it's not fixing it. However, we are seeing similar, though less frequent, missed messages with other RMWs. I don't know if this is a valid test case anyway; I'm thinking we can just close it.