micro-ROS / rmw_microxrcedds

RMW implementation using Micro XRCE-DDS middleware.
Apache License 2.0
30 stars 23 forks source link

microros reconnection failed. #264

Closed chrisjaxy closed 1 year ago

chrisjaxy commented 1 year ago

Describe the bug microros reconnection failed.

To Reproduce Steps to reproduce the behaviour: I found that when the microros connection fails or times out, it cannot be reconnected again. I tested pub, sub, client, services and added the log.

client client2 pub pub1 services services2 sub sub2

I found that the reason for the failure is the function get_memory. For failure, I made the following modifications to the code to make it reconnect successfully.

client_change pub_change pub_change2 pub_change3 pub_change4 services_change sub_change sub_change2 sub_change3 sub_change4

My code can refer to the demo program given by this link https://github.com/micro-ROS/micro_ros_setup/issues/434. I created 26 pubs, 22 subs, 4 services and 5 clients. The method of reproduction is to restart the microros during the process of rclc_service_init_best_effort, rclc_client_init_best_effort, rclc_publisher_init_best_effort, rclc_subscription_init_best_effort. The configuration of my colcon meta is as follows:

{
    "names": {
        "tracetools": {
            "cmake-args": [
                "-DTRACETOOLS_DISABLED=ON",
                "-DTRACETOOLS_STATUS_CHECKING_TOOL=OFF"
            ]
        },
        "rosidl_typesupport": {
            "cmake-args": [
                "-DROSIDL_TYPESUPPORT_SINGLE_TYPESUPPORT=ON"
            ]
        },
        "rcl": {
            "cmake-args": [
                "-DBUILD_TESTING=OFF",
                "-DRCL_COMMAND_LINE_ENABLED=OFF",
                "-DRCL_LOGGING_ENABLED=OFF"
            ]
        }, 
        "rcutils": {
            "cmake-args": [
                "-DENABLE_TESTING=OFF",
                "-DRCUTILS_NO_FILESYSTEM=ON",
                "-DRCUTILS_NO_THREAD_SUPPORT=ON",
                "-DRCUTILS_NO_64_ATOMIC=ON",
                "-DRCUTILS_AVOID_DYNAMIC_ALLOCATION=ON"
            ]
        },
        "microxrcedds_client": {
            "cmake-args": [
                "-DUCLIENT_PIC=OFF",
                "-DUCLIENT_PROFILE_UDP=OFF",
                "-DUCLIENT_PROFILE_TCP=OFF",
                "-DUCLIENT_PROFILE_DISCOVERY=OFF",
                "-DUCLIENT_PROFILE_SERIAL=OFF",
                "-UCLIENT_PROFILE_STREAM_FRAMING=ON",
                "-DUCLIENT_PROFILE_MULTITHREAD=ON",
                "-DUCLIENT_PROFILE_CUSTOM_TRANSPORT=ON"
            ]
        },
        "rmw_microxrcedds": {
            "cmake-args": [
                "-DRMW_UXRCE_MAX_NODES=1",
                "-DRMW_UXRCE_MAX_PUBLISHERS=30",
                "-DRMW_UXRCE_MAX_SUBSCRIPTIONS=25",
                "-DRMW_UXRCE_MAX_SERVICES=10",
                "-DRMW_UXRCE_MAX_CLIENTS=5",
                "-DRMW_UXRCE_MAX_HISTORY=2",
                "-DRMW_UXRCE_TRANSPORT=custom",
                "-DRMW_UXRCE_ENTITY_DESTROY_TIMEOUT=0",
                "-DRMW_UXRCE_TOPIC_NAME_MAX_LENGTH=100",
            ]
        }
    }
}

If you have successfully reproduced here and have better modifications, you can notify or tell me.

Expected behaviour After microros reconnection fails or times out, it can still reconnect successfully.

System information (please complete the following information):

pablogs9 commented 1 year ago

Are you destroying your entities when the disconnect is detected? Could you share your micro-ROS app code?

Check this: https://github.com/micro-ROS/micro_ros_arduino/blob/39107e4c81affd714db98d7b63655b843905b99a/examples/micro-ros_reconnection_example/micro-ros_reconnection_example.ino#L88-L92

chrisjaxy commented 1 year ago

Are you destroying your entities when the disconnect is detected? Could you share your micro-ROS app code?

Check this: https://github.com/micro-ROS/micro_ros_arduino/blob/39107e4c81affd714db98d7b63655b843905b99a/examples/micro-ros_reconnection_example/micro-ros_reconnection_example.ino#L88-L92

When the connection fails, I destroy the entities.I refer to the previous program on https://github.com/micro-ROS/micro_ros_setup/issues/434

rclc_support_t test_support;
rcl_node_t test_node;
rclc_executor_t test_executor;
rcl_allocator_t test_allocator;
rcl_publisher_t test_publisher;
rcl_subscription_t test_subscription;

std_msgs__msg__UInt8 pile_heartbeat;
sensor_msgs__msg__BatteryState battery;
bool micro_ros_init_successful = false;
bool micro_ros_destory_flag = false;

void battery_info(const void * msgin)
{
    const sensor_msgs__msg__BatteryState * msg = (const sensor_msgs__msg__BatteryState *)msgin;

    LOG_DEBUG("battert current : %1f ,percentage : %3f, status: %d", msg->current, msg->percentage, msg->power_supply_status);
}

bool create_entities()
{   
    test_allocator = rcl_get_default_allocator();

    // create init_options
    RCCHECK(rclc_support_init(&test_support, 0, NULL, &test_allocator));

    // create node
    RCCHECK(rclc_node_init_default(&test_node, "test", "", &test_support));

    // create publisher
    RCCHECK(rclc_publisher_init_default(
        &test_publisher,
        &test_node,
        ROSIDL_GET_MSG_TYPE_SUPPORT(std_msgs, msg, UInt8),
        "std_msgs_msg_Int8"));

    RCCHECK(rclc_subscription_init_best_effort(
        &test_subscription,
        &test_node,
        ROSIDL_GET_MSG_TYPE_SUPPORT(sensor_msgs, msg, BatteryState),
        "/battery"));

    // create executor
    test_executor = rclc_executor_get_zero_initialized_executor();
    RCCHECK(rclc_executor_init(&test_executor, &test_support.context, 1, &test_allocator));
    RCCHECK(rclc_executor_add_subscription(&test_executor, &test_subscription, &battery, battery_info, ON_NEW_DATA));

    micro_ros_init_successful = true;

    return true;
}

void destroy_entities()
{
    rcl_subscription_fini(&test_subscription, &test_node);
    rcl_publisher_fini(&test_publisher, &test_node);
    rcl_node_fini(&test_node);
    rclc_executor_fini(&test_executor);
    rclc_support_fini(&test_support);

    micro_ros_init_successful = false;
}

void StartDefaultTask(void *argument)
{
    bool pingFlag = false;
    MicroRosChannelConfig();

    while (1)
    {
        osDelay(5);
                if ((micro_ros_init_successful == false) && !pingFlag)
        {
            if (rmw_uros_ping_agent(50, 1) != RMW_RET_OK)
            {
                LOG_DEBUG("ping error!");
                osDelay(200);
                continue;
            }

            LOG_DEBUG("ping ok!");
            pingFlag = true;
        }

        if (!micro_ros_init_successful) {
            LOG_DEBUG("start create_entities!");
            if (create_entities() == false) {
                micro_ros_destory_flag = true;
                LOG_DEBUG("create_entities fail, destroy_entities!");
            } else {
                LOG_DEBUG("create_entities finish");
            }
        } else {
            rclc_executor_spin_some(&test_executor, RCL_MS_TO_NS(1));
        }

        if ((micro_ros_init_successful == true) && (micro_ros_destory_flag == true)){
            LOG_DEBUG("destroy_entities!");
            destroy_entities();
            micro_ros_destory_flag = false;
        }
        }
}
pablogs9 commented 1 year ago

Make sure that you are destroying entities correctly because this get_memory() failure is related to this entity's lack of destruction.

Here you have an approach using a state machine: https://github.com/micro-ROS/micro_ros_arduino/blob/a7fd2280324e9344180512b41d0cfab1cc072bff/examples/micro-ros_reconnection_example/micro-ros_reconnection_example.ino#L104-L127

chrisjaxy commented 1 year ago

Make sure that you are destroying entities correctly because this get_memory() failure is related to this entity's lack of destruction.

Here you have an approach using a state machine: https://github.com/micro-ROS/micro_ros_arduino/blob/a7fd2280324e9344180512b41d0cfab1cc072bff/examples/micro-ros_reconnection_example/micro-ros_reconnection_example.ino#L104-L127

I'm sure that I am destroying entities correctly. When I have destroyed entities, I check the size of xAvailableHeapSpaceInBytes in freertos, it's exactly the same as before connection. You can try to reproduce this problem. The method of reproduction is to restart the microros agent during the process of rclc_service_init_best_effort, rclc_client_init_best_effort, rclc_publisher_init_best_effort, rclc_subscription_init_best_effort.

pablogs9 commented 1 year ago

We will take a look

chrisjaxy commented 1 year ago

We will take a look

If you succeed or fail to reproduce this problem, please let me know.

pablogs9 commented 1 year ago

I have not been able to replicate this. I see that you show in your original message a bunch of images with changes in the rmw_microxrcedds. Is it possible to create a PR with those changes so we can analyze them?

chrisjaxy commented 1 year ago

I have not been able to replicate this. I see that you show in your original message a bunch of images with changes in the rmw_microxrcedds. Is it possible to create a PR with those changes so we can analyze them?

This is the modification we tried. https://github.com/micro-ROS/rmw_microxrcedds/pull/265. We created a number of pub, sub, clent and services, and then restarted the agent to reproduce the connection process. The connection process is fast, making it difficult to restart the agent just in the middle of the connection process. I think it may be easy to reproduce if the baud rate of the communication is reduced.

bjsowa commented 1 year ago

@pablogs9 I believe I'm facing the same issue. My code uses the state machine approach very similar to the example you provided. If the agent successfully connects and later disconnects, all of the entities are correctly destroyed. But, if the agent disconnects during entity creation, some of the entities are not correctly destroyed, leaving allocated items in the mempool.

I applied #265 from @chrisjaxy to humble branch which I'm currently using and that seems to fix the issue for me. Please consider applying these changes.

Acuadros95 commented 1 year ago

This should be fixed with https://github.com/micro-ROS/rmw_microxrcedds/pull/280.

Closing, reopen if the issue persist.