I'm trying to reproduce this but:
docker pull microros/micro-ros-agent:foxy
docker run -it --rm --net=host -v /dev:/dev --privileged microros/micro-ros-agent:foxy serial --dev /dev/serial/by-id/usb-Teensyduino_USB_Serial_8180040-if00 -v6
[1616400578.002739] info | TermiosAgentLinux.cpp | init | running... | fd: 3
[1616400578.002956] info | Root.cpp | set_verbose_level | logger setup | verbose_level: 6
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Could you reproduce this @jamoralp @Acuadros95? Can we upload the bad_alloc fix asap?
Hello @anaelle-sw could you confirm if the agent behaves in the same way with the new docker images:
docker pull microros/micro-ros-agent:foxy
docker pull microros/micro-ros-agent:rolling
We have found that the bad_alloc problem is not the only one that appears because of the versioning problem... @Acuadros95 has reported that he has also had agents freeze because of this error...
Let us know if you can replicate the behavior detailed here with these new images, if so, I will replicate your scenario here with our Teensy boards.
Thanks and sorry for the delay.
Hi @pablogs9
I still have the bad_alloc issue when launching a lot of ROS2 nodes on behalf of the agent, but no longer right after launching the agent as it sometimes used to happen.
To test the issues listed before, I loaded the minimal publisher example on the board. I tried both the Rolling and the Foxy dockers, and either left the RMW implementation as CycloneDDS or forced it to FastRTPS. With all of these environments I got the same results: with htop, I can see there is a process named micro_ros_agent that has been running for 30 minutes, using 100% of one of the CPU cores, and that I cannot kill.
So yes, unfortunately errors number 1, 2, and 3 from the first post are still happening with the new images.
Some points:
- Make sure you have done a docker pull of the foxy and rolling images.
- The bad_alloc still happens under rolling, because it is still on 2.0.0; I'm updating this. It should already be solved in foxy. Can you verify if, using the foxy image (even with the rest of the system on rolling), you are still having the bad_alloc?
About the bad_alloc issue, you are right: it only happened with the Rolling image.
But the freezing problems happen on both images
Can you confirm how much CPU the agent consumes during normal operation, before it freezes?
With the Rolling docker, it uses less than 3% of one CPU core. Do you want me to test it with the Foxy docker as well?
Yes please, just a measure with htop
Well with Foxy docker, htop gives around 2.6% CPU usage
And, if you update the rolling image as stated in https://github.com/micro-ROS/micro_ros_arduino/issues/40 to the newest rolling image (where I hope we have fixed the bad_alloc, as in the Foxy docker), are you still having the freezes?
I want to be sure that both problems are not related. Sorry for being so exhaustive.
No problem. I will also try on both setups we can use (dev computers and robots), to be sure the results are consistent, as the bad_alloc issue used to mainly happen on the robot.
So, the Rolling docker doesn't crash with the bad_alloc error anymore when launching several ROS2 nodes on behalf of the agent!
For the moment, I am not able to reproduce the freeze problem... I will let you know if it happens again.
However, the Rolling docker presents other problems. For instance, in our custom application, the board should publish on 3 topics, but it actually publishes on only one... I will try to understand better what happens, and will then describe the problem in more detail.
Ok, let's keep this issue open for reporting your use case.
Thanks a lot for all the feedback Anaelle.
Sorry, just a quick question: how can you use domain IDs with the new API? This way doesn't work anymore and throws the error 'rcl_node_options_t {aka struct rcl_node_options_t}' has no member named 'domain_id':
rcl_node_options_t node_ops = rcl_node_get_default_options();
node_ops.domain_id = 10;
CHECK_RC_LOG(rclc_node_init_with_options(&node, "micro_ros_teensy_node", "", &support, &node_ops));
Thanks for helping!
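(For reference, a minimal sketch of how the domain ID could be passed through the init options instead of the node options, assuming rcl_init_options_set_domain_id and rclc_support_init_with_options are available in the rcl/rclc versions used here; the node name and the RCCHECK macro are illustrative:)
// Sketch: set the DDS domain ID via the init options and hand them to rclc support.
rcl_allocator_t allocator = rcl_get_default_allocator();
rcl_init_options_t init_options = rcl_get_zero_initialized_init_options();
RCCHECK(rcl_init_options_init(&init_options, allocator));
RCCHECK(rcl_init_options_set_domain_id(&init_options, 10));
rclc_support_t support;
RCCHECK(rclc_support_init_with_options(&support, 0, NULL, &init_options, &allocator));
rcl_node_t node;
RCCHECK(rclc_node_init_default(&node, "micro_ros_teensy_node", "", &support));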
I reproduced problem number 3:
With htop, I noticed an agent running in a docker but I was not able to kill the process. After a computer reboot, it was gone. It only happened once.
This agent was launched from the Rolling docker, it had been running for an hour, and it was consuming 100% of one CPU core. I noticed this with ros2 topic echo, as some "old" topics were still present.
I was able to reproduce a freeze. It seems to be different from the one we had before, and it could explain why some topics aren't published in our custom application. I slightly modified the minimal publisher example to get this:
#include <micro_ros_arduino.h>
#include <stdio.h>
#include <rcl/rcl.h>
#include <rcl/error_handling.h>
#include <rclc/rclc.h>
#include <rclc/executor.h>
#include <std_msgs/msg/int32.h>
rcl_publisher_t publisher;
std_msgs__msg__Int32 msg;
rclc_executor_t executor;
rclc_support_t support;
rcl_allocator_t allocator;
rcl_node_t node;
#define LED_PIN 13
#define LOOP_RATE 50
#define RCCHECK(fn) { rcl_ret_t temp_rc = fn; if((temp_rc != RCL_RET_OK)){error_loop();}}
#define RCSOFTCHECK(fn) { rcl_ret_t temp_rc = fn; if((temp_rc != RCL_RET_OK)){}}
void error_loop(){
while(1){
digitalWrite(LED_PIN, !digitalRead(LED_PIN));
delay(100);
}
}
void setup() {
set_microros_transports();
pinMode(LED_PIN, OUTPUT);
digitalWrite(LED_PIN, HIGH);
delay(2000);
allocator = rcl_get_default_allocator();
//create init_options
RCCHECK(rclc_support_init(&support, 0, NULL, &allocator));
// create node
RCCHECK(rclc_node_init_default(&node, "micro_ros_arduino_node", "", &support));
// create publisher
RCCHECK(rclc_publisher_init_best_effort(
&publisher,
&node,
ROSIDL_GET_MSG_TYPE_SUPPORT(std_msgs, msg, Int32),
"micro_ros_arduino_node_publisher"));
// create executor
RCCHECK(rclc_executor_init(&executor, &support.context, 1, &allocator));
msg.data = 0;
}
void loop() {
unsigned long start_loop_time = micros();
RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
msg.data++;
// If loop rate is less than required, wait necessary time
while (micros() - start_loop_time < (1000000.0 / LOOP_RATE)) {
delayMicroseconds(10);
}
}
You can see that the differences from the publisher example are: the publisher is best effort, and the loop rate is controlled in loop() (no timer).
The agent is launched with:
sudo docker run -it --rm --net=host -v /dev:/dev --privileged microros/micro-ros-agent:rolling serial --dev /dev/teensy_elodie -v6
What happens is that the first messages (until ~55) are published without problem. The output is normal; for each message there is something like this:
[1616577273.477911] debug | SerialAgentLinux.cpp | recv_message | [==>> SER <<==] | client_key: 0x5083E048, len: 16, data:
0000: 81 01 33 00 07 01 08 00 00 3D 00 05 2F 00 00 00
[1616577273.478034] debug | DataWriter.cpp | write | [** <<DDS>> **] | client_key: 0x00000000, len: 4, data:
0000: 2F 00 00 00
[1616577273.487963] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
Then, just before the freeze, I stop receiving any messages and the agent output is:
[1616577273.688039] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577273.888206] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.088323] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.288420] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.488455] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.688621] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.888751] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.088789] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.288883] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.488954] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.689048] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.889266] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.089346] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.289498] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.489501] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.689684] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.889831] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577277.089945] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577277.289980] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
When the process is killed, it sometimes takes a while to return (~1 minute).
I was able to reproduce this on different computers, but only while having both the loop rate control in loop() and the best-effort publisher. Maybe it is some bandwidth-related problem? Here's my meta file for building, if necessary:
{
"names": {
"tracetools": {
"cmake-args": [
"-DTRACETOOLS_DISABLED=ON",
"-DTRACETOOLS_STATUS_CHECKING_TOOL=OFF"
]
},
"rosidl_typesupport": {
"cmake-args": [
"-DROSIDL_TYPESUPPORT_SINGLE_TYPESUPPORT=ON"
]
},
"rcl": {
"cmake-args": [
"-DBUILD_TESTING=OFF",
"-DRCL_COMMAND_LINE_ENABLED=OFF",
"-DRCL_LOGGING_ENABLED=OFF"
]
},
"rcutils": {
"cmake-args": [
"-DENABLE_TESTING=OFF",
"-DRCUTILS_NO_FILESYSTEM=ON",
"-DRCUTILS_NO_THREAD_SUPPORT=ON",
"-DRCUTILS_NO_64_ATOMIC=ON",
"-DRCUTILS_AVOID_DYNAMIC_ALLOCATION=ON"
]
},
"microxrcedds_client": {
"cmake-args": [
"-DUCLIENT_PIC=OFF",
"-DUCLIENT_PROFILE_UDP=OFF",
"-DUCLIENT_PROFILE_TCP=OFF",
"-DUCLIENT_PROFILE_DISCOVERY=OFF",
"-DUCLIENT_PROFILE_SERIAL=OFF",
"-UCLIENT_PROFILE_STREAM_FRAMING=ON",
"-DUCLIENT_PROFILE_CUSTOM_TRANSPORT=ON",
"-DUCLIENT_MAX_SESSION_CONNECTION_ATTEMPTS=3"
]
},
"rmw_microxrcedds": {
"cmake-args": [
"-DRMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT=0",
"-DRMW_UXRCE_MAX_NODES=1",
"-DRMW_UXRCE_MAX_PUBLISHERS=3",
"-DRMW_UXRCE_MAX_SUBSCRIPTIONS=2",
"-DRMW_UXRCE_MAX_SERVICES=1",
"-DRMW_UXRCE_MAX_CLIENTS=0",
"-DRMW_UXRCE_MAX_HISTORY=1",
"-DRMW_UXRCE_TRANSPORT=custom"
]
}
}
}
Thanks a lot for your help!
Edit: I also tried to add these options to the meta file but the result is the same:
"-DUCLIENT_SERIAL_TRANSPORT_MTU=128"
"-DRMW_UXRCE_STREAM_HISTORY=5"
Edit 2: Actually, if the timer from the publisher example is set to 20 milliseconds (to get the 50 Hz loop we would like), the freeze doesn't happen. So it may not be a bandwidth problem as I first thought, but something related to the use of micros(). Sorry, maybe I did something wrong when controlling the loop rate...
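(For reference, a sketch of what that timer-based variant might look like with the standard rclc timer API, using a 20 ms period; this is an assumption of the setup being described, and the callback name is illustrative:)
// Sketch: publish from a 20 ms rclc timer instead of pacing the rate in loop().
rcl_timer_t timer;

void timer_callback(rcl_timer_t * timer, int64_t last_call_time)
{
  (void) last_call_time;
  if (timer != NULL) {
    RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
    msg.data++;
  }
}

// Additions to setup(), after the publisher is created:
//   RCCHECK(rclc_timer_init_default(&timer, &support, RCL_MS_TO_NS(20), timer_callback));
//   RCCHECK(rclc_executor_init(&executor, &support.context, 1, &allocator));
//   RCCHECK(rclc_executor_add_timer(&executor, &timer));

void loop() {
  // Spinning the executor services both the timer and the XRCE session.
  RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
}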
Ok, I'm going to replicate this.
Questions:
1 - Yes, we use the USB port already provided on the Teensy.
2 - The Agent uses only 3% CPU on my computer (i7 10th gen).
I'm seeing the performance drop (100% CPU) on my machine because of an Agent serial port handling issue.
This sometimes happens on my computer when an agent cannot be killed and I have to reboot. The last time it happened was yesterday night. During ~20 tests launching/killing the agent this morning, it did not happen again.
Ok so:
Regarding the issue related to the code above, it's true that the client and agent suddenly stop communicating. Let me explain some points:
- With -DRMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT=0 you are telling micro-ROS to create entities using the output best-effort stream.
- With rclc_publisher_init_best_effort you are going to create a pub that uses the output best-effort stream to send the topic data when rcl_publish is called.
- You call rclc_executor_spin_some, but this executor is empty, so it is not interacting with the middleware because it does not have to...
So what is happening here is that the client needs to run the session from time to time. As a temporary solution you can add this line:
void loop() {
unsigned long start_loop_time = micros();
RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
msg.data++;
// Run the XRCE session a bit
rmw_wait(NULL, NULL, NULL, NULL, NULL, NULL, &((rmw_time_t){.sec = 0, .nsec = 1e6}));
// If loop rate is less than required, wait necessary time
while (micros() - start_loop_time < (1000000.0 / LOOP_RATE)) {
delayMicroseconds(10);
}
}
In any case, I'm going to investigate this, because the user should not be in charge of handling the XRCE communication. So, I will come back with a better solution.
I have found that the problem with the heartbeats is that if the client doesn't do an rmw_wait, it is not consuming the serial data that it receives from the agent. So, if the agent is sending heartbeats and the client is not consuming them, the Teensy freezes...
This solution also works:
void loop() {
unsigned long start_loop_time = micros();
RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
msg.data++;
while(Serial.available() > 0) {
char t = Serial.read();
}
// If loop rate is less than required, wait necessary time
while (micros() - start_loop_time < (1000000.0 / LOOP_RATE)) {
delayMicroseconds(10);
}
}
But not reading from the serial port should not freeze the board...
Also, when the agent does not respond to a Ctrl+C when you try to close it, it quits if I disconnect the Teensy board's USB, so it should be something related to an Agent thread waiting to close the serial port.
Let me know if this also works for you.
Thanks for the explanations! Both solutions work for me. Unfortunately, they didn't solve the problem of some topics not being published... I will give more details here once the problem is isolated.
when the agent does not respond to a control+C when you try to close it, if I disconnect the Teensy board USB it quits,
It works for me too!
I'll wait until the topic problem is isolated to test it here. Let me know when you have it.
Good morning @pablogs9!
Isolating the problem is more complex than I first thought... It seems to be specifically linked to our use case, as we use serial communication between the Teensy and some sonars and a battery management system. It used to work nicely before, with the exact same serial communications. The thing is that when both of these serial links are shunted and fake messages are published instead, the topics are published correctly, at the desired frequency. What I really don't get for now is that the behavior is not consistent if the agent is launched several times in a row without modifying the Teensy code... Sometimes the agent freezes, other times the topic for sonar messages isn't published... Well, I will let you know when it is more precise.
By the way, I tried to pull the latest version of micro-ROS Arduino this morning, but there is a build error:
cd ~/Arduino/libraries
git clone -b main https://github.com/micro-ROS/micro_ros_arduino.git
cd ~/Arduino/libraries/micro_ros_arduino
sudo docker pull microros/micro_ros_arduino_builder:rolling
sudo docker run -it --rm -v $(pwd):/arduino_project microros/micro_ros_arduino_builder:rolling -p teensy32
This last command failed:
Starting >>> rclc_lifecycle
--- stderr: rclc_lifecycle
CMake Warning at /uros_ws/firmware/mcu_ws/install/share/rcutils/cmake/ament_cmake_export_libraries-extras.cmake:116 (message):
Package 'rcutils' exports library 'dl' which couldn't be found
Call Stack (most recent call first):
/uros_ws/firmware/mcu_ws/install/share/rcutils/cmake/rcutilsConfig.cmake:41 (include)
/uros_ws/firmware/mcu_ws/install/share/rosidl_runtime_c/cmake/ament_cmake_export_dependencies-extras.cmake:21 (find_package)
/uros_ws/firmware/mcu_ws/install/share/rosidl_runtime_c/cmake/rosidl_runtime_cConfig.cmake:41 (include)
/uros_ws/firmware/mcu_ws/install/share/lifecycle_msgs/cmake/ament_cmake_export_dependencies-extras.cmake:21 (find_package)
/uros_ws/firmware/mcu_ws/install/share/lifecycle_msgs/cmake/lifecycle_msgsConfig.cmake:41 (include)
CMakeLists.txt:11 (find_package)
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c: In function 'rclc_make_node_a_lifecycle_node':
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:50:5: warning: passing argument 9 of 'rcl_lifecycle_state_machine_init' makes pointer from integer without a cast [-Wint-conversion]
true,
^
In file included from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/include/rclc_lifecycle/rclc_lifecycle.h:24:0,
from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:17:
/uros_ws/firmware/mcu_ws/install/include/rcl_lifecycle/rcl_lifecycle.h:244:1: note: expected 'const rcl_lifecycle_state_machine_options_t * {aka const struct rcl_lifecycle_state_machine_options_t *}' but argument is of type 'int'
rcl_lifecycle_state_machine_init(
^
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:41:23: error: too many arguments to function 'rcl_lifecycle_state_machine_init'
rcl_ret_t rcl_ret = rcl_lifecycle_state_machine_init(
^
In file included from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/include/rclc_lifecycle/rclc_lifecycle.h:24:0,
from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:17:
/uros_ws/firmware/mcu_ws/install/include/rcl_lifecycle/rcl_lifecycle.h:244:1: note: declared here
rcl_lifecycle_state_machine_init(
^
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c: In function 'rcl_lifecycle_node_fini':
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:233:13: error: too many arguments to function 'rcl_lifecycle_state_machine_fini'
rcl_ret = rcl_lifecycle_state_machine_fini(
^
In file included from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/include/rclc_lifecycle/rclc_lifecycle.h:24:0,
from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:17:
/uros_ws/firmware/mcu_ws/install/include/rcl_lifecycle/rcl_lifecycle.h:278:1: note: declared here
rcl_lifecycle_state_machine_fini(
^
make[2]: *** [CMakeFiles/rclc_lifecycle.dir/build.make:66: CMakeFiles/rclc_lifecycle.dir/src/rclc_lifecycle/rclc_lifecycle.c.obj] Error 1
make[1]: *** [CMakeFiles/Makefile2:108: CMakeFiles/rclc_lifecycle.dir/all] Error 2
make: *** [Makefile:144: all] Error 2
---
Failed <<< rclc_lifecycle [1.12s, exited with code 2]
Thanks for your help!
Hello @anaelle-sw, this is an RCLC problem because they have not updated their RCL API usage in Rolling... These are the changes: https://github.com/ros2/rcl/commit/e9b588d1ed4ea11bd667c25fb35e231883cadc90
Could you please open an issue in RCLC copying this same error? I will assign the maintainer of that repo: https://github.com/ros2/rclc
Hi @pablogs9, I am still trying to debug the unstable communication between the agent and our ROS2 application. In the meantime, I found out that our ROS2 application behaves differently depending on the DDS vendor we use. My colleagues and I would like to compare the behavior of the whole robot using FastDDS and using CycloneDDS, so we can decide which DDS we will use. So I would need to launch the agent while forcing the RMW implementation to one or the other.
Is there a way to launch the agent from the docker (sudo docker run -it --rm --net=host -v /dev:/dev --privileged microros/micro-ros-agent:rolling serial --dev /dev/teensy_elodie -v6) and force a particular DDS vendor? There is the --middleware option, but it doesn't allow the usage of CycloneDDS, is that right?
If it is not possible, I will build the agent from sources, set the environment variable RMW_IMPLEMENTATION, and launch it with ros2 run. Are there any known issues about the micro-ROS agent using CycloneDDS?
Thanks for your help!
Hello @anaelle-sw, sorry, but we don't provide a micro-ROS Agent that uses CycloneDDS in any case. Since the micro-ROS Agent relies on eProsima's Micro XRCE-DDS Agent, it will always use eProsima's FastDDS as middleware.
In this case, if the rest of the system uses another DDS vendor, we have to rely on DDS inter-vendor compatibility.
Thanks for your answer! Knowing this, I suspect that the unstable communication between the agent and ROS2 may come from mixing CycloneDDS and FastDDS, as we currently use CycloneDDS by default. To confirm this, we need to use FastDDS and check whether we get the same behavior as with CycloneDDS. But we are having some trouble using FastDDS with our ROS2 application for now, because of callback groups (which are used widely in our use case). As this might take time, I guess this issue can be closed for the moment. I will open it again if similar problems happen when using FastDDS on both sides (micro-ROS agent and ROS2 application).
I do not understand the relation between the DDS implementation and the callback groups. Note that you may choose between FastDDS and CycloneDDS in both distributions, Foxy and Rolling, with just an environment variable, cf. https://index.ros.org/doc/ros2/Tutorials/Working-with-multiple-RMW-implementations/
Yes, I am using an environment variable to switch from one vendor to the other. All I know is that when launching some simple nodes with callback groups and using FastDDS, the behavior differs depending on where the callback group executor is spun, which doesn't happen with CycloneDDS. But for now, I don't understand what is really causing this either.
That's an interesting detail about the callback group executor that I didn't know. Have you already opened an issue on this in https://github.com/ros2/rclcpp/?
Not yet, but I surely will. I am trying to understand the actual problem more precisely first.
Hello micro-ROS team! I am having trouble opening a stable session between my board and the micro-ROS agent when it is launched from the Foxy docker.
Setup
Steps to reproduce
Build micro-ROS Arduino for Foxy:
The meta file I use is just a bit modified in order to be able to use re-connection features:
Then I launch the micro-ROS agent with the board connected:
Then I upload some examples on the board, for instance re-connection example or the publisher example.
Issue description
I noticed that the behavior of the agent is really unstable; it would often do one of the following:
1 - I sometimes cannot open a session. In this case, the output is just:
I would try several times to reload the code on the board, or to re-launch the agent. It can take a dozen tries before the session is established. It especially happens with the publisher example. Is it possible that the custom meta file is interfering with applications that don't use the re-connection features?
2 - Once the session is open, everything runs smoothly, the agent output shows all the messages being published and I can get them with ros2 topic echo. But after a while, the agent just freezes and no messages are passed anymore. There is no error, but the agent output is stopped. Maybe the first case is also some kind of agent freeze.
3 - After a few tries launching the agent, I noticed with ros2 topic node that the node running on the board was present twice. With htop, I noticed an agent running in a docker but I was not able to kill the process. After a computer reboot, it was gone. It only happened once.
4 - I encountered more "specific" errors. For instance: the re-connection example doesn't always work, or a custom node only publishes one topic instead of three. But these may be related to the previous cases, so I can detail them later on if it is still relevant.
Sorry, it is a bit vague. Both cases (1) and (2) are really easy to reproduce since they happen really frequently, but the freeze seems to happen randomly and I was not able to pinpoint what causes it.
Thanks a lot for support!