I'm trying to reproduce this but:
docker pull microros/micro-ros-agent:foxy
docker run -it --rm --net=host -v /dev:/dev --privileged microros/micro-ros-agent:foxy serial --dev /dev/serial/by-id/usb-Teensyduino_USB_Serial_8180040-if00 -v6
[1616400578.002739] info | TermiosAgentLinux.cpp | init | running... | fd: 3
[1616400578.002956] info | Root.cpp | set_verbose_level | logger setup | verbose_level: 6
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Could you reproduce this @jamoralp @Acuadros95? Can we upload the bad_alloc fix asap?
Hello @anaelle-sw could you confirm if the agent behaves in the same way with the new docker images:
docker pull microros/micro-ros-agent:foxy
docker pull microros/micro-ros-agent:rolling
We have found that the bad_alloc problem is not the only one that appears because of the versioning problem... @Acuadros95 has reported that he has also had agents freeze because of this error...
Let us know if you can replicate the behavior detailed here with these new images, if so, I will replicate your scenario here with our Teensy boards.
Thanks and sorry for the delay.
Hi @pablogs9
I still have the bad_alloc issue when launching a lot of ROS2 nodes on behalf of the agent, but no longer right after launching the agent as it sometimes used to happen.
To test the issues listed before, I loaded the minimal publisher example on the board. I tried both the Rolling and the Foxy dockers, and either left the RMW implementation as CycloneDDS or forced it to FastRTPS. With all of these environments I got the same results: with htop, I can see there is a process named micro_ros_agent that has been running for 30 minutes, using 100% of one of the CPU cores, and that I cannot kill.
So yes, unfortunately errors number 1, 2, and 3 from the first post are still happening with the new images.
Some points:
- Make sure you have done a docker pull of the foxy and rolling images.
- The bad_alloc still happens under rolling, because it is still on 2.0.0; I'm updating this. It should already be solved in foxy. Can you verify if, using the foxy image (even with the rest of the system on rolling), you are still having the bad_alloc?
About the bad_alloc issue, you are right: it only happened with the Rolling image.
But the freezing problems happen on both images
Can you confirm how much CPU the agent consumes during normal operation, before it freezes?
With the Rolling docker, it uses less than 3% of one CPU core. Do you want me to test it with the Foxy docker as well?
Yes please, just a measure with htop
Well with Foxy docker, htop gives around 2.6% CPU usage
And, if you update the rolling image as stated in https://github.com/micro-ROS/micro_ros_arduino/issues/40 to the newest rolling image (where I hope we have fixed the bad_alloc, as in the Foxy docker), are you still having the freezes?
I want to be sure that both problems are not related. Sorry for being so exhaustive.
No problem. I will also try on both setups we can use (dev computers and robots), to be sure the results are consistent, as the bad_alloc issue used to mainly happen on the robot.
So, the Rolling docker doesn't crash with the bad_alloc error anymore when launching several ROS2 nodes on behalf of the agent!
For the moment, I am not able to reproduce the freeze problem... I will let you know if it happens again.
However, the Rolling docker presents other problems. For instance, in our custom application, the board should publish on 3 topics, but it actually publishes on only one... I will try to understand better what happens, and will then describe the problem in more detail.
Ok, let's keep this issue open for reporting your use case.
Thanks a lot for all the feedback Anaelle.
Sorry, just a quick question: how can you use domain IDs with the new API? This way doesn't work anymore and throws the error 'rcl_node_options_t {aka struct rcl_node_options_t}' has no member named 'domain_id':
rcl_node_options_t node_ops = rcl_node_get_default_options();
node_ops.domain_id = 10;
CHECK_RC_LOG(rclc_node_init_with_options(&node, "micro_ros_teensy_node", "", &support, &node_ops));
Thanks for helping!
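(For reference, a minimal sketch of how the domain ID could be passed through the init options instead of the node options, assuming rcl_init_options_set_domain_id and rclc_support_init_with_options are available in the rcl/rclc versions used here; the node name and the RCCHECK macro are illustrative:)
// Sketch: set the DDS domain ID via the init options and hand them to rclc support.
rcl_allocator_t allocator = rcl_get_default_allocator();
rcl_init_options_t init_options = rcl_get_zero_initialized_init_options();
RCCHECK(rcl_init_options_init(&init_options, allocator));
RCCHECK(rcl_init_options_set_domain_id(&init_options, 10));
rclc_support_t support;
RCCHECK(rclc_support_init_with_options(&support, 0, NULL, &init_options, &allocator));
rcl_node_t node;
RCCHECK(rclc_node_init_default(&node, "micro_ros_teensy_node", "", &support));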
I reproduced problem number 3:
With htop, I noticed an agent running in a docker but I was not able to kill the process. After a computer reboot, it was gone. It only happened once.
This agent was launched from the Rolling docker, it had been running for an hour, and it was consuming 100% of one CPU core. I noticed this with ros2 topic echo, as some "old" topics were still present.
I was able to reproduce a freeze. It seems to be different from the one we had before, and it could explain why some topics aren't published in our custom application. I slightly modified the minimal publisher example to get this:
#include <micro_ros_arduino.h>
#include <stdio.h>
#include <rcl/rcl.h>
#include <rcl/error_handling.h>
#include <rclc/rclc.h>
#include <rclc/executor.h>
#include <std_msgs/msg/int32.h>
rcl_publisher_t publisher;
std_msgs__msg__Int32 msg;
rclc_executor_t executor;
rclc_support_t support;
rcl_allocator_t allocator;
rcl_node_t node;
#define LED_PIN 13
#define LOOP_RATE 50
#define RCCHECK(fn) { rcl_ret_t temp_rc = fn; if((temp_rc != RCL_RET_OK)){error_loop();}}
#define RCSOFTCHECK(fn) { rcl_ret_t temp_rc = fn; if((temp_rc != RCL_RET_OK)){}}
void error_loop(){
while(1){
digitalWrite(LED_PIN, !digitalRead(LED_PIN));
delay(100);
}
}
void setup() {
set_microros_transports();
pinMode(LED_PIN, OUTPUT);
digitalWrite(LED_PIN, HIGH);
delay(2000);
allocator = rcl_get_default_allocator();
//create init_options
RCCHECK(rclc_support_init(&support, 0, NULL, &allocator));
// create node
RCCHECK(rclc_node_init_default(&node, "micro_ros_arduino_node", "", &support));
// create publisher
RCCHECK(rclc_publisher_init_best_effort(
&publisher,
&node,
ROSIDL_GET_MSG_TYPE_SUPPORT(std_msgs, msg, Int32),
"micro_ros_arduino_node_publisher"));
// create executor
RCCHECK(rclc_executor_init(&executor, &support.context, 1, &allocator));
msg.data = 0;
}
void loop() {
unsigned long start_loop_time = micros();
RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
msg.data++;
// If loop rate is less than required, wait necessary time
while (micros() - start_loop_time < (1000000.0 / LOOP_RATE)) {
delayMicroseconds(10);
}
}
You can see that the differences from the publisher example are: the publisher is best effort, and the loop rate is controlled in loop() (no timer).
The agent is launched with:
sudo docker run -it --rm --net=host -v /dev:/dev --privileged microros/micro-ros-agent:rolling serial --dev /dev/teensy_elodie -v6
What happens is that the first messages (until ~55) are published without problem. The output is normal; for each message there is something like this:
[1616577273.477911] debug | SerialAgentLinux.cpp | recv_message | [==>> SER <<==] | client_key: 0x5083E048, len: 16, data:
0000: 81 01 33 00 07 01 08 00 00 3D 00 05 2F 00 00 00
[1616577273.478034] debug | DataWriter.cpp | write | [** <<DDS>> **] | client_key: 0x00000000, len: 4, data:
0000: 2F 00 00 00
[1616577273.487963] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
Then, just before the freeze, I stop receiving any messages and the agent output is:
[1616577273.688039] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577273.888206] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.088323] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.288420] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.488455] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.688621] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577274.888751] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.088789] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.288883] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.488954] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.689048] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577275.889266] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.089346] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.289498] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.489501] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.689684] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577276.889831] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577277.089945] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
[1616577277.289980] debug | SerialAgentLinux.cpp | send_message | [** <<SER>> **] | client_key: 0x5083E048, len: 13, data:
0000: 81 00 00 00 0B 01 05 00 00 00 03 00 80
When the process is killed, it sometimes takes a while to return (~1 minute).
I was able to reproduce this on different computers, but only while having both the loop rate control in loop() and the best-effort publisher. Maybe it is some bandwidth-related problem? Here's my meta file for building, if necessary:
{
"names": {
"tracetools": {
"cmake-args": [
"-DTRACETOOLS_DISABLED=ON",
"-DTRACETOOLS_STATUS_CHECKING_TOOL=OFF"
]
},
"rosidl_typesupport": {
"cmake-args": [
"-DROSIDL_TYPESUPPORT_SINGLE_TYPESUPPORT=ON"
]
},
"rcl": {
"cmake-args": [
"-DBUILD_TESTING=OFF",
"-DRCL_COMMAND_LINE_ENABLED=OFF",
"-DRCL_LOGGING_ENABLED=OFF"
]
},
"rcutils": {
"cmake-args": [
"-DENABLE_TESTING=OFF",
"-DRCUTILS_NO_FILESYSTEM=ON",
"-DRCUTILS_NO_THREAD_SUPPORT=ON",
"-DRCUTILS_NO_64_ATOMIC=ON",
"-DRCUTILS_AVOID_DYNAMIC_ALLOCATION=ON"
]
},
"microxrcedds_client": {
"cmake-args": [
"-DUCLIENT_PIC=OFF",
"-DUCLIENT_PROFILE_UDP=OFF",
"-DUCLIENT_PROFILE_TCP=OFF",
"-DUCLIENT_PROFILE_DISCOVERY=OFF",
"-DUCLIENT_PROFILE_SERIAL=OFF",
"-UCLIENT_PROFILE_STREAM_FRAMING=ON",
"-DUCLIENT_PROFILE_CUSTOM_TRANSPORT=ON",
"-DUCLIENT_MAX_SESSION_CONNECTION_ATTEMPTS=3"
]
},
"rmw_microxrcedds": {
"cmake-args": [
"-DRMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT=0",
"-DRMW_UXRCE_MAX_NODES=1",
"-DRMW_UXRCE_MAX_PUBLISHERS=3",
"-DRMW_UXRCE_MAX_SUBSCRIPTIONS=2",
"-DRMW_UXRCE_MAX_SERVICES=1",
"-DRMW_UXRCE_MAX_CLIENTS=0",
"-DRMW_UXRCE_MAX_HISTORY=1",
"-DRMW_UXRCE_TRANSPORT=custom"
]
}
}
}
Thanks a lot for your help!
Edit: I also tried to add these options to the meta file but the result is the same:
"-DUCLIENT_SERIAL_TRANSPORT_MTU=128"
"-DRMW_UXRCE_STREAM_HISTORY=5"
Edit 2: Actually, if the timer from the publisher example is set to 20 milliseconds (to get the 50 Hz loop we would like), the freeze doesn't happen. So it may not be a bandwidth problem as I first thought, but something related to the use of micros(). Sorry, maybe I did something wrong when controlling the loop rate...
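(For reference, a sketch of what that timer-based variant might look like with the standard rclc timer API, using a 20 ms period; this is an assumption of the setup being described, and the callback name is illustrative:)
// Sketch: publish from a 20 ms rclc timer instead of pacing the rate in loop().
rcl_timer_t timer;

void timer_callback(rcl_timer_t * timer, int64_t last_call_time)
{
  (void) last_call_time;
  if (timer != NULL) {
    RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
    msg.data++;
  }
}

// Additions to setup(), after the publisher is created:
//   RCCHECK(rclc_timer_init_default(&timer, &support, RCL_MS_TO_NS(20), timer_callback));
//   RCCHECK(rclc_executor_init(&executor, &support.context, 1, &allocator));
//   RCCHECK(rclc_executor_add_timer(&executor, &timer));

void loop() {
  // Spinning the executor services both the timer and the XRCE session.
  RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
}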
Ok, I'm going to replicate this.
Questions:
1 - Yes, we use the USB port already provided on the Teensy.
2 - The Agent uses only 3% CPU on my computer (i7 10th gen).
I'm seeing the performance drop (100% CPU) on my machine because of an Agent serial port handling issue.
This sometimes happens on my computer when an agent cannot be killed and I have to reboot. The last time it happened was yesterday night. During ~20 tests launching/killing the agent this morning, it did not happen again.
Ok so:
Regarding the issue related to the code above, it's true that the client and agent suddenly stop communicating. Let me explain some points:
- With -DRMW_UXRCE_ENTITY_CREATION_DESTROY_TIMEOUT=0 you are telling micro-ROS to create entities using the output best-effort stream.
- With rclc_publisher_init_best_effort you are going to create a pub that uses the output best-effort stream to send the topic data when rcl_publish is called.
- You call rclc_executor_spin_some, but this executor is empty, so it is not interacting with the middleware because it does not have to...
So what is happening here is that the client needs to run the session from time to time. As a temporary solution you can add this line:
void loop() {
unsigned long start_loop_time = micros();
RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
msg.data++;
// Run the XRCE session a bit
rmw_wait(NULL, NULL, NULL, NULL, NULL, NULL, &((rmw_time_t){.sec = 0, .nsec = 1e6}));
// If loop rate is less than required, wait necessary time
while (micros() - start_loop_time < (1000000.0 / LOOP_RATE)) {
delayMicroseconds(10);
}
}
In any case, I'm going to investigate this, because the user should not be in charge of handling the XRCE communication. So, I will come back with a better solution.
I have found that the problem with the heartbeats is that if the client doesn't do an rmw_wait, it is not consuming the serial data that it receives from the agent. So, if the agent is sending heartbeats and the client is not consuming them, the Teensy freezes...
This solution also works:
void loop() {
unsigned long start_loop_time = micros();
RCSOFTCHECK(rclc_executor_spin_some(&executor, RCL_MS_TO_NS(100)));
RCSOFTCHECK(rcl_publish(&publisher, &msg, NULL));
msg.data++;
while(Serial.available() > 0) {
char t = Serial.read();
}
// If loop rate is less than required, wait necessary time
while (micros() - start_loop_time < (1000000.0 / LOOP_RATE)) {
delayMicroseconds(10);
}
}
But not reading from the serial port should not freeze the board...
Also, when the agent does not respond to a Ctrl+C when you try to close it, it quits if I disconnect the Teensy board's USB, so it should be something related to an Agent thread waiting to close the serial port.
Let me know if this also works for you.
Thanks for the explanations! Both solutions work for me. Unfortunately, they didn't solve the problem of some topics not being published... I will give more details here once the problem is isolated.
when the agent does not respond to a control+C when you try to close it, if I disconnect the Teensy board USB it quits,
It works for me too!
I'll wait until the topic problem is isolated to test it here. Let me know when you have it.
Good morning @pablogs9!
Isolating the problem is more complex than I first thought... It seems to be specifically linked to our use case, as we use serial communication between the Teensy and some sonars and a battery management system. It used to work nicely before, with the exact same serial communications. The thing is that when both of these serial links are shunted and fake messages are published instead, the topics are published correctly, at the desired frequency. What I really don't get for now is that the behavior is not consistent if the agent is launched several times in a row without modifying the Teensy code... Sometimes the agent freezes, other times the topic for sonar messages isn't published... Well, I will let you know when it is more precise.
By the way, I tried to pull the latest version of micro-ROS Arduino this morning, but there is a build error:
cd ~/Arduino/libraries
git clone -b main https://github.com/micro-ROS/micro_ros_arduino.git
cd ~/Arduino/libraries/micro_ros_arduino
sudo docker pull microros/micro_ros_arduino_builder:rolling
sudo docker run -it --rm -v $(pwd):/arduino_project microros/micro_ros_arduino_builder:rolling -p teensy32
This last command failed:
Starting >>> rclc_lifecycle
--- stderr: rclc_lifecycle
CMake Warning at /uros_ws/firmware/mcu_ws/install/share/rcutils/cmake/ament_cmake_export_libraries-extras.cmake:116 (message):
Package 'rcutils' exports library 'dl' which couldn't be found
Call Stack (most recent call first):
/uros_ws/firmware/mcu_ws/install/share/rcutils/cmake/rcutilsConfig.cmake:41 (include)
/uros_ws/firmware/mcu_ws/install/share/rosidl_runtime_c/cmake/ament_cmake_export_dependencies-extras.cmake:21 (find_package)
/uros_ws/firmware/mcu_ws/install/share/rosidl_runtime_c/cmake/rosidl_runtime_cConfig.cmake:41 (include)
/uros_ws/firmware/mcu_ws/install/share/lifecycle_msgs/cmake/ament_cmake_export_dependencies-extras.cmake:21 (find_package)
/uros_ws/firmware/mcu_ws/install/share/lifecycle_msgs/cmake/lifecycle_msgsConfig.cmake:41 (include)
CMakeLists.txt:11 (find_package)
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c: In function 'rclc_make_node_a_lifecycle_node':
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:50:5: warning: passing argument 9 of 'rcl_lifecycle_state_machine_init' makes pointer from integer without a cast [-Wint-conversion]
true,
^
In file included from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/include/rclc_lifecycle/rclc_lifecycle.h:24:0,
from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:17:
/uros_ws/firmware/mcu_ws/install/include/rcl_lifecycle/rcl_lifecycle.h:244:1: note: expected 'const rcl_lifecycle_state_machine_options_t * {aka const struct rcl_lifecycle_state_machine_options_t *}' but argument is of type 'int'
rcl_lifecycle_state_machine_init(
^
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:41:23: error: too many arguments to function 'rcl_lifecycle_state_machine_init'
rcl_ret_t rcl_ret = rcl_lifecycle_state_machine_init(
^
In file included from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/include/rclc_lifecycle/rclc_lifecycle.h:24:0,
from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:17:
/uros_ws/firmware/mcu_ws/install/include/rcl_lifecycle/rcl_lifecycle.h:244:1: note: declared here
rcl_lifecycle_state_machine_init(
^
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c: In function 'rcl_lifecycle_node_fini':
/uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:233:13: error: too many arguments to function 'rcl_lifecycle_state_machine_fini'
rcl_ret = rcl_lifecycle_state_machine_fini(
^
In file included from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/include/rclc_lifecycle/rclc_lifecycle.h:24:0,
from /uros_ws/firmware/mcu_ws/uros/rclc/rclc_lifecycle/src/rclc_lifecycle/rclc_lifecycle.c:17:
/uros_ws/firmware/mcu_ws/install/include/rcl_lifecycle/rcl_lifecycle.h:278:1: note: declared here
rcl_lifecycle_state_machine_fini(
^
make[2]: *** [CMakeFiles/rclc_lifecycle.dir/build.make:66: CMakeFiles/rclc_lifecycle.dir/src/rclc_lifecycle/rclc_lifecycle.c.obj] Error 1
make[1]: *** [CMakeFiles/Makefile2:108: CMakeFiles/rclc_lifecycle.dir/all] Error 2
make: *** [Makefile:144: all] Error 2
---
Failed <<< rclc_lifecycle [1.12s, exited with code 2]
Thanks for your help!
Hello @anaelle-sw, this is an RCLC problem because they have not updated their RCL API usage in Rolling... These are the changes: https://github.com/ros2/rcl/commit/e9b588d1ed4ea11bd667c25fb35e231883cadc90
Could you please open an issue in RCLC copying this same error? I will assign the maintainer of that repo: https://github.com/ros2/rclc
Hi @pablogs9, I am still trying to debug the unstable communication between the agent and our ROS2 application. In the meantime, I found out that our ROS2 application behaves differently depending on the DDS vendor we use. My colleagues and I would like to compare the behavior of the whole robot using FastDDS and using CycloneDDS, so we can decide which DDS we will use. So I would need to launch the agent while forcing the RMW implementation to one or the other.
Is there a way to launch the agent from the docker (sudo docker run -it --rm --net=host -v /dev:/dev --privileged microros/micro-ros-agent:rolling serial --dev /dev/teensy_elodie -v6) and force a particular DDS vendor? There is the --middleware option, but it doesn't allow the usage of CycloneDDS, is that right?
If it is not possible, I will build the agent from sources, set the environment variable RMW_IMPLEMENTATION, and launch it with ros2 run. Are there any known issues about the micro-ROS agent using CycloneDDS?
Thanks for your help!
Hello @anaelle-sw, sorry, but we don't provide a micro-ROS Agent that uses CycloneDDS in any case. Since the micro-ROS Agent relies on eProsima's Micro XRCE-DDS Agent, it will always use eProsima's FastDDS as middleware.
In this case, if the rest of the system uses another DDS vendor, we have to rely on DDS inter-vendor compatibility.
Thanks for your answer! Knowing this, I suspect that the unstable communication between the agent and ROS2 may come from mixing CycloneDDS and FastDDS, as we currently use CycloneDDS by default. To confirm this, we need to use FastDDS and check whether we get the same behavior as with CycloneDDS. But we are having some trouble using FastDDS with our ROS2 application for now, because of callback groups (which are used widely in our use case). As this might take time, I guess this issue can be closed for the moment. I will open it again if similar problems happen when using FastDDS on both sides (micro-ROS agent and ROS2 application).
I do not understand the relation between the DDS implementation and the callback groups. Note that you may choose between FastDDS and CycloneDDS in both distributions, Foxy and Rolling, with just an environment variable, cf. https://index.ros.org/doc/ros2/Tutorials/Working-with-multiple-RMW-implementations/
Yes, I am using an environment variable to switch from one vendor to the other. All I know is that when launching some simple nodes with callback groups and using FastDDS, the behavior differs depending on where the callback group executor is spun, which doesn't happen with CycloneDDS. But for now, I don't understand what is really causing this either.
That's an interesting detail about the callback group executor that I didn't know. Have you already opened an issue on this in https://github.com/ros2/rclcpp/?
Not yet, but I surely will. I am trying to understand the actual problem more precisely first.
Hello micro-ROS team! I am having trouble opening a stable session between my board and the micro-ROS agent when it is launched from the Foxy docker.
Setup
Steps to reproduce
Build micro-ROS Arduino for Foxy:
The meta file I use is just a bit modified in order to be able to use re-connection features:
Then I launch the micro-ROS agent with the board connected:
Then I upload some examples on the board, for instance re-connection example or the publisher example.
Issue description
I noticed that the behavior of the agent is really unstable; it would often do one of the following:
1 - I sometimes cannot open a session. In this case, the output is just:
I would try several times to reload the code on the board, or to re-launch the agent. It can take a dozen tries before the session is established. It especially happens with the publisher example. Is it possible that the custom meta file is interfering with applications that don't use the re-connection features?
2 - Once the session is open, everything runs smoothly, the agent output shows all the messages being published and I can get them with ros2 topic echo. But after a while, the agent just freezes and no messages are passed anymore. There is no error, but the agent output is stopped. Maybe the first case is also some kind of agent freeze.
3 - After a few tries launching the agent, I noticed with ros2 topic node that the node running on the board was present twice. With htop, I noticed an agent running in a docker but I was not able to kill the process. After a computer reboot, it was gone. It only happened once.
4 - I encountered more "specific" errors. For instance: the re-connection example doesn't always work, or a custom node only publishes one topic instead of three. But these may be related to the previous cases, so I can detail them later on if it is still relevant.
Sorry, it is a bit vague. Both cases (1) and (2) are really easy to reproduce since they happen really frequently, but the freeze seems to happen randomly and I was not able to pinpoint what causes it.
Thanks a lot for support!