Services sometimes do not respond.
Could you define the problem you want to fix?
You mention reliability, so even after confirming that the service is available, it sometimes does not respond, right?
Yes. Even after confirming that the service is available, it sometimes does not respond.
Specifically, I want the clear costmap services to work reliably.
@daisukes
Thanks. I would like to see the stack trace or core dump from exactly when this hang-up happens, just to make sure it happens in the ROS 2 core stack rather than in the user implementation (in this case, the clear costmap services). Can you reproduce it with another service example, such as https://github.com/ros2/demos/tree/master/demo_nodes_cpp/src/services?
After making sure the navigation2 system is working, I ran the small program listed below to call the /global_costmap/clear_entirely_global_costmap service, stopped it after waiting 15 seconds, and got the backtrace below.
I also ran this snippet with a 3-second timeout hundreds of times. It succeeds randomly: about 5% of the time on my PC with a Core i7-9750H @ 2.60GHz, and about 70% on a server with a Core i9-7920X @ 2.90GHz.
The demo service example works perfectly.
$ ros2 run --prefix 'gdb -ex run --args' service_test service_test
...
Starting program: /home/developer/ros2_ws/install/service_test/lib/service_test/service_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6760700 (LWP 15611)]
[New Thread 0x7ffff5f5f700 (LWP 15612)]
[New Thread 0x7ffff575e700 (LWP 15613)]
[New Thread 0x7ffff4f5d700 (LWP 15614)]
[New Thread 0x7ffff475c700 (LWP 15615)]
[New Thread 0x7ffff3e9b700 (LWP 15616)]
[New Thread 0x7ffff369a700 (LWP 15617)]
[New Thread 0x7ffff2d6c700 (LWP 15618)]
^C
Thread 1 "service_test" received signal SIGINT, Interrupt.
futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffff45e0, clockid=<optimized out>, expected=0, futex_word=0x5555556d7e28)
at ../sysdeps/nptl/futex-internal.h:320
320 ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0 futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffff45e0, clockid=<optimized out>, expected=0, futex_word=0x5555556d7e28)
at ../sysdeps/nptl/futex-internal.h:320
#1 __pthread_cond_wait_common (abstime=0x7fffffff45e0, clockid=<optimized out>, mutex=0x5555556d7dd8, cond=0x5555556d7e00)
at pthread_cond_wait.c:520
#2 __pthread_cond_timedwait (cond=0x5555556d7e00, mutex=0x5555556d7dd8, abstime=0x7fffffff45e0) at pthread_cond_wait.c:656
#3 0x00007ffff7cc6a7f in __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#4 0x00007ffff7ccc9c4 in std::cv_status std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#5 0x00007ffff7ccab80 in std::cv_status std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#6 0x00007ffff7d351fe in bool std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#7 0x00007ffff7d33d7d in bool std::condition_variable::wait_for<long, std::ratio<1l, 1000000000l>, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#8 0x00007ffff7d3373c in rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#9 0x00007ffff7cbf0fe in rclcpp::ClientBase::wait_for_service_nanoseconds(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()
from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#10 0x000055555556da4c in bool rclcpp::ClientBase::wait_for_service<long, std::ratio<1l, 1000l> >(std::chrono::duration<long, std::ratio<1l, 1000l> >) ()
#11 0x0000555555569ec0 in main ()
#include "nav2_costmap_2d/clear_costmap_service.hpp"
#include <chrono>
#include <cstdlib>
#include <memory>
using namespace std::chrono_literals;
int main(int argc, char **argv)
{
rclcpp::init(argc, argv);
std::shared_ptr<rclcpp::Node> node = rclcpp::Node::make_shared("clear_costmap_client");
auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>("/global_costmap/clear_entirely_global_costmap");
auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();
if (!client->wait_for_service()) { // change timeout for repeat
if (!rclcpp::ok()) {
RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Interrupted while waiting for the service. Exiting.");
return 0;
}
RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "service not available.");
return 0;
}
auto result = client->async_send_request(request);
// Wait for the result.
if (rclcpp::spin_until_future_complete(node, result) ==
rclcpp::executor::FutureReturnCode::SUCCESS)
{
RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Succeeded");
} else {
RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Failed");
}
rclcpp::shutdown();
return 0;
}
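For the repeated trials with a 3-second timeout mentioned above, the unbounded wait in the snippet would be replaced by a bounded one. A minimal sketch of that change, reusing the client from the program above (the exact handling around it is an assumption, not the code that was actually run):

  // Sketch only: bounded variant of the wait used for the repeated trials.
  // wait_for_service(3s) returns false when the server never looks available
  // within the timeout, which is the failure mode counted in the trial statistics.
  if (!client->wait_for_service(3s)) {
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "service not available within 3s.");
    return 0;
  }
  auto result = client->async_send_request(request);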
https://github.com/ros2/ros2/issues/922 looks the same as this problem.
Indeed. @daisukes, would you mind trying https://github.com/ros2/rmw_fastrtps/pull/390? It should mitigate the issue significantly.
@fujitatomoya @hidmic Thank you for the information. I tried that branch of rmw_fastrtps; however, it did not improve things much: 15 successes out of 213 trials, about 7%.
I found something, though I am still not sure how it relates to the problem. It looks like success or failure is determined when the node is created. If I keep using a node that fails, I never get a success, even when calling wait_for_service multiple times and recreating the service clients. However, if I happen to get a node that works, it keeps succeeding.
FYI @MiguelCompany
According to https://github.com/ros2/rclcpp/issues/1152#issuecomment-637261021, this is just the client stack: a thread waiting for the graph event that signals service availability. So from the requester's point of view, the service (apparently) IS NOT ready, and the rest of the trace is triggered by the SIGINT signal.
Maybe we could check the service client/server debug logs to see the difference? Any thoughts?
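One way to tell whether the requester ever sees the server as available at all (as opposed to a request being lost after discovery) might be to poll service_is_ready() instead of blocking on the graph event. A small self-contained sketch (node name and the 15-second polling window are illustrative, not project code):

#include <chrono>
#include "rclcpp/rclcpp.hpp"
#include "nav2_msgs/srv/clear_entire_costmap.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("clear_costmap_probe");
  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
    "/global_costmap/clear_entirely_global_costmap");

  // Poll for up to ~15 s instead of blocking inside wait_for_graph_change().
  // If this never reports the server as visible, requester-side discovery is
  // the problem, not the request/response exchange itself.
  for (int i = 0; i < 30; ++i) {
    if (client->service_is_ready()) {
      RCLCPP_INFO(node->get_logger(), "server visible after %d polls", i);
      break;
    }
    rclcpp::sleep_for(500ms);
  }

  rclcpp::shutdown();
  return 0;
}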
Requests sent from the client are sometimes taken by the server, but sometimes not.
planner_server_19898_1591165395685.log service_test_20379_1591165406625.log
The service_test program creates a node and a client, then calls async_send_request without wait_for_service. The excerpts below are from the service_test (client) and planner_server (server) debug logs: first a call that succeeded, then one that failed.
[DEBUG] [1591165419.638036511] [rcl]: Client initialized
[DEBUG] [1591165419.638107866] [rcl]: Client sending service request
[DEBUG] [1591165419.638214040] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '0' timers, '0' clients, '0' services
[DEBUG] [1591165419.638345176] [rcl]: Waiting with timeout: 0s + 500000000ns
[DEBUG] [1591165419.638375476] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.638397164] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.638415441] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.638495935] [rcl]: Waiting with timeout: 0s + 499806226ns
[DEBUG] [1591165419.638511993] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.721666039] [rcl]: Client in wait set is ready
[DEBUG] [1591165419.721823218] [rcl]: Client taking service response
[DEBUG] [1591165419.722010101] [rcl]: Client take response succeeded: true
[INFO] [1591165419.722545187] [rclcpp]: 0 Successed
[DEBUG] [1591165419.711462786] [rcl]: Subscription taking message
[DEBUG] [1591165419.711502554] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165419.711638796] [rcl]: Waiting without timeout
[DEBUG] [1591165419.711654606] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.711665458] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.711718538] [rcl]: Waiting without timeout
[DEBUG] [1591165419.711729317] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.720513910] [rcl]: Service in wait set is ready
[DEBUG] [1591165419.720638429] [rcl]: Service server taking service request
[DEBUG] [1591165419.720692414] [rcl]: Service take request succeeded: true
[INFO] [1591165419.720715644] [global_costmap.global_costmap]: Received request to clear entirely the global_costmap
[DEBUG] [1591165419.721430447] [rcl]: Sending service response
[DEBUG] [1591165419.721573135] [rcl]: Waiting without timeout
[DEBUG] [1591165419.721592867] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.721614984] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.721699911] [rcl]: Waiting without timeout
[DEBUG] [1591165419.721718055] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.264981778] [rcl]: Client initialized
[DEBUG] [1591165420.265018811] [rcl]: Client sending service request
[DEBUG] [1591165420.265092068] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '0' timers, '0' clients, '0' services
[DEBUG] [1591165420.265168123] [rcl]: Waiting with timeout: 0s + 500000000ns
[DEBUG] [1591165420.265181556] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.265192436] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.265199093] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.265296772] [rcl]: Waiting with timeout: 0s + 499847515ns
[DEBUG] [1591165420.265308321] [rcl]: Timeout calculated based on next scheduled timer: false
[ERROR] [1591165420.765323394] [rclcpp]: 0 Failed
[DEBUG] [1591165420.261266585] [rcl]: Subscription in wait set is ready
[DEBUG] [1591165420.261353890] [rcl]: Subscription taking message
[DEBUG] [1591165420.261383022] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165420.261468070] [rcl]: Waiting without timeout
[DEBUG] [1591165420.261477170] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.261484102] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.261507740] [rcl]: Waiting without timeout
[DEBUG] [1591165420.261516860] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.281158735] [rcl]: Subscription in wait set is ready
[DEBUG] [1591165420.281240421] [rcl]: Subscription taking message
[DEBUG] [1591165420.281267092] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165420.281324523] [rcl]: Waiting without timeout
[DEBUG] [1591165420.281331477] [rcl]: Timeout calculated based on next scheduled timer: false
It works well if I use Fast-RTPS v1.10.0 instead of 2.0.x, which was switched to this April: https://github.com/ros2/ros2/commit/c7c115a7f9d78985903bac05d7dd0d886b808b61
Requests sent from the client are sometimes taken by the server, but sometimes not.
Interesting.
The service_test program creates a node and a client, then calls async_send_request without wait_for_service.
I agree that a client hanging isn't a nice way to fail a service call, but if you don't wait for the service server the chances of the response future ever becoming ready are pretty low.
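For reference, the commonly used rclcpp client pattern is to wait in short, bounded slices before sending the request. A minimal sketch of that pattern against the add_two_ints demo service mentioned earlier (service name and request values are just for illustration):

#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("add_two_ints_client");
  auto client = node->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");

  // Wait in 1-second slices so Ctrl-C is noticed and the wait never hangs silently.
  while (!client->wait_for_service(1s)) {
    if (!rclcpp::ok()) {
      RCLCPP_ERROR(node->get_logger(), "Interrupted while waiting for the service. Exiting.");
      return 1;
    }
    RCLCPP_INFO(node->get_logger(), "service not available, waiting again...");
  }

  auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
  request->a = 1;
  request->b = 2;

  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, future) ==
    rclcpp::executor::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(node->get_logger(), "Sum: %ld", future.get()->sum);
  } else {
    RCLCPP_ERROR(node->get_logger(), "service call failed");
  }

  rclcpp::shutdown();
  return 0;
}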
Closing in favor of https://github.com/ros2/rmw_fastrtps/issues/392, since these appear to be duplicates and rmw_fastrtps is a more specific repo.
Hello @daisukes,
Did you eventually solve it? I'm struggling with the same issue (I think it's the same). Some services just don't receive / respond to calls... I'm running ROS 2 Galactic on Ubuntu 20.04 with Docker as well.
Hello @omertal88,
I have been using the ros:galactic image on Docker Hub and have had no problems so far.
I remember that we used to use Fast-RTPS (Fast-DDS) with Foxy and needed to downgrade it until the bug was fixed.
The default DDS was changed from Fast-DDS in Foxy to CycloneDDS in Galactic. https://discourse.ros.org/t/community-feedback-default-rmw-for-ros-2-galactic/16450
Have you checked different DDS implementations?
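If it helps, one quick way to double-check which RMW a given process actually loaded (rather than what the environment was expected to select) is to print the implementation identifier; a small sketch:

#include <cstdio>
#include "rclcpp/rclcpp.hpp"
#include "rmw/rmw.h"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  // Prints e.g. rmw_fastrtps_cpp or rmw_cyclonedds_cpp,
  // i.e. whichever middleware this process is really using.
  std::printf("rmw implementation: %s\n", rmw_get_implementation_identifier());
  rclcpp::shutdown();
  return 0;
}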
Wow, so I changed the RMW to use CycloneDDS and the issues were gone! The thing is, I have some devices that run micro-ROS, which only supports Fast-RTPS, and I hope I won't experience any issues with those. Also, I couldn't find documentation on how to configure discovery for CycloneDDS (I need to use unicast with devices that are connected wirelessly). Do you happen to know how?
@daisukes So in fact there are issues when calling services from CycloneDDS to Fast-RTPS: the response isn't returned. I found this open bug, https://github.com/ros2/rmw_cyclonedds/issues/184, and it seems there hasn't been any progress on it. I may try to convert my services to simple publishers + subscribers.
Bug report
*Required Info: Related to https://github.com/ros-planning/navigation2/issues/1788*
Steps to reproduce issue
Expected behavior
Services should respond.
Actual behavior
Services sometimes do not respond.
With my three environments described here, the call sometimes succeeds and sometimes fails. I am not sure, but it seems to depend on CPU speed. I found this issue while trying to use the clear costmap services.
Additional information
I have not checked all services in Navigation2, but at least some of them do not respond.