ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0

Unreliable services #1152

Closed daisukes closed 4 years ago

daisukes commented 4 years ago

Bug report

*Required Info: Related to https://github.com/ros-planning/navigation2/issues/1788*

Steps to reproduce issue

# Launch tb3 simulation and provide an initial location to start lifecycle
$ ros2 launch nav2_bringup tb3_simulation_launch.py

# Then call service like
$ ros2 service call /global_costmap/clear_entirely_global_costmap nav2_msgs/srv/ClearEntireCostmap "{}"

Expected behavior

Service calls should receive a response.

Actual behavior

Service calls sometimes receive no response.

In the three environments described here, the call sometimes succeeds and sometimes fails. I am not sure, but it seems to depend on CPU speed. I found this issue while trying to use the clear costmap services.

Additional information

I have not checked all services in Navigation2, but at least some of the services do not respond.

fujitatomoya commented 4 years ago

Service calls sometimes receive no response.

Could you define the problem you want to fix?

You mention reliability, so after making sure the service is available, the service sometimes does not respond, right?

daisukes commented 4 years ago

Yes. After making sure the service is available, the service sometimes does not respond.

Specifically, I want the clear costmap services to work reliably.

fujitatomoya commented 4 years ago

@daisukes

Thanks. I would like to see the stack trace or core dump from exactly when this hang happens, just to make sure whether it occurs in the ROS 2 core stack or in the user implementation (in this case, the clear costmap services). Can you reproduce it with another service example such as https://github.com/ros2/demos/tree/master/demo_nodes_cpp/src/services?

daisukes commented 4 years ago

After making sure the navigation2 system is working, I ran the small program listed below to call a service (/global_costmap/clear_entirely_global_costmap), stopped it after waiting 15 seconds, and got the backtrace.

I also ran this snippet with a 3-second timeout hundreds of times. It succeeds randomly: about 5% of the time on my PC with a Core i7-9750H @ 2.60GHz and about 70% of the time on a server with a Core i9-7920X @ 2.90GHz.

The demo service example works perfectly.

$ ros2 run --prefix 'gdb -ex run --args' service_test service_test
...
Starting program: /home/developer/ros2_ws/install/service_test/lib/service_test/service_test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6760700 (LWP 15611)]
[New Thread 0x7ffff5f5f700 (LWP 15612)]
[New Thread 0x7ffff575e700 (LWP 15613)]
[New Thread 0x7ffff4f5d700 (LWP 15614)]
[New Thread 0x7ffff475c700 (LWP 15615)]
[New Thread 0x7ffff3e9b700 (LWP 15616)]
[New Thread 0x7ffff369a700 (LWP 15617)]
[New Thread 0x7ffff2d6c700 (LWP 15618)]
^C
Thread 1 "service_test" received signal SIGINT, Interrupt.
futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffff45e0, clockid=<optimized out>, expected=0, futex_word=0x5555556d7e28)
    at ../sysdeps/nptl/futex-internal.h:320
320 ../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) bt
#0  futex_abstimed_wait_cancelable (private=<optimized out>, abstime=0x7fffffff45e0, clockid=<optimized out>, expected=0, futex_word=0x5555556d7e28)
    at ../sysdeps/nptl/futex-internal.h:320
#1  __pthread_cond_wait_common (abstime=0x7fffffff45e0, clockid=<optimized out>, mutex=0x5555556d7dd8, cond=0x5555556d7e00)
    at pthread_cond_wait.c:520
#2  __pthread_cond_timedwait (cond=0x5555556d7e00, mutex=0x5555556d7dd8, abstime=0x7fffffff45e0) at pthread_cond_wait.c:656
#3  0x00007ffff7cc6a7f in __gthread_cond_timedwait(pthread_cond_t*, pthread_mutex_t*, timespec const*) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#4  0x00007ffff7ccc9c4 in std::cv_status std::condition_variable::__wait_until_impl<std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#5  0x00007ffff7ccab80 in std::cv_status std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#6  0x00007ffff7d351fe in bool std::condition_variable::wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#7  0x00007ffff7d33d7d in bool std::condition_variable::wait_for<long, std::ratio<1l, 1000000000l>, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}>(std::unique_lock<std::mutex>&, std::chrono::duration<long, std::ratio<1l, 1000000000l> > const&, rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)::{lambda()#1}) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#8  0x00007ffff7d3373c in rclcpp::node_interfaces::NodeGraph::wait_for_graph_change(std::shared_ptr<rclcpp::Event>, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#9  0x00007ffff7cbf0fe in rclcpp::ClientBase::wait_for_service_nanoseconds(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()
   from /opt/ros/foxy/rclcpp/lib/librclcpp.so
#10 0x000055555556da4c in bool rclcpp::ClientBase::wait_for_service<long, std::ratio<1l, 1000l> >(std::chrono::duration<long, std::ratio<1l, 1000l> >) ()
#11 0x0000555555569ec0 in main ()
#include "nav2_costmap_2d/clear_costmap_service.hpp"

#include <chrono>
#include <cstdlib>
#include <memory>

using namespace std::chrono_literals;

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);

  std::shared_ptr<rclcpp::Node> node = rclcpp::Node::make_shared("clear_costmap_client");
  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>("/global_costmap/clear_entirely_global_costmap");
  auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();

  if (!client->wait_for_service()) { // no timeout here; a 3-second timeout was used for the repeated trials
    if (!rclcpp::ok()) {
      RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Interrupted while waiting for the service. Exiting.");
      return 0;
    }
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "service not available.");
    return 0;
  }

  auto result = client->async_send_request(request);
  // Wait for the result.
  if (rclcpp::spin_until_future_complete(node, result) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(rclcpp::get_logger("rclcpp"), "Succeeded");
  } else {
    RCLCPP_ERROR(rclcpp::get_logger("rclcpp"), "Failed");
  }

  rclcpp::shutdown();
  return 0;
}

fujitatomoya commented 4 years ago

https://github.com/ros2/ros2/issues/922 looks like the same problem.

hidmic commented 4 years ago

Indeed. @daisukes, would you mind trying https://github.com/ros2/rmw_fastrtps/pull/390? It should mitigate the issue significantly.

daisukes commented 4 years ago

@fujitatomoya @hidmic Thank you for the information. I tried that branch of rmw_fastrtps; however, it did not improve much: 15 successes out of 213 trials, about 7%.

I found one thing, though I am still not sure how it relates to the problem. It looks like success or failure is determined when the node is created. If I keep using a node that causes failure, I could not get a single success by calling wait_for_service multiple times, even after recreating the service clients. However, if I happen to get a node that works, it keeps succeeding.
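
A rough sketch of the workaround this observation suggests, assuming it is acceptable to tear the whole node down and recreate it when the wait times out (the node name, retry count, and timeouts are made up for illustration, and whether a recreated node actually gets a fresh DDS participant depends on the RMW and context setup):

#include "rclcpp/rclcpp.hpp"
#include "nav2_msgs/srv/clear_entire_costmap.hpp"

#include <chrono>
#include <memory>

using namespace std::chrono_literals;

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);

  // If success or failure really is decided when the node is created, then
  // recreating the node (not just the client) after a failed wait gives each
  // retry a fresh chance to end up with a "working" node.
  for (int attempt = 0; attempt < 5 && rclcpp::ok(); ++attempt) {
    auto node = rclcpp::Node::make_shared("clear_costmap_client_retry");
    auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
      "/global_costmap/clear_entirely_global_costmap");

    if (!client->wait_for_service(3s)) {
      RCLCPP_WARN(node->get_logger(), "attempt %d: service not seen, recreating node", attempt);
      continue;  // node and client are destroyed here and recreated next iteration
    }

    auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();
    auto future = client->async_send_request(request);
    if (rclcpp::spin_until_future_complete(node, future, 3s) ==
      rclcpp::FutureReturnCode::SUCCESS)
    {
      RCLCPP_INFO(node->get_logger(), "Succeeded");
      break;
    }
    RCLCPP_ERROR(node->get_logger(), "attempt %d: no response, recreating node", attempt);
  }

  rclcpp::shutdown();
  return 0;
}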

hidmic commented 4 years ago

FYI @MiguelCompany

fujitatomoya commented 4 years ago

According to https://github.com/ros2/rclcpp/issues/1152#issuecomment-637261021, that is just the client stack: a thread waiting for the event that signals service availability. So from the requester's point of view, the service (apparently) IS NOT ready, and the rest is the trace triggered by the SIGINT signal.

Maybe we could check the service client/server debug logs to see the difference? Any thoughts?
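
One way to get that rcl-level debug output from the client side is to raise the severity of the "rcl" logger programmatically; a minimal sketch (the logger name and severity constant come from rcutils, the rest is illustrative):

#include "rclcpp/rclcpp.hpp"
#include "rcutils/logging.h"

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);

  // Enable DEBUG output for the "rcl" logger so the client prints lines like
  // "Client sending service request" and "Client in wait set is ready".
  if (rcutils_logging_set_logger_level("rcl", RCUTILS_LOG_SEVERITY_DEBUG) != RCUTILS_RET_OK) {
    RCLCPP_WARN(rclcpp::get_logger("rclcpp"), "could not set rcl logger level");
  }

  // ... create the node and client and send the request as in the listing above ...

  rclcpp::shutdown();
  return 0;
}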

daisukes commented 4 years ago

Requests sent from the client are sometimes taken by the server, but sometimes not.

planner_server_19898_1591165395685.log service_test_20379_1591165406625.log

The service_test program creates a node and a client, then calls async_send_request without wait_for_service.
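
Roughly what each trial does, as a sketch (the 500 ms bound is an assumption matching the "Waiting with timeout: 0s + 500000000ns" lines in the client logs):

#include "rclcpp/rclcpp.hpp"
#include "nav2_msgs/srv/clear_entire_costmap.hpp"

#include <chrono>
#include <memory>

using namespace std::chrono_literals;

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);

  auto node = rclcpp::Node::make_shared("clear_costmap_client");
  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
    "/global_costmap/clear_entirely_global_costmap");

  // Send immediately, without wait_for_service, and give the response a
  // bounded window to arrive.
  auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();
  auto future = client->async_send_request(request);
  if (rclcpp::spin_until_future_complete(node, future, 500ms) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(node->get_logger(), "Succeeded");
  } else {
    RCLCPP_ERROR(node->get_logger(), "Failed");
  }

  rclcpp::shutdown();
  return 0;
}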

success case

client

[DEBUG] [1591165419.638036511] [rcl]: Client initialized
[DEBUG] [1591165419.638107866] [rcl]: Client sending service request
[DEBUG] [1591165419.638214040] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '0' timers, '0' clients, '0' services
[DEBUG] [1591165419.638345176] [rcl]: Waiting with timeout: 0s + 500000000ns
[DEBUG] [1591165419.638375476] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.638397164] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.638415441] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.638495935] [rcl]: Waiting with timeout: 0s + 499806226ns
[DEBUG] [1591165419.638511993] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.721666039] [rcl]: Client in wait set is ready
[DEBUG] [1591165419.721823218] [rcl]: Client taking service response
[DEBUG] [1591165419.722010101] [rcl]: Client take response succeeded: true
[INFO] [1591165419.722545187] [rclcpp]: 0 Successed

server

[DEBUG] [1591165419.711462786] [rcl]: Subscription taking message
[DEBUG] [1591165419.711502554] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165419.711638796] [rcl]: Waiting without timeout
[DEBUG] [1591165419.711654606] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.711665458] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.711718538] [rcl]: Waiting without timeout
[DEBUG] [1591165419.711729317] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.720513910] [rcl]: Service in wait set is ready
[DEBUG] [1591165419.720638429] [rcl]: Service server taking service request
[DEBUG] [1591165419.720692414] [rcl]: Service take request succeeded: true
[INFO] [1591165419.720715644] [global_costmap.global_costmap]: Received request to clear entirely the global_costmap
[DEBUG] [1591165419.721430447] [rcl]: Sending service response
[DEBUG] [1591165419.721573135] [rcl]: Waiting without timeout
[DEBUG] [1591165419.721592867] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165419.721614984] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165419.721699911] [rcl]: Waiting without timeout
[DEBUG] [1591165419.721718055] [rcl]: Timeout calculated based on next scheduled timer: false

failure case

client

[DEBUG] [1591165420.264981778] [rcl]: Client initialized
[DEBUG] [1591165420.265018811] [rcl]: Client sending service request
[DEBUG] [1591165420.265092068] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '0' timers, '0' clients, '0' services
[DEBUG] [1591165420.265168123] [rcl]: Waiting with timeout: 0s + 500000000ns
[DEBUG] [1591165420.265181556] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.265192436] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.265199093] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.265296772] [rcl]: Waiting with timeout: 0s + 499847515ns
[DEBUG] [1591165420.265308321] [rcl]: Timeout calculated based on next scheduled timer: false
[ERROR] [1591165420.765323394] [rclcpp]: 0 Failed

server

[DEBUG] [1591165420.261266585] [rcl]: Subscription in wait set is ready
[DEBUG] [1591165420.261353890] [rcl]: Subscription taking message
[DEBUG] [1591165420.261383022] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165420.261468070] [rcl]: Waiting without timeout
[DEBUG] [1591165420.261477170] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.261484102] [rcl]: Guard condition in wait set is ready
[DEBUG] [1591165420.261507740] [rcl]: Waiting without timeout
[DEBUG] [1591165420.261516860] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1591165420.281158735] [rcl]: Subscription in wait set is ready
[DEBUG] [1591165420.281240421] [rcl]: Subscription taking message
[DEBUG] [1591165420.281267092] [rcl]: Subscription take succeeded: true
[DEBUG] [1591165420.281324523] [rcl]: Waiting without timeout
[DEBUG] [1591165420.281331477] [rcl]: Timeout calculated based on next scheduled timer: false

daisukes commented 4 years ago

It works well if I use Fast-RTPS v1.10.0 instead of 2.0.x, which ros2 switched to this April: https://github.com/ros2/ros2/commit/c7c115a7f9d78985903bac05d7dd0d886b808b61

hidmic commented 4 years ago

Requests sent from the client are sometimes taken by the server, but sometimes not.

Interesting.

The service_test program creates a node and a client, then calls async_send_request without wait_for_service.

I agree that a client hanging isn't a nice way to fail a service call, but if you don't wait for the service server, the chances of the response future ever becoming ready are pretty low.
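
For reference, the usual client pattern looks roughly like this; a sketch in the style of the rclcpp service client tutorial rather than the exact code used here, waiting for the server in bounded chunks before sending:

#include "rclcpp/rclcpp.hpp"
#include "nav2_msgs/srv/clear_entire_costmap.hpp"

#include <chrono>
#include <memory>

using namespace std::chrono_literals;

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);

  auto node = rclcpp::Node::make_shared("clear_costmap_client");
  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
    "/global_costmap/clear_entirely_global_costmap");

  // Wait in bounded chunks so a Ctrl-C is noticed promptly, and only send
  // once the server is actually advertised.
  while (!client->wait_for_service(1s)) {
    if (!rclcpp::ok()) {
      RCLCPP_ERROR(node->get_logger(), "interrupted while waiting for the service");
      rclcpp::shutdown();
      return 1;
    }
    RCLCPP_INFO(node->get_logger(), "service not available yet, waiting again...");
  }

  auto future = client->async_send_request(
    std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>());
  if (rclcpp::spin_until_future_complete(node, future) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(node->get_logger(), "Succeeded");
  } else {
    RCLCPP_ERROR(node->get_logger(), "Failed");
  }

  rclcpp::shutdown();
  return 0;
}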

sloretz commented 4 years ago

Closing in favor of https://github.com/ros2/rmw_fastrtps/issues/392 since these appear to be duplicates, and rmw_fastrtps is a more specific repo.

omertal88 commented 3 years ago

Hello @daisukes ,

Did you eventually solve it? I'm struggling with the same issue (I think it's the same): some services just don't receive or respond to calls. I'm running ROS 2 Galactic on Ubuntu 20.04 with Docker as well.

daisukes commented 3 years ago

Hello @omertal88,

I have been using the ros:galactic image from Docker Hub and have had no problems so far.

I remember that we used Fast-RTPS (Fast DDS) on Foxy and needed to downgrade it until the bug was fixed.

The default DDS implementation changed from Fast DDS in Foxy to Cyclone DDS in Galactic. https://discourse.ros.org/t/community-feedback-default-rmw-for-ros-2-galactic/16450

Have you checked different DDS implementations?

omertal88 commented 3 years ago

Wow, so I changed the RMW to use CycloneDDS and the issues were gone! The thing is, I have some devices that run micro-ROS, which only supports Fast-RTPS. I hope I won't experience any issues with those devices. Also, I couldn't find documentation on how to configure CycloneDDS discovery (I need to use unicast with devices that are connected wirelessly). Do you happen to know how?

omertal88 commented 3 years ago

@daisukes So in fact there are issues when calling services from CycloneDDS to Fast-RTPS: the response isn't returned. I found this open bug: https://github.com/ros2/rmw_cyclonedds/issues/184, and it seems there hasn't been any progress on it. I may try to convert my services to simple publishers + subscribers.
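
Roughly what I mean on the client side, as a sketch; the topic names and the std_msgs/Empty message are placeholders, and the server side would need matching changes:

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/empty.hpp"

#include <memory>

// Emulates a "clear costmap" service call with a request publisher and an
// acknowledgement subscription, avoiding the service layer entirely.
class ClearCostmapTopicClient : public rclcpp::Node
{
public:
  ClearCostmapTopicClient()
  : Node("clear_costmap_topic_client")
  {
    pub_ = create_publisher<std_msgs::msg::Empty>("clear_costmap/request", 10);
    sub_ = create_subscription<std_msgs::msg::Empty>(
      "clear_costmap/ack", 10,
      [this](std_msgs::msg::Empty::SharedPtr) {
        RCLCPP_INFO(get_logger(), "costmap cleared (ack received)");
      });
  }

  void request_clear() {pub_->publish(std_msgs::msg::Empty());}

private:
  rclcpp::Publisher<std_msgs::msg::Empty>::SharedPtr pub_;
  rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr sub_;
};

int main(int argc, char **argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<ClearCostmapTopicClient>();
  node->request_clear();
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}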