osrf / capabilities

Implements the concept of capabilities as part of the robots-in-concert system.
Other
8 stars 26 forks source link

Stopping multiple capabilities can cause errors #68

Open bit-pirate opened 10 years ago

bit-pirate commented 10 years ago

Currently testing the rocon_apps/teleop rapp with capability support and noticed a strange error when shutting down the rapp and its dependent capabilities.

[INFO] [WallTime: 1399886941.441992] Rapp Manager : Stopping rapp 'rocon_apps/teleop'.
[turtlebot/relay_video-3] killing on exit
[turtlebot/relay_cmd_vel-2] killing on exit
[turtlebot/teleop_velocity_smoother-1] killing on exit
[ INFO] [1399886941.469684467]: Unloading nodelet /turtlebot/teleop_velocity_smoother from manager /capability_server_nodelet_manager
[INFO] [WallTime: 1399886942.874621] Rapp Manager : stopped rapp [rocon_apps/teleop]'.
[INFO] [WallTime: 1399886942.875491] Rapp Manager : Stopping required capabilities.
[INFO] [WallTime: 1399886942.884842] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id 'b054c6ae-d9b7-11e3-b584-2cd05ab796d3')
[INFO] [WallTime: 1399886942.885948] Capability 'std_capabilities/DifferentialMobileBase' being stopped because it has zero references
[depthimage_to_laserscan-18] killing on exit
[camera/disparity_registered_hw-17] killing on exit
[camera/disparity_registered_sw-16] killing on exit
[camera/disparity_depth-15] killing on exit
[camera/points_xyzrgb_hw_registered-14] killing on exit
 [camera/points_xyzrgb_sw_registered-12] killing on exit
[camera/depth_registered_rectify_depth-13] killing on exit
[camera/register_depth_rgb-11] killing on exit
[camera/depth_points-10] killing on exit
[camera/depth_metric-9] killing on exit
[camera/depth_metric_rect-8] killing on exit
[camera/depth_rectify_depth-7] killing on exit
[camera/rectify_ir-6] killing on exit
[camera/rectify_color-5] killing on exit
[camera/rectify_mono-4] killing on exit
[camera/debayer-3] killing on exit
[camera/driver-2] killing on exit
[camera/camera_nodelet_manager-1] killing on exit
[camera/camera_nodelet_manager-1] escalating to SIGTERM
shutting down processing monitor...
... shutting down processing monitor complete
done
[turtlebot_bringup-2] killing on exit
Traceback (most recent call last):
  File "/home/yujin/ros_ws/turtlebot_ws/src/turtlebot/turtlebot_capabilities/providers/launch/placeholder.py", line 6, in <module>
    sleep(0.1)
KeyboardInterrupt
shutting down processing monitor...
... shutting down processing monitor complete
done
[robot_state_publisher-1] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
[ERROR] [WallTime: 1399886960.565455] Capability Instance 'turtlebot_capabilities/rgbd_sensor' cannot transition to 'stopping' from anything but 'launching' or 'running', current state is 'stopping'                                                                    
[INFO] [WallTime: 1399886960.566842] Capability Provider 'turtlebot_capabilities/rgbd_sensor' for Capability 'std_capabilities/RGBDSensor' has terminated.
[INFO] [WallTime: 1399886960.567999] Capability Provider 'turtlebot_capabilities/turtlebot2_bringup' for Capability 'turtlebot_capabilities/TurtleBotBringup' has terminated.
[INFO] [WallTime: 1399886960.568654] Stopping the 'turtlebot_capabilities/diagnostics' provider of the 'std_capabilities/Diagnostics' interface, because it has no dependents left.
[ERROR] [WallTime: 1399886960.570503] RuntimeError: No running launch file with PID of '815'
[diagnostic_aggregator-1] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
[INFO] [WallTime: 1399886961.621882] Capability Provider 'turtlebot_capabilities/robot_state_publisher' for Capability 'std_capabilities/RobotStatePublisher' has terminated.
[INFO] [WallTime: 1399886961.622872] Capability Provider 'turtlebot_capabilities/diagnostics' for Capability 'std_capabilities/Diagnostics' has terminated.
^C[zeroconf/zeroconf-6] killing on exit
[interactions-5] killing on exit
[app_manager-3] killing on exit
 [master-4] killing on exit
[capability_server-2] killing on exit
Traceback (most recent call last):
  File "/home/yujin/ros_ws/rocon_ws/src/capabilities/src/capabilities/placeholder_script", line 6, in <module>
    sleep(0.1)
KeyboardInterrupt
[ERROR] [WallTime: 1399887012.879617] Capability Instance 'kobuki_capabilities/kobuki_differential_mobile_base' terminated unexpectedly, it was previously in the 'running' state.
[INFO] [WallTime: 1399887012.880260] Capability Provider 'kobuki_capabilities/kobuki_differential_mobile_base' for Capability 'std_capabilities/DifferentialMobileBase' has terminated.
[INFO] [WallTime: 1399887012.880685] Stopping the 'kobuki_capabilities/kobuki_bringup' provider of the 'kobuki_capabilities/KobukiBringup' interface, because it has no dependents left.

So, here an error is thrown when the RGBDSensor capability is requested to be stopped, while it is in the stopping state. This error is reported in the service response, which makes the rapp manager abort the stopping of the other caps.

I see two improvements required.

I'll look into the stopping capability issue in the rapp manager (https://github.com/robotics-in-concert/rocon_app_platform/issues/233). @wjwwood could you take a look at the capability server part of this issue?

bit-pirate commented 10 years ago

After I implemented better exception handling in the rapp manager, I can see that there seems to be a mix up with the launchers PIDs:

[INFO] [WallTime: 1399896717.099010] Rapp Manager : stopped rapp [rocon_apps/teleop]'.
[INFO] [WallTime: 1399896717.099275] Rapp Manager : Stopping required capabilities.
[INFO] [WallTime: 1399896717.102842] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id '85351b84-d9cd-11e3-b0f9-f46d04929542')
[INFO] [WallTime: 1399896717.103181] Capability 'std_capabilities/DifferentialMobileBase' being stopped because it has zero references
[depthimage_to_laserscan-18] killing on exit
[camera/disparity_registered_hw-17] killing on exit
[camera/disparity_registered_sw-16] killing on exit
 [camera/points_xyzrgb_hw_registered-14] killing on exit
[camera/disparity_depth-15] killing on exit
 [camera/depth_registered_rectify_depth-13] killing on exit
[camera/points_xyzrgb_sw_registered-12] killing on exit
[camera/register_depth_rgb-11] killing on exit
[camera/depth_points-10] killing on exit
[camera/depth_metric-9] killing on exit
[camera/depth_metric_rect-8] killing on exit
[camera/depth_rectify_depth-7] killing on exit
[camera/rectify_ir-6] killing on exit
[camera/rectify_color-5] killing on exit
[camera/rectify_mono-4] killing on exit
[camera/debayer-3] killing on exit
[camera/driver-2] killing on exit
[camera/camera_nodelet_manager-1] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
[INFO] [WallTime: 1399896719.540929] Capability Provider 'turtlebot_capabilities/rgbd_sensor' for Capability 'std_capabilities/RGBDSensor' has terminated.
[turtlebot_bringup-2] killing on exit
Traceback (most recent call last):
  File "/opt/kobuki_workspace/turtlebot/src/turtlebot/turtlebot_capabilities/providers/launch/placeholder.py", line 6, in <module>
    sleep(0.1)
KeyboardInterrupt
shutting down processing monitor...
... shutting down processing monitor complete
done
[robot_state_publisher-1] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete
done
[ERROR] [WallTime: 1399896720.232718] Capability Instance 'turtlebot_capabilities/turtlebot2_bringup' cannot transition to 'stopping' from anything but 'launching' or 'running', current state is 'stopping'
[INFO] [WallTime: 1399896720.233108] Capability Provider 'turtlebot_capabilities/turtlebot2_bringup' for Capability 'turtlebot_capabilities/TurtleBotBringup' has terminated.
[INFO] [WallTime: 1399896720.234642] Stopping the 'turtlebot_capabilities/diagnostics' provider of the 'std_capabilities/Diagnostics' interface, because it has no dependents left.
[ERROR] [WallTime: 1399896720.235318] RuntimeError: No running launch file with PID of '13277'
[INFO] [WallTime: 1399896720.240342] Request to free usage of capability 'std_capabilities/RGBDSensor' (bond id '85351b84-d9cd-11e3-b0f9-f46d04929542')
[ERROR] [WallTime: 1399896720.240845] RuntimeError: No Capability 'std_capabilities/RGBDSensor' in use
[ERROR] [WallTime: 1399896720.241382] Rapp Manager : Failed to stop capabilities. Errors: ["Exception occurred while stopping cap 'std_capabilities/DifferentialMobileBase': service [/capability_server/free_capability] responded with an error: error processing request: No running launch file with PID of '13277'", "Exception occurred while stopping cap 'std_capabilities/RGBDSensor': service [/capability_server/free_capability] responded with an error: error processing request: No Capability 'std_capabilities/RGBDSensor' in use"]

As you can see in the above example (check the last few lines) , the capability server cannot find the corresponding launch file PID, when stopping the std_capabilities/DifferentialMobileBase capability. As a result this capability remains running.

Interesting is, that this capability launcher seems to be registered under a different PID:

$ rosservice call /capability_server/get_running_capabilities 
running_capabilities: 
  - 
    capability: 
      capability: kobuki_capabilities/KobukiBringup
      provider: kobuki_capabilities/kobuki_bringup
    dependent_capabilities: 
      - 
        capability: std_capabilities/DifferentialMobileBase
        provider: kobuki_capabilities/kobuki_differential_mobile_base
    started_by: kobuki_capabilities/kobuki_differential_mobile_base
    pid: 13264
  - 
    capability: 
      capability: std_capabilities/DifferentialMobileBase
      provider: kobuki_capabilities/kobuki_differential_mobile_base
    dependent_capabilities: []
    started_by: user service call
    pid: 13267

Also, stopping this capability by hand works fine.

t$ rosservice call /capability_server/stop_capability "capability: 'std_capabilities/DifferentialMobileBase'"                                                                                                                                
successful: True        
[INFO] [WallTime: 1399896720.970543] Capability Provider 'turtlebot_capabilities/robot_state_publisher' for Capability 'std_capabilities/RobotStatePublisher' has terminated.
[INFO] [WallTime: 1399896720.971037] Capability Provider 'turtlebot_capabilities/diagnostics' for Capability 'std_capabilities/Diagnostics' has terminated.
[INFO] [WallTime: 1399896883.873052] Request to stop capability 'std_capabilities/DifferentialMobileBase'
[INFO] [WallTime: 1399896883.874446] Capability Provider 'kobuki_capabilities/kobuki_differential_mobile_base' for Capability 'std_capabilities/DifferentialMobileBase' has terminated.
[INFO] [WallTime: 1399896883.874859] Stopping the 'kobuki_capabilities/kobuki_bringup' provider of the 'kobuki_capabilities/KobukiBringup' interface, because it has no dependents left.
[mobile_base-1] killing on exit
[ INFO] [1399896883.960348772]: Kobuki : waiting for kobuki thread to finish [/mobile_base].
shutting down processing monitor...
... shutting down processing monitor complete
done
[INFO] [WallTime: 1399896885.430019] Capability Provider 'kobuki_capabilities/kobuki_bringup' for Capability 'kobuki_capabilities/KobukiBringup' has terminated.
wjwwood commented 10 years ago

Ok, I'll try to look into in what cases the PID would not be available and see if I can improve the error handling as well as figure out why it might be happening and close that loop hole.

wjwwood commented 10 years ago

Requesting to stop a stopping cap probably shouldn't be treated as an error

So what should happen if you try to stop a capability which is not running at all? I would say an error and so if you treat a "stopping" capability differently then you are in a race condition to get an exception, which I think is pretty bad.

I think instead that if the code which is calling stop capability might be doing so redundantly then it should explicitly catch and pass on that failure. Do you have some use case in mind where that doesn't make sense?

wjwwood commented 10 years ago

I still haven't found where this problem of missing PID's is coming in from. I'll keep digging.

bit-pirate commented 10 years ago

Requesting to stop a stopping cap probably shouldn't be treated as an error

So what should happen if you try to stop a capability which is not running at all?

Well, here the capability is still running, but currently shutting down. Hence, stopping a already stopping cap could be treated with sending out a warning at max.

My user code (the rapp manager) actually doesn't actually call stop capability at all. It's only using use/free capability. In the present case it seems like the long shut down time of the RGBDSensor cap leads to some kind of race condition.

The strange things to me here are:

bit-pirate commented 10 years ago

If you like to test this yourself, you need

Then

$ roslaunch turtlebot_bringup rapp_platform.launch --screen

And

$ rosservice call /turtlebot/start_rapp "name: 'rocon_apps/teleop'"
wjwwood commented 10 years ago

Well, here the capability is still running, but currently shutting down.

That's my point, if you treat calling "stop capability" on a terminated (not running) and a stopping capability differently then you will have race condition dependent behavior. I don't think that is great, therefore I would say calling stop on a stopping capability is an error.

My user code (the rapp manager) actually doesn't actually call stop capability at all. It's only using use/free capability.

There we agree, there is something fishy going on which is exposing this problem internally, but I think this is separate from what should be done when stop is called (externally or internally) on a stopping capability.

wjwwood commented 10 years ago

Thanks I'll try to set that up locally and reproduce the same issue.

bit-pirate commented 10 years ago

Well, here the capability is still running, but currently shutting down.

That's my point, if you treat calling "stop capability" on a terminated (not running) and a stopping capability differently then you will have race condition dependent behavior. I don't think that is great, therefore I would say calling stop on a stopping capability is an error.

:+1:

bit-pirate commented 10 years ago

@wjwwood is there any progress on this? It's a real show stopper right now. Without this fixed, I can't merge rapps with caps support with the main branches in ROCON.

wjwwood commented 10 years ago

@bit-pirate I made a little progress about three days ago getting the environment setup and identifying potential problems, but I haven't fixed the issue yet. I'll spend time on it today.