Closed: wjwwood closed this issue 10 years ago.
One thing to think about is what happens if the app server or remote requested goes down hard?
@tfoote Can't think of an easy solution for that right now. A more complex solution would include implementing states for capabilities, e.g. starting, idling, active, and then shutting down the idling capabilities after a certain time.
For now, leaving the capabilities running seems like a good simple option to me.
We discussed a heartbeat system, possibly using `bond_core`, which would mitigate the issue of the app server going down hard.
Assuming we can reliably track the usage of capabilities, I propose we make the behavior at shutdown configurable by the robot developer. I would argue the robot developer is the only one who has the knowledge about capability implementation startup costs to make a decision about this. The app developer may have some idea about how often it will be using and freeing a capability, but they still cannot know how expensive the shutdown/startup of those capabilities is.
There would be a parameter like `~unused_capability_shutdown_time`. If set to zero then it is shut down immediately, otherwise it will wait `~unused_capability_shutdown_time` seconds before shutting the unused capabilities down.
As proposed before, calling `/stop_capability` will shut it down immediately, even if others are using it. However, if everyone calls `/free_capability` and the reference count goes to zero, then the capability server will wait `~unused_capability_shutdown_time` seconds; if the reference count is still zero after that time, it will shut down the capability.
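The grace-period behavior described above could be modeled roughly like this. This is a minimal, ROS-free sketch: class and attribute names are made up for illustration, and the real server would terminate a roslaunch process rather than flip a flag.

```python
import threading

class CapabilityInstance:
    """Toy model of reference counting with delayed shutdown.

    `unused_shutdown_time` plays the role of ~unused_capability_shutdown_time:
    0 shuts down immediately once unused, a negative value means never shut
    down automatically, and a positive value is a grace period in seconds.
    """

    def __init__(self, name, unused_shutdown_time=5.0):
        self.name = name
        self.unused_shutdown_time = unused_shutdown_time
        self.reference_count = 0
        self.running = True
        self._timer = None
        self._lock = threading.Lock()

    def use(self):
        with self._lock:
            self.reference_count += 1
            if self._timer is not None:
                # Someone started using it again: cancel the pending shutdown.
                self._timer.cancel()
                self._timer = None

    def free(self):
        with self._lock:
            self.reference_count -= 1
            if self.reference_count > 0:
                return
            if self.unused_shutdown_time == 0:
                self.running = False  # shut down immediately
            elif self.unused_shutdown_time > 0:
                # Wait the grace period, then shut down only if still unused.
                self._timer = threading.Timer(self.unused_shutdown_time,
                                              self._shutdown_if_still_unused)
                self._timer.start()
            # Negative value: leave the capability running indefinitely.

    def _shutdown_if_still_unused(self):
        with self._lock:
            if self.reference_count == 0:
                self.running = False
```

The key property is that a `use()` arriving during the grace period cancels the pending shutdown, so short gaps between users do not pay the restart cost.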
I don't know how the heartbeat would be integrated; I need to look more closely at `bond_core`.
I am setting up a "capabilities client" API in python which will wrap up any calls to ROS along with bond creation and maintenance.
Can't provide more insight into `bond_core`. What I know is from using it with nodelets, but that doesn't cover more than what you have already described.
There would be a parameter like `~unused_capability_shutdown_time`. If set to zero then it is shut down immediately, otherwise it will wait `~unused_capability_shutdown_time` seconds before shutting the unused capabilities down.
Could we add a no-shutdown behaviour for caps when setting `~unused_capability_shutdown_time` to 0 or -1? I'd like to have this option around, even though it might not be used often in the end.
@bit-pirate I am using `bond_core` only for its heartbeat mechanism. I am writing a Python library that the app server would use, rather than calling the services directly, and which will also set up a bond.
Preventing an unused capability from being shut down by setting `~unused_capability_shutdown_time` to `-1` seems reasonable to me.
I am still working on this, testing it is proving to be interesting. I will hopefully have something by tomorrow night.
What about testing it with the TB or Kobuki capabilities, since the base-related ones all depend on the bringup capability? Reserving, freeing, and shutting down unused capabilities should all be covered, no?
I mean with the automated tests, in a way that I can collect coverage information. I think I have it working now; it will just be a little longer for me to polish it up.
Upgraded to a pull request.
This is still a work in progress because tests are failing and coverage is lacking, but I believe it might be functional if you are interested in a preview. I'll continue to work on it.
There is a new Python module, `capabilities.client`, which provides a single point of entry for using and freeing capabilities, and which manages the bonds.
Like I said, not fully baked, but a preview so you can give early feedback.
Will you add a ROS API for this as well? So far I am interacting with the capabilities (starting/stopping) using service calls.
The `capabilities.client` module uses the ROS API, but I wrapped that stuff up in Python for your convenience; otherwise you have to manage the bonds manually. Just have a look at the implementation, it is pretty straightforward.
Currently I create a bond for each call to `/use_capability`. This requires the `/use_capability` service to return a `bond_id`, which is used by both sides to create the bond, and which is passed to `/free_capability` to indicate to the capability server which bond should be cleanly broken. I do not believe having multiple bonds greatly increases the overhead, but it is somewhat redundant.
An alternative would be to add two services. The first, `/get_user_id`, returns a unique `bond_id`. A node which intends to call `/use_capability` or `/free_capability` would first call `/get_user_id` and then pass that user id (or `bond_id`) as part of the `/use_capability` and `/free_capability` service calls. This way I would maintain a set of reference counts for each user id and capability. That allows us to have only one bond per node, and it also allows for a second service like `/free_all_capabilities_by_user_id`, which takes a `user_id` and frees all capabilities which have one or more reference counts from that user id.
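A minimal sketch of the per-user bookkeeping this alternative implies. This is pure Python with invented names for illustration; the real implementation would live behind ROS service handlers.

```python
from collections import defaultdict

class UserReferenceTracker:
    """Per-user reference counting for capabilities, as in the proposed
    /get_user_id alternative: one bond per user, many counted uses."""

    def __init__(self):
        # (user_id, capability) -> reference count
        self._counts = defaultdict(int)

    def use_capability(self, user_id, capability):
        self._counts[(user_id, capability)] += 1

    def free_capability(self, user_id, capability):
        key = (user_id, capability)
        if self._counts[key] <= 0:
            raise ValueError("free without a matching use")
        self._counts[key] -= 1

    def free_all_capabilities_by_user_id(self, user_id):
        """What the server would do when a user's bond breaks: drop every
        reference that user holds, on any capability."""
        for key in [k for k in self._counts if k[0] == user_id]:
            del self._counts[key]

    def total_references(self, capability):
        """When this reaches zero the capability becomes a shutdown candidate."""
        return sum(c for (_, cap), c in self._counts.items() if cap == capability)
```

With this layout a single broken bond cleans up all of that user's uses in one pass, which is exactly what a hard crash of an app (or the app server) needs.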
Thoughts?
Do I understand right, that when using the alternative, there would be no bond between the node using/freeing the capability and the capability server?
Generally, I would like to have the option (not the requirement) to create a bond between the capability user and either the capability server (which would need to know which capabilities are used by that user) or all the capabilities that user uses.
When using capabilities with the app manager, I would use bonds to let the capability server know when an app or the manager itself went down. However, some might not want to care about bonds, e.g. when starting caps from the command line or the rqt GUI, and should not be forced to.
I think that bond system would also be a nice addition to the app manager and apps. E.g. we currently have no way to determine if an app is up or has crashed.
Do I understand right, that when using the alternative, there would be no bond between the node using/freeing the capability and the capability server?
No, sorry, there would be one bond for each call to `/get_user_id`.
When using capabilities with the app manager, I would use bonds to let the capability server know, when an app or the manager itself went down. However, some might not want to care about bonds, e.g. starting caps from the command line and rQt GUI, and should not be forced to.
I imagined that starting from the command line would use `/start_capability` and `/stop_capability`. The rqt GUI should have no problem maintaining a bond. The problem with not using a bond for the use/free services is that you can get capability "use leaks", where a capability will never shut down.
So given that `/start_capability` and `/stop_capability` will circumvent the reference counting, do you see a use case where we need a reference counting mechanism which does not use bonds?
Do I understand right, that when using the alternative, there would be no bond between the node using/freeing the capability and the capability server?
No, sorry, there would be one bond for each call to `/get_user_id`.
OK, another try: the difference between the two alternatives is one bond per app/capability user vs. one bond for each capability an app/capability user uses. Is that correct?
The second alternative sounds more lightweight, but I would actually like to try the first version to test the effects on system performance when using many bonds. Bonds between an app and each capability it uses could be used to ensure all required capabilities are up and running. Related question: how does the capability server track the status of the capabilities it launched?
So given /start_capability and /stop_capability will circumvent the reference counting, do you see a use case where we need a reference counting mechanism which does not use bonds?
Couldn't come up with a practical use case. The only thing I can think of is a node which is not constantly connected to the ROS network but still wants to interact with the capability server like normal, always-connected nodes do (i.e. freeing and using caps). Not sure if this is a realistic use case, though.
Hence, I'm fine with making the bond connection mandatory for freeing and using caps. Maybe it would be good not to offer a ROS API for those methods, to avoid misuse?
Bonds between an app and each capability it uses, could be used to ensure all required capabilities are up and running. Related question: How does the capability server track the status of the capabilities it launched?
So, to be clear, currently there is a bond for each time someone uses a capability, but the bond is always between the capability server and the user. There is no bond directly between the actual capability providers and the user. I track their running/not running status by monitoring the roslaunch process which launched them. If that terminates, then I would break the bond, which should indicate to the user that the capability has gone down, but my intention was always for the users of the capability server to instead monitor the `~events` topic for those types of notifications.
Maybe it would be good to not offer a ROS API for those methods to avoid misuse?
The ROS API requires you to give a user id, and I can prevent the services from succeeding until a valid `user_id` with an established bond is provided. So I don't think locking down the ROS API will be necessary. I will make sure to document the purpose and intended usage of the services related to this feature.
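That validation step might be sketched like this. It is a ROS-free toy: method names are invented and the bond lifecycle is simulated, but it shows the idea of refusing use/free requests until the caller's bond is established.

```python
class BondGuard:
    """Reject capability use/free requests from callers that have not
    yet established a bond for their user id (illustrative sketch)."""

    def __init__(self):
        self._next_id = 0
        self._established = set()

    def get_user_id(self):
        """Hand out a unique user id; the caller must then form a bond
        with this id before the other services will accept it."""
        user_id = "user_{0}".format(self._next_id)
        self._next_id += 1
        return user_id

    def on_bond_formed(self, user_id):
        # Called by the bond machinery once the heartbeat is up.
        self._established.add(user_id)

    def on_bond_broken(self, user_id):
        # A broken bond invalidates the id; its references would be freed.
        self._established.discard(user_id)

    def handle_use_capability(self, user_id, capability):
        if user_id not in self._established:
            return False  # service call fails: no established bond
        # ... increment the reference count for (user_id, capability) ...
        return True
```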
The big difference between the two options is that the option with only one bond per "user" is designed to only automatically free capabilities when the user fails hard, whereas the option with a bond for each call to `~/use_capability` is also useful for notifying the user(s) of a capability that it has terminated. So when creating the bond (on the client side) you would need to set up a callback on bond break and do your handling there. On the other hand, you can set up a similar callback on the `~/events` topic, so I'm not sure it makes sense to use the bonds in this way.
Please let me know what you think would be better for your use case. I am leaning towards the one bond per user option, but the latter one is what is currently implemented.
I am working on a different implementation which only uses one bond per user.
On the other hand you can do a similar callback on the ~/events topic so I'm not sure it makes sense to use the bonds in this way.
Those events are published by the caps server and are based on the status of roslaunch for each cap, right?
This makes me wonder what this status looks like. Can and do you track the status of each node started by roslaunch, or only the status of the roslaunch process itself? If the latter is the case, every provider would need to use the `<required>` tag for each started node to ensure roslaunch goes down if one node goes down. Otherwise I don't think tracking the state of a capability without bonds to each started node would be reliable.
Those events are published by the caps server and are based on the status of roslaunch for each cap, right?
Correct.
Can and do you track the status of each node started by roslaunch or only the status of the roslaunch process itself?
I only track the roslaunch process.
If the latter is the case, every provider would need to use the `<required>` tag for each started node to ensure roslaunch goes down if one node goes down. Otherwise I don't think tracking the state of a capability without bonds to each started node would be reliable.
I would recommend using the `<required>` tag.
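For reference, in roslaunch "required" is an attribute on the `<node>` tag rather than a tag of its own: when a node marked `required="true"` dies, roslaunch tears down the whole launch, which is what lets the capability server treat the capability as terminated. A provider's launch file might look like this (package and node names are made up):

```xml
<launch>
  <!-- If either required node dies, roslaunch shuts everything down,
       and the capability server sees the capability as terminated. -->
  <node pkg="my_nav_pkg" type="planner_node" name="planner" required="true" />
  <node pkg="my_nav_pkg" type="controller_node" name="controller" required="true" />
</launch>
```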
Those events are published by the caps server and are based on the status of roslaunch for each cap, right? Correct.
All right. I'll set up the monitoring for the app manager then.
I would recommend using the `<required>` tag.
I think we need to highlight this in the documentation.
So, I finally got around to testing this. The implementation looks super easy to use - awesome! :thumbsup:
However, I ran into bugs of which I was able to fix a few (see PR #40). I finally stopped at this error:
```
[INFO] [WallTime: 1386847702.899006] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id 'std_capabilities/DifferentialMobileBase0')
[ERROR] [WallTime: 1386847702.900363] Bond for capability 'std_capabilities/DifferentialMobileBase' with bond id 'std_capabilities/DifferentialMobileBase0' was unexpectedly broken, freeing capability
[INFO] [WallTime: 1386847702.902670] App Manager : Stopped required capability 'std_capabilities/DifferentialMobileBase'.
[ERROR] [WallTime: 1386847702.902919] AttributeError: 'NoneType' object has no attribute 'break_bond'
[ERROR] [WallTime: 1386847702.903840] bad callback: <bound method CapabilityServer.handle_capability_events of <capabilities.server.CapabilityServer object at 0x2e77c10>>
Traceback (most recent call last):
  File "/opt/ros/hydro/lib/python2.7/dist-packages/rospy/topics.py", line 681, in _invoke_callback
    cb(msg)
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 485, in handle_capability_events
    return self.__catch_and_log(self._handle_capability_events, event)
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 471, in __catch_and_log
    return func(*args, **kwargs)
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 506, in _handle_capability_events
    instance.terminated()
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 241, in terminated
    bond.break_bond()
AttributeError: 'NoneType' object has no attribute 'break_bond'
```
@wjwwood could you check what's going wrong?
PS: For testing, I had to merge with master because of the many changes in the meantime.
I also have an implementation on the `issue_19_simple` branch which uses just one bond per capability user.
I rebased this branch on top of master and it can now be merged cleanly, though I don't know whether this branch is finished or not.
though I don't know what's the status of whether this branch is finished or not.
Did the above reported error get fixed? If yes, then I'll do another test and report back.
The above error seems fixed. However, I added a ticket for the missing timeout handling here: https://github.com/osrf/capabilities/issues/46
Haven't tested the `issue_19_simple` branch yet. @wjwwood, please let me know when there is something ready to be tested.
The idea is that if multiple apps use a capability then it should remain running until they are both done with it, at which time it should shut itself down.