osrf / capabilities

Implements the concept of capabilities as part of the robots-in-concert system.

reference counting for user started capabilities #19

Closed wjwwood closed 10 years ago

wjwwood commented 10 years ago

The idea is that if multiple apps use a capability, then it should remain running until all of them are done with it, at which point it should shut itself down.

tfoote commented 10 years ago

One thing to think about is what happens if the app server or remote requester goes down hard.

bit-pirate commented 10 years ago

@tfoote Can't think of an easy solution for that right now. A more complex solution would involve implementing states for capabilities (e.g. starting, idling, active) and then shutting down idling capabilities after a certain time.

For now, leaving the capabilities running seems like a good simple option to me.

wjwwood commented 10 years ago

We discussed a heartbeat system, possibly using bond_core, which would mitigate the issue of the app server going down hard.

Assuming we can reliably track the usage of capabilities, I propose we make the shutdown behavior configurable by the robot developer. I would argue the robot developer is the only person who has the knowledge about capability implementation startup costs to make a decision about this. The app developer may have some idea of how often it will be using and freeing a capability, but they still cannot know how expensive the shutdown/startup of those capabilities is.

There would be a parameter like ~unused_capability_shutdown_time. If set to zero then the capability is shut down immediately; otherwise the server will wait ~unused_capability_shutdown_time seconds before shutting the unused capabilities down.

As proposed before, calling /stop_capability will shut it down immediately, even if others are using it. However, if everyone calls /free_capability and the reference count goes to zero, then the capability server will wait ~unused_capability_shutdown_time seconds; if the reference count is still zero after that time, it will shut down the capability.
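A minimal sketch of how that could work, including the -1 ("never shut down") value suggested further down; the class and callback names here are illustrative, not the actual capability server implementation:

```python
# Illustrative sketch of the proposed reference counting with delayed shutdown.
# The class and callback names are not the actual capability server code.
import threading

class CapabilityRefCounter(object):
    def __init__(self, unused_capability_shutdown_time, shutdown_cb):
        self._timeout = unused_capability_shutdown_time  # 0: immediate, -1: never
        self._shutdown_cb = shutdown_cb   # called with the capability name
        self._counts = {}                 # capability name -> reference count
        self._timers = {}                 # capability name -> pending shutdown timer
        self._lock = threading.Lock()

    def use(self, capability):
        with self._lock:
            # A new user cancels any pending shutdown.
            timer = self._timers.pop(capability, None)
            if timer is not None:
                timer.cancel()
            self._counts[capability] = self._counts.get(capability, 0) + 1

    def free(self, capability):
        with self._lock:
            self._counts[capability] -= 1
            if self._counts[capability] > 0 or self._timeout < 0:
                return  # still in use, or shutdown disabled (-1)
            if self._timeout == 0:
                self._shutdown_cb(capability)
                return
            # Wait before shutting down, in case someone uses it again.
            timer = threading.Timer(self._timeout, self._maybe_shutdown, [capability])
            self._timers[capability] = timer
            timer.start()

    def _maybe_shutdown(self, capability):
        with self._lock:
            self._timers.pop(capability, None)
            if self._counts.get(capability, 0) == 0:
                self._shutdown_cb(capability)
```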

wjwwood commented 10 years ago

I don't know how the heartbeat would be integrated, I need to look more closely at bond_core.

wjwwood commented 10 years ago

I am setting up a "capabilities client" API in Python which will wrap up any calls to ROS along with bond creation and maintenance.

bit-pirate commented 10 years ago

Can't provide more insight into bond_core. What I know is from using it with nodelets, but that doesn't cover more than what you have already described.

There would be a parameter like ~unused_capability_shutdown_time. If set to zero then the capability is shut down immediately; otherwise the server will wait ~unused_capability_shutdown_time seconds before shutting the unused capabilities down.

Could we add a no-shutdown behaviour for caps when setting ~unused_capability_shutdown_time to 0 or -1? I'd like to have this option around, even though it might not be used often in the end.

wjwwood commented 10 years ago

@bit-pirate I am using bond_core only for its heartbeat mechanism. I am writing a Python library that the app server would use, rather than calling the services directly, and which will also set up a bond.

Preventing an unused capability from being shut down by setting ~unused_capability_shutdown_time to -1 seems reasonable to me.

wjwwood commented 10 years ago

I am still working on this; testing it is proving to be interesting. I will hopefully have something by tomorrow night.

bit-pirate commented 10 years ago

What about testing it with the TB or Kobuki capabilities, since the base-related ones all depend on the bringup capability? Reserving, freeing, and shutting down unused capabilities should all be covered, no?

wjwwood commented 10 years ago

I mean with the automated tests, in a way that lets me collect coverage information. I think I have it working now; it will just take a little longer to polish up.

wjwwood commented 10 years ago

Upgraded to a pull request.

This is still a work in progress because tests are failing and coverage is lacking, but I believe it might be functional, if you are interested in a preview. I'll continue to work on it.

There is a new Python module capabilities.client which provides a single point of entry for using and freeing capabilities, and which manages the Bonds.

Like I said, not fully baked, but a preview so you can give early feedback.
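Roughly, usage would look something like the sketch below (the exact class and method names are illustrative and may differ from the final API):

```python
# Hypothetical use of the new capabilities.client module; class and method
# names are illustrative and may not match the final API.
import rospy
from capabilities.client import CapabilitiesClient

rospy.init_node('capability_user')

client = CapabilitiesClient('/capability_server')
client.wait_for_services()  # block until the capability server is available

# Increments the server-side reference count and maintains the bond internally.
client.use_capability('std_capabilities/DifferentialMobileBase')

# ... the app does its work here ...

# Decrements the reference count; once it hits zero the server waits
# ~unused_capability_shutdown_time before shutting the capability down.
client.free_capability('std_capabilities/DifferentialMobileBase')
```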

bit-pirate commented 10 years ago

Will you add a ROS API for this as well? So far I am interacting with the capabilities (starting/stopping) using service calls.

wjwwood commented 10 years ago

The capabilities.client module uses the ROS API, but I wrapped that stuff up in Python for your convenience; otherwise you have to manage the bonds manually. Just have a look at the implementation, it is pretty straightforward.

wjwwood commented 10 years ago

Currently I create a bond for each call to /use_capability. This requires the /use_capability service to return a bond_id which is used by both sides to create the bond, and is passed to /free_capability to indicate to the capability server which bond should be cleanly broken. I do not believe having multiple bonds greatly increases the overhead, but it is sort of redundant.

An alternative would be to have a service, /get_user_id, which returns a unique bond_id. A node which intends to call /use_capability or /free_capability would first call /get_user_id and then pass that user id (or bond_id) as part of the /use_capability and /free_capability service calls. This way I would maintain a set of reference counts for each user id and capability. That allows us to have only one bond per node, and it also allows for a service call like /free_all_capabilities_by_user_id which takes a user_id and frees all capabilities which have one or more reference counts from that user id.
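A sketch of the bookkeeping this alternative implies; the class and method names are illustrative, not actual server code:

```python
# Illustrative bookkeeping for the /get_user_id alternative: one bond per user,
# with per-user reference counts for each capability.
from collections import defaultdict
import uuid

class UserReferenceTracker(object):
    def __init__(self):
        # user_id -> {capability name -> reference count}
        self._refs = defaultdict(lambda: defaultdict(int))

    def get_user_id(self):
        # The single bond for this user would be created against this id.
        return str(uuid.uuid4())

    def use_capability(self, user_id, capability):
        self._refs[user_id][capability] += 1

    def free_capability(self, user_id, capability):
        self._refs[user_id][capability] -= 1

    def free_all_capabilities_by_user_id(self, user_id):
        # Called when the user's bond breaks or on explicit request.
        return [cap for cap, count in self._refs.pop(user_id, {}).items() if count > 0]

    def total_references(self, capability):
        # A capability becomes a shutdown candidate when this reaches zero.
        return sum(per_user.get(capability, 0) for per_user in self._refs.values())
```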

Thoughts?

bit-pirate commented 10 years ago

Do I understand correctly that, when using the alternative, there would be no bond between the node using/freeing the capability and the capability server?

Generally, I would like to have the option (not the requirement) to create a bond between the capability user and either the capability server (which would need to know which capabilities are used by that user) or all the capabilities that user uses.

When using capabilities with the app manager, I would use bonds to let the capability server know when an app or the manager itself went down. However, some might not want to care about bonds (e.g. when starting caps from the command line or the rqt GUI) and should not be forced to.

I think that bond system would also be a nice addition to the app manager and apps. E.g. we currently have no way to determine if an app is up or has crashed.

wjwwood commented 10 years ago

Do I understand correctly that, when using the alternative, there would be no bond between the node using/freeing the capability and the capability server?

No, sorry, there would be one bond for each call to /get_user_id.

When using capabilities with the app manager, I would use bonds to let the capability server know when an app or the manager itself went down. However, some might not want to care about bonds (e.g. when starting caps from the command line or the rqt GUI) and should not be forced to.

I imagined that starting from the command line would use /start_capability and /stop_capability. The rqt GUI should have no problem maintaining a bond. The problem with not using a bond for the use/free services is that you can get capability "use leaks", where a capability will never shut down.

So given /start_capability and /stop_capability will circumvent the reference counting, do you see a use case where we need a reference counting mechanism which does not use bonds?

bit-pirate commented 10 years ago

Do I understand correctly that, when using the alternative, there would be no bond between the node using/freeing the capability and the capability server?

No, sorry, there would be one bond for each call to /get_user_id.

OK, another try: the difference between the two alternatives is one bond per app/capability user vs. one bond for each capability an app/capability user uses. Is that correct?

The second alternative sounds more lightweight, but I would actually like to try the first version to test the effects on system performance when using many bonds. Bonds between an app and each capability it uses could be used to ensure all required capabilities are up and running. Related question: how does the capability server track the status of the capabilities it launched?

So given /start_capability and /stop_capability will circumvent the reference counting, do you see a use case where we need a reference counting mechanism which does not use bonds?

Couldn't come up with a practical use case. The only thing I can think of is a node which is not constantly connected to the ROS network, but still wants to interact with the capability server like normal, always-connected nodes (i.e. using and freeing caps). Not sure if this is a realistic use case, though.

Hence, I'm fine with making the bond connection mandatory for freeing and using caps. Maybe it would be good to not offer a ROS API for those methods to avoid misuse?

wjwwood commented 10 years ago

Bonds between an app and each capability it uses could be used to ensure all required capabilities are up and running. Related question: how does the capability server track the status of the capabilities it launched?

So, to be clear, currently there is a bond for each time someone uses a capability, but the bond is always between the capability server and the user. There is no bond directly between the actual capability providers and the user. I track their running/not running status by monitoring the roslaunch process which launched them. If that terminates, then I would break the bond, which should indicate to the user that the capability has gone down, but my intention was always for users of the capability server to instead monitor the ~events topic for that type of notification.
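In sketch form, that tracking amounts to something like the following; the function is illustrative rather than the actual server code, and std_msgs/String stands in for the real capabilities event message:

```python
# Illustrative server-side tracking: only the roslaunch process is monitored;
# when it exits, users are notified via ~events and the associated bond broken.
# std_msgs/String is a placeholder for the real capabilities event message.
import subprocess
import threading

from std_msgs.msg import String

def launch_and_monitor(launch_file, capability_name, event_pub, bond):
    proc = subprocess.Popen(['roslaunch', launch_file])

    def monitor():
        proc.wait()  # blocks until the roslaunch process terminates
        event_pub.publish(String(data='terminated: %s' % capability_name))
        if bond is not None:
            bond.break_bond()  # signals the bonded user that the capability is gone

    threading.Thread(target=monitor).start()
    return proc
```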

Maybe it would be good to not offer a ROS API for those methods to avoid misuse?

The ROS API requires you to give a user id, and I can prevent the services from succeeding until a valid user_id with an established bond is provided. So I don't think locking down the ROS API will be necessary. I will make sure to document the purpose and intended usage of the services related to this feature.
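Conceptually the guard would look like the sketch below; the handler name, request fields, and bond bookkeeping are hypothetical, assuming bondpy's Bond exposes is_broken():

```python
# Hypothetical server-side guard: use/free requests are only honored for a
# user_id (bond_id) with an established, unbroken bond.
def _handle_free_capability(self, request):
    bond = self._bonds.get(request.bond_id)  # hypothetical bond bookkeeping
    if bond is None or bond.is_broken():
        raise RuntimeError(
            "free_capability called with invalid or unbonded id '%s'" % request.bond_id)
    self._decrement_reference_count(request.bond_id, request.capability)
```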

wjwwood commented 10 years ago

The big difference between the two options is that the option with only one bond per "user" is designed to automatically free capabilities only when the user fails hard, whereas the option with a bond for each call to ~/use_capability is also useful for notifying the user(s) of a capability that it has terminated. So when creating the bond (on the client side) you would need to set up a callback on bond break and do your handling there. On the other hand, you can do a similar callback on the ~/events topic, so I'm not sure it makes sense to use the bonds in this way.
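For illustration, the client side of the per-use-bond option looks roughly like this; the topic name is an example, and the bond id is the kind of value returned by /use_capability:

```python
# Illustrative client-side handling for the bond-per-use option, using bondpy.
# The topic name and bond id here are examples, not the actual values.
import rospy
from bondpy import bondpy

rospy.init_node('capability_user')

def on_capability_gone():
    rospy.logwarn('Capability terminated, cleaning up')

bond = bondpy.Bond('/capability_server/bonds',
                   'std_capabilities/DifferentialMobileBase0',
                   on_broken=on_capability_gone)
bond.start()

# Alternatively, subscribe to the capability server's ~events topic and handle
# termination notifications in a callback there instead of via the bond.
```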

Please let me know what you think would be better for your use case. I am leaning towards the one-bond-per-user option, but the latter (a bond per call) is what is currently implemented.

wjwwood commented 10 years ago

I am working on a different implementation which only uses one bond per user.

bit-pirate commented 10 years ago

On the other hand you can do a similar callback on the ~/events topic so I'm not sure it makes sense to use the bonds in this way.

Those events are published by the caps server and are based on the status of roslaunch for each cap, right?

This makes me wonder what this status looks like. Can and do you track the status of each node started by roslaunch, or only the status of the roslaunch process itself? If the latter is the case, every provider would need to use the <required> tag for each started node to ensure roslaunch goes down if one node goes down. Otherwise I don't think tracking the state of a capability without bonds to each started node would be reliable.

wjwwood commented 10 years ago

Those events are published by the caps server and are based on the status of roslaunch for each cap, right?

Correct.

Can and do you track the status of each node started by roslaunch or only the status of the roslaunch process itself?

I only track the roslaunch process.

If the latter is the case, every provider would need to use the <required> tag for each started node to ensure roslaunch goes down if one node goes down. Otherwise I don't think tracking the state of a capability without bonds to each started node would be reliable.

I would recommend using the <required> tag.

bit-pirate commented 10 years ago

Those events are published by the caps server and are based on the status of roslaunch for each cap, right?

Correct.

All right. I'll set up the monitoring for the app manager then.

I would recommend using the <required> tag.

I think we need to highlight this in the documentation.

bit-pirate commented 10 years ago

So, I finally got around to testing this. The implementation looks super easy to use - awesome! :thumbsup:

However, I ran into some bugs, a few of which I was able to fix (see PR #40). I finally stopped at this error:

[INFO] [WallTime: 1386847702.899006] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id 'std_capabilities/DifferentialMobileBase0')
[ERROR] [WallTime: 1386847702.900363] Bond for capability 'std_capabilities/DifferentialMobileBase' with bond id 'std_capabilities/DifferentialMobileBase0' was unexpectedly broken, freeing capability
[INFO] [WallTime: 1386847702.902670] App Manager : Stopped required capability 'std_capabilities/DifferentialMobileBase'.
[ERROR] [WallTime: 1386847702.902919] AttributeError: 'NoneType' object has no attribute 'break_bond'
[ERROR] [WallTime: 1386847702.903840] bad callback: <bound method CapabilityServer.handle_capability_events of <capabilities.server.CapabilityServer object at 0x2e77c10>>
Traceback (most recent call last):
  File "/opt/ros/hydro/lib/python2.7/dist-packages/rospy/topics.py", line 681, in _invoke_callback
    cb(msg)
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 485, in handle_capability_events
    return self.__catch_and_log(self._handle_capability_events, event)
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 471, in __catch_and_log
    return func(*args, **kwargs)
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 506, in _handle_capability_events
    instance.terminated()
  File "/opt/kobuki_workspace/catkin_ws/src/capabilities/src/capabilities/server.py", line 241, in terminated
    bond.break_bond()
AttributeError: 'NoneType' object has no attribute 'break_bond'

@wjwwood could you check what's going wrong?

PS: For testing I had to merge with master, because of the many changes made in the meantime.

coveralls commented 10 years ago

Coverage Status

Coverage remained the same when pulling e2eaa5dcdcca2cd82c183efc11a44b3b9ac4cebb on issue_19 into 1cb5990169b346ef0d7925a9348ebe0920015419 on master.

wjwwood commented 10 years ago

I also have an implementation on the issue_19_simple branch which will use just one bond per capability user.

coveralls commented 10 years ago

Coverage Status

Coverage remained the same when pulling 65d9d7056531689b1bd9b8ea387c469bc3d9582a on issue_19 into f065699a5da1161728c6c0b66d614dde667ccc5f on master.

esteve commented 10 years ago

I rebased this branch on top of master and it can now be merged cleanly, though I don't know whether this branch is finished or not.

coveralls commented 10 years ago

Coverage Status

Coverage remained the same when pulling ac504143fafc5aa7f10bbfd01de0372e777180ae on issue_19 into f065699a5da1161728c6c0b66d614dde667ccc5f on master.

bit-pirate commented 10 years ago

though I don't know whether this branch is finished or not.

Did the error reported above get fixed? If so, I'll do another test and report back.

bit-pirate commented 10 years ago

The above error seems to be fixed. However, I added a ticket for the missing timeout handling here: https://github.com/osrf/capabilities/issues/46

Haven't yet tested the issue_19_simple branch. @wjwwood please let me know when there is something ready to be tested.

coveralls commented 10 years ago

Coverage Status

Coverage remained the same when pulling 5adad1f554f88f1ca8c2582d1dd3bf6930e95b00 on issue_19 into eab4f35611d692430fdb1722612959112649609a on master.