utexas-bwi / rocon_scheduler_requests

Interfaces for managing rocon scheduler requests
http://wiki.ros.org/rocon_scheduler_requests

Request feedback informing of lost resources #10

Closed stonier closed 10 years ago

stonier commented 10 years ago

From a discussion started in #4.

I had not considered the case of services that do not need to actually maintain contact with the resources they allocate. A concrete example or two would probably clarify those requirements for me.

If that is a common situation, we may need some additional request states or another mechanism for notifying the requester that previously allocated resources are no longer available. It looks similar to the PREEMPTING -> PREEMPTED state transitions, and maybe they would be adequate. But some indication of the cause might help.

Need to outline a couple of use cases to drive this.

jack-oquin commented 10 years ago

Also from #4:

The scheduler will presumably be monitoring entry and exit from the concert, so it makes sense to use that mechanism for notification, when necessary.

My current suggestion is: rather than add new status transitions, add a uint16 reason field to the Request message. When the status changes, this field can be updated to provide additional information. Reasons might include things like: preempted, lost contact, went away, etc.
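To make the idea concrete, here is a minimal sketch of how a requester might dispatch on such a reason field. The constant values and the `FakeRequest` class are hypothetical stand-ins, not the actual rocon_msgs definitions:

```python
# Hypothetical sketch: a requester's feedback callback dispatching on a
# uint16 ``reason`` field.  The constants and Request attributes here
# are illustrative only, not the real rocon_msgs definitions.

PREEMPTED = 1      # scheduler reassigned the resources
UNAVAILABLE = 2    # a resource left the concert
TIMEOUT = 3        # scheduler lost contact with the requester

REASON_TEXT = {
    PREEMPTED: 'preempted by a higher-priority request',
    UNAVAILABLE: 'allocated resource no longer available',
    TIMEOUT: 'scheduler lost contact',
}


def feedback(request):
    """React to a status change, using ``reason`` for extra context."""
    text = REASON_TEXT.get(request.reason, 'unspecified reason')
    return 'request {}: {}'.format(request.id, text)


class FakeRequest(object):
    """Stand-in for a Request message, for illustration."""
    def __init__(self, rid, reason):
        self.id = rid
        self.reason = reason

print(feedback(FakeRequest('rq-1', UNAVAILABLE)))
```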

jack-oquin commented 10 years ago

With the revised message update in robotics-in-concert/rocon_msgs#65, this requirement can be handled by the scheduler as follows:

  1. rq.preempt(reason=Request.UNAVAILABLE)
  2. send updated Request to the requester's feedback topic
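Those two steps might look like the following on the scheduler side. Only the `preempt(reason=...)` idea mirrors the real interface; the `Request` and `FeedbackTopic` classes here are stand-in plumbing for illustration:

```python
# Sketch of the two scheduler-side steps above, with stand-in classes.

GRANTED, PREEMPTING = 'GRANTED', 'PREEMPTING'
UNAVAILABLE = 2  # stand-in for Request.UNAVAILABLE


class Request(object):
    def __init__(self, rid):
        self.id = rid
        self.status = GRANTED
        self.reason = 0

    def preempt(self, reason):
        self.status = PREEMPTING
        self.reason = reason


class FeedbackTopic(object):
    """Stand-in for the requester's feedback topic publisher."""
    def __init__(self):
        self.sent = []

    def publish(self, rq):
        self.sent.append((rq.id, rq.status, rq.reason))


def notify_lost(rq, topic):
    rq.preempt(reason=UNAVAILABLE)   # step 1: mark preempted with a cause
    topic.publish(rq)                # step 2: send the updated Request


topic = FeedbackTopic()
notify_lost(Request('rq-1'), topic)
print(topic.sent)  # [('rq-1', 'PREEMPTING', 2)]
```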

stonier commented 10 years ago

Trying to think about the best way to handle individually lost resources in the resource set of a request. Some thoughts:

1) Information that would be important to the requester - which resource(s) became unavailable?

2) Even after a resource becomes unavailable, it would be useful for a service to be notified if further resources became unavailable (and which resources). e.g. after losing a resource, the service may typically want the remaining robots to finish their jobs, cleanup and go home before cancelling the request. Still important to know if those remaining resources drop out too.

3) More difficult - recovery of a request. I was playing around with this in my requester - given a 3-resource request, if one drops out it would be useful for the requester to be able to simply make an 'on the fly' request for an additional resource without having to lose or even interrupt the use of the first two resources. If such a resource isn't available in the short term, the service logic would only then get the robots to finish what they're doing, clean up, go home, cancel and then make a brand new request.

jack-oquin commented 10 years ago

Yeah. I was forgetting about all that, just looking at requests as all-or-nothing.

If a single request for three robots was initially queued because only two were available, nothing would be allocated until a third robot checked in. Similarly, if they are no longer all available, the scheduler no longer considers that request viable. When the requester receives the PREEMPTING message, its resources are all still assigned even though not all available.

I suppose the requester could cancel the failed request and immediately queue new requests for any remaining devices. That seems quite awkward. Maybe we should provide a new transition back to GRANTED for whatever remains, with the missing resource(s) deleted.

We had some discussion of a general device discovery interface (#9). At one time, I thought the conductor was providing that information, but only the scheduler knows what is allocated and what is available. Perhaps that mechanism could be useful here.

jack-oquin commented 10 years ago

If we had some way of linking sets of requests, it would be easier to provide piecemeal failure notification.

But, much of the complexity would remain. If three requests are grouped together and one of them fails, what does a requester do?

stonier commented 10 years ago

If three requests are grouped together and one of them fails, what does a requester do?

I think you mean with the current code 'three resources'? I suspect you're thinking of what you mentioned in another thread - combinations of requests formulating a single request (a and (b or c)).

I think that's a great question because I don't have a definitive answer. It depends very much on what the requester wants to do - and there are several very valid options all depending on how it wants to handle this situation. e.g.

  1. The requester immediately cancels all resources and issues a new request.
  2. The requester delays, finishes current jobs with the remaining resources, cleans up, then cancels and issues a new request.
  3. The requester holds on to the remainder and makes a new request for the lost device.
  4. The scheduler tries to fill the missing resource.

Some misgivings about these. 1. is very impractical: having two robots stop and go home, only to come back out and commence immediately because a third member was replaced. 2. is suitable in some situations, not others. 4. unfortunately locks the requester into a single behavioural response, and raises dining-philosophers deadlock issues.

I tried to play around with 3. in my ResourcePoolRequester for chatter/turtle concerts. It uses timeouts to 'dumbly' avoid the deadlock and cancels/reissues everything when the timeout is reached.
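The timeout fallback described above might be sketched like this. None of these names come from the actual ResourcePoolRequester code; the clock is injected so the logic can be exercised without real waiting:

```python
# Hypothetical sketch of option 3 with a timeout fallback: wait for a
# replacement grant, and if it does not arrive in time, fall back to
# cancelling and reissuing everything.

def recover(poll_granted, timeout, clock):
    """Return 'recovered' if the replacement is granted before
    ``timeout`` elapses, else 'reissue' (cancel everything, start over).

    ``poll_granted()`` reports whether the replacement request has been
    granted; ``clock`` is an iterator of timestamps.
    """
    start = next(clock)
    while True:
        if poll_granted():
            return 'recovered'   # merge replacement, keep the survivors
        if next(clock) - start >= timeout:
            return 'reissue'     # 'dumb' deadlock avoidance: give up


# A replacement that never arrives hits the timeout:
print(recover(lambda: False, 5, iter(range(100))))  # reissue
```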

An interesting feature might be to let a requester 'merge' separate requests for which some have lost resources. That'd be much more elegant than the juggling around I'm doing with half-filled and new requests all over the place in the ResourcePoolRequester. This is not critical though - as all that juggling is under the hood in the custom requester.

jack-oquin commented 10 years ago

If three requests are grouped together and one of them fails, what does a requester do?

I think you mean with the current code 'three resources'? I suspect you're thinking of what you mentioned in another thread - combinations of requests formulating a single request (a and (b or c)).

In that message I was thinking of an alternative idea which we discussed briefly several weeks ago in our Hangout conversation: allowing one resource per request, while providing "request groups" which must be allocated together.

Our current approach with multiple resources per request is simpler and seems better, but it does create complications when there are partial device outages.

  • The requester immediately cancels all resources and issues a new request.

This works already.

In some simple applications it may be fine, especially if a scheduler somehow takes location into account when allocating from a pool of otherwise equivalent devices.

In other cases, your misgiving of robots wandering home only to return immediately is disturbing and important to keep in mind.

  • The requester delays, finishes current jobs with the remaining resources, cleans up, then cancels and issues a new request.

This can be done by leaving the request in the PREEMPTING state for a while. I intend to provide a timeout for that, similar to the loss-of-contact timeout I implemented recently.

That preemption timeout is not currently implemented, so the requester can hang onto remaining preempted resources as long as it wants to, right now. But, we need a mechanism for the scheduler to allow a reasonable period of time for clean-up without tying up resources forever. The requester and scheduler could possibly even negotiate how long to wait using the hold_time and availability fields. The scheduler's preempt() call could set availability to some reasonable time limit, and the requester could update hold_time if it has different requirements.
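That negotiation idea could be sketched as follows. The field semantics are only what is proposed above, and the rule that the scheduler enforces the shorter of the two windows is my own assumption for the sake of a runnable example:

```python
# Sketch of the hold_time/availability negotiation proposed above.
# All classes are stand-ins, not the real message types, and the
# min() rule is an assumption, not a settled design.

class Msg(object):
    def __init__(self):
        self.availability = None   # clean-up window offered by scheduler (s)
        self.hold_time = None      # clean-up time the requester wants (s)


def scheduler_preempt(msg, cleanup_window=60.0):
    msg.availability = cleanup_window       # "you have this long"
    return msg


def requester_reply(msg, needed=20.0):
    msg.hold_time = needed                  # "I only need this long"
    return msg


def effective_window(msg):
    # assumed rule: scheduler grants the shorter of the two windows
    return min(msg.availability, msg.hold_time)


m = requester_reply(scheduler_preempt(Msg()))
print(effective_window(m))  # 20.0
```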

  • The requester holds on to the remainder and makes a new request for the lost device.

This can probably be done by adding a hold() transition from the PREEMPTING state back to GRANTED status, minus any unavailable resources. If we decide to add this, we'll need to work out the details, which may involve an additional request state.
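A minimal sketch of that proposed hold() transition, with the state names from this thread but an otherwise hypothetical Request class:

```python
# Sketch of the proposed hold() transition: from PREEMPTING back to
# GRANTED, dropping whatever resources are no longer available.

GRANTED, PREEMPTING = 'GRANTED', 'PREEMPTING'


class Request(object):
    def __init__(self, resources):
        self.status = GRANTED
        self.resources = set(resources)
        self.unavailable = set()

    def preempt(self, unavailable):
        self.status = PREEMPTING
        self.unavailable = set(unavailable)

    def hold(self):
        """Keep the surviving resources and resume GRANTED status."""
        if self.status != PREEMPTING:
            raise ValueError('hold() only valid while PREEMPTING')
        self.resources -= self.unavailable
        self.unavailable = set()
        self.status = GRANTED


rq = Request({'turtle1', 'turtle2', 'turtle3'})
rq.preempt(unavailable={'turtle2'})
rq.hold()
print(rq.status, sorted(rq.resources))  # GRANTED ['turtle1', 'turtle3']
```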

  • The scheduler tries to fill the missing resource.

As you say, that could lead to deadlocks. I suppose it could be done conditionally, with all the resources preempted only if a deadlock actually occurs. I am pretty sure there are reasonable algorithms a scheduler could use for detecting loops in the "wait-for" graph.
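For reference, a scheduler could spot such deadlocks with standard depth-first cycle detection on the wait-for graph; this is a generic algorithm sketch, not scheduler code:

```python
# Detect cycles in a "wait-for" graph: request A waits on a resource
# held by request B, and so on.  ``wait_for`` maps each request id to
# the ids it is waiting on.

def has_deadlock(wait_for):
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True                 # back edge: cycle found
        visiting.add(node)
        cyclic = any(visit(nxt) for nxt in wait_for.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return cyclic

    return any(visit(node) for node in list(wait_for))


print(has_deadlock({'A': ['B'], 'B': ['A']}))           # True
print(has_deadlock({'A': ['B'], 'B': ['C'], 'C': []}))  # False
```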

An interesting feature might be to let a requester 'merge' separate requests for which some have lost resources. That'd be much more elegant than the juggling around I'm doing with half-filled and new requests all over the place in the ResourcePoolRequester. This is not critical though - as all that juggling is under the hood in the custom requester.

Certainly worth consideration. Our current interface does not provide any reasonable mechanism for expressing things like that. Each request has a unique ID and operates independently of all the others. The "request group" idea mentioned above might provide a useful mechanism for re-juggling sets of requests that are needed together.

jack-oquin commented 10 years ago

I have not done any more work specifically for this issue, but the simple scheduler implementation (utexas-bwi/rocon_experimental#8) does now publish some of the desired information on its /resource_pool topic.

Resources that had previously been known to the scheduler are marked with a MISSING status, but remain in the pool. A previously-allocated but now missing resource remains allocated to the original request. The requester can detect the status change by subscribing to /resource_pool, although it would be more convenient if we provided an API for tracking changes to requested resources.

If the missing, allocated resource rejoins the Concert, it should revert to ALLOCATED status without any requester intervention. I say "should" because I doubt that situation has been tested yet.

So, all allocated resources belong to the requester until it cancels them, or the scheduler preempts them, whether MISSING or not.
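In the absence of that tracking API, a requester could diff successive /resource_pool snapshots itself, roughly like this. The pool layout (name-to-status mapping) and the status constants are assumptions for illustration, not the real message fields:

```python
# Sketch: diff successive /resource_pool snapshots and report status
# changes for resources allocated to this requester.

MISSING, ALLOCATED = 'MISSING', 'ALLOCATED'


def pool_changes(previous, current, my_resources):
    """Yield (name, old_status, new_status) for our resources."""
    for name in my_resources:
        old = previous.get(name)
        new = current.get(name)
        if old != new:
            yield (name, old, new)


before = {'turtle1': ALLOCATED, 'turtle2': ALLOCATED}
after = {'turtle1': ALLOCATED, 'turtle2': MISSING}
print(list(pool_changes(before, after, ['turtle1', 'turtle2'])))
# [('turtle2', 'ALLOCATED', 'MISSING')]
```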

I suggest that we experiment with what is currently implemented to get a better feel for what ROCON services actually want to do in these tricky situations.

stonier commented 10 years ago

Yeah, I agree.

About to embark on the whole leaving/rejoining client problem - I'm afraid the concert will handle all these corner cases poorly. But we needed to get the gateway upgrades that @piyushk has put in before tackling these.

I also agree we should test the scheduler on current simpler use cases for now - doing that with the fake teleop scenario.

stonier commented 10 years ago

Closing this here, continuing relevant topics over in https://github.com/utexas-bwi/concert_scheduling/issues/23 and https://github.com/utexas-bwi/concert_scheduling/issues/24.