The basic problem seems to be that local requester request changes pile up in unexpected places outside the requester feedback function, which can leave the local and received rsets disagreeing when they are compared to each other in the feedback function.
I think you are right about the cause of this problem. As you say, a good solution is not obvious. Deleting obsolete requests has been a problem all along. It's basically a design bug in my current implementation.
I'll try to build a test case to reproduce the error, and see what I can figure out.
My fork of the package has some recent, substantial changes to the state transition code. You could try that to see if it helps, but I don't think it fixes this.
My next task was to redesign the merge and reconcile mechanism, anyway. If I can reproduce this failure in a test case, that will be a big help.
Looking closer at the code, validate() should return False for a NEW -> RELEASED transition, and reconcile() should not replace the new request.
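To make that rule concrete, here is a rough sketch of the kind of transition check I mean; the names, status values and table contents are made up for illustration and are not the package's actual internals:

# Illustrative only: names and table are assumptions, not rocon_scheduler_requests code.
NEW = 0        # matches "status: 0" in the logs below
RELEASED = 8   # matches "status: 8" in the logs below

# Hypothetical set of allowed (old_status, new_status) pairs; the point is
# simply that (NEW, RELEASED) must not be in it.
VALID_TRANSITIONS = {
    (NEW, NEW),
    (RELEASED, RELEASED),
    # ... the other legal transitions would be listed here ...
}

def validate(old_status, new_status):
    """Return True only if old_status -> new_status is a legal transition."""
    return (old_status, new_status) in VALID_TRANSITIONS

assert not validate(NEW, RELEASED)   # the transition seen in this bug report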
I need a good test case, so I can dig deeper. Sometimes these problems occur on the opposite side of the protocol than one would expect.
I just created a merge() test case that reproduces my understanding of this situation.
Since the test case works with my current package, either I inadvertently fixed the bug somehow or, more likely, I don't understand the problem yet.
@stonier: Please review the new test case to see if it actually recreates the problem you found. If so, we should try it with your hydro-devel fork to see if it works there, too.
Meanwhile, I'll work on a higher-level test case with separate requester and scheduler nodes. That is a lot more work, so I had hoped to catch it at the merge() level.
The basic problem seems to be that local requester request changes pile up in unexpected places outside the requester feedback function, which can leave the local and received rsets disagreeing when they are compared to each other in the feedback function.
Where are the request changes piling up?
How many threads are there? Are things being done outside the main rospy callback thread?
Hrm, that's a good unit test case. I would have thought that would have exposed the problem.
For the record, I was calling those from a separate thread that was monitoring the state and then doing the cancel/reissue after a certain timeout. I'm also occasionally seeing some other odd behaviour that I don't fully understand yet when making various calls from inside the requester feedback function; since I don't yet understand the implications of what I'm doing, perhaps I'm calling things the wrong way. At any rate, if I can work out what works and what is dangerous, that is useful information.
Have been updating the robotics-in-concert/hydro-devel fork with your latest and greatest so we are testing on the same base... :)
I'll dig deeper this morning and update in a couple of hours.
The rocon_scheduler_requests implementation is definitely not thread-safe if called from separately-created Python threads.
My understanding of the rospy threading model (which may be wrong) is that all topic and timer callbacks run in the main Python thread where rospy.init_node() and rospy.spin() are called.
If rospy.init_node() is called in some other thread, there may be difficulties with signal handling.
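For reference, the usual rospy workaround when init_node cannot run in the main thread is its disable_signals option; a minimal sketch (the node name is just a placeholder):

import threading
import rospy

def ros_main():
    # Python only allows signal handlers to be registered from the main
    # thread, so init_node must opt out of them when called elsewhere.
    rospy.init_node('example_requester', disable_signals=True)
    rospy.spin()

threading.Thread(target=ros_main).start()
# With signals disabled, the application must trigger shutdown itself,
# e.g. rospy.signal_shutdown('done') when it is finished.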
I tried to create a test case for the example scheduler that would make requests similar to the problem reported here.
It does not work.
After step 2, the requests seem to get tangled up in repetitive canceled messages, which never get deleted.
Maybe that's a bug in the test case, but I committed the code so @stonier can look at it, too. The bug could also be in the example_scheduler.py node, which is relatively new and not well-tested.
I'll continue debugging it for a while longer tonight.
I have to catch up on what you've written - it might be related to something I sometimes see in the scheduler: my scheduler's update function seems to get caught in a recursive loop and called over and over (1-2 ms spacing).
I did some parallel debugging to catch the merge problem - here's some output of a single merge operation which goes wrong.
It shows the reconciling of that new request against None, and the resultant released status of that request.
[WARN] [WallTime: 1387417766.804604] Merging Operations:
Update:
id: 949c450c-1d3a-4362-a562-11b5b89150b6
priority: 10
resources:
linux.*.ros.pc.dudettece37c1b32ff243528db562afae3fcab1/rocon_apps/talker
linux.*.ros.pc.unallocated/rocon_apps/listener
linux.*.ros.pc.dude02cd58dcb8314d3db6ccb136e57a3979/rocon_apps/listener
status: 8
Update:
id: 8d7bc6a9-3d81-4e50-96c9-323c9e49c533
priority: 0
resources:
linux.*.ros.pc.*/rocon_apps/listener
status: 8
Deleting:
id: 949c450c-1d3a-4362-a562-11b5b89150b6
priority: 10
resources:
linux.*.ros.pc.dudettece37c1b32ff243528db562afae3fcab1/rocon_apps/talker
linux.*.ros.pc.unallocated/rocon_apps/listener
linux.*.ros.pc.dude02cd58dcb8314d3db6ccb136e57a3979/rocon_apps/listener
status: 7
because of update
id: 949c450c-1d3a-4362-a562-11b5b89150b6
priority: 10
resources:
linux.*.ros.pc.dudettece37c1b32ff243528db562afae3fcab1/rocon_apps/talker
linux.*.ros.pc.unallocated/rocon_apps/listener
linux.*.ros.pc.dude02cd58dcb8314d3db6ccb136e57a3979/rocon_apps/listener
status: 8
Deleting:
id: 8d7bc6a9-3d81-4e50-96c9-323c9e49c533
priority: 0
resources:
linux.*.ros.pc.*/rocon_apps/listener
status: 7
because of update
id: 8d7bc6a9-3d81-4e50-96c9-323c9e49c533
priority: 0
resources:
linux.*.ros.pc.*/rocon_apps/listener
status: 8
Reconciling:
id: d00d61b6-6951-428a-9bc3-e6346f7e1486
priority: 10
resources:
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.dudette/rocon_apps/talker
status: 0
with
None
[ERROR] [WallTime: 1387417766.805344] Merged Request Set:
Request:
id: d00d61b6-6951-428a-9bc3-e6346f7e1486
priority: 10
resources:
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.dudette/rocon_apps/talker
status: 8
I need to compare this with the test unit case you have.
The test case in 25d61b49658d68c049af902a0d522fc8ad4e8400 should be doing the same thing - if the behaviour above is correct, then the following line should fail, because your eq operator takes the status flag change into account:
self.assertEqual(rset, RequestSet([msg2], RQR_UUID,
                 contents=ResourceReply))
I'll run your test case and verify.
I used to see similar topic spamming at about that rate when the rset comparison operators had bugs.
Maybe the scheduler callback always thinks the new request is different from the previous one.
Maybe the requester feedback handler is always modifying its request in a way that causes the updated message to ping-pong between two different values.
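One cheap way to check that theory would be to guard the resend on whether the set actually changed; a rough sketch of a method one might drop into the requester for diagnosis (prev_rset and this guard are introduced here purely for illustration, they are not part of the existing code):

import copy
import rospy

def feedback(self, rset):
    # Hypothetical diagnostic: skip the reply if nothing changed since the
    # previous callback (relies on the RequestSet eq operator and assumes
    # the set can be deep-copied).
    if getattr(self, 'prev_rset', None) == rset:
        rospy.logdebug('request set unchanged since last feedback call')
        return
    self.prev_rset = copy.deepcopy(rset)
    # ... normal feedback handling continues here ...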
The same output style from the unit test case shows the merged set with the updated status, confirming the error (status changed from 0 -> 8). For some reason the asserts aren't picking this up.
**************** test_canceled_merge_plus_new_request ****************
Merging Operations:
Update:
id: 01234567-89ab-cdef-fedc-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.roberto/test_rapp
status: 8
Reconciling:
id: 01234567-cdef-fedc-89ab-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.*/test_rapp
status: 0
with
None
Deleting:
id: 01234567-89ab-cdef-fedc-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.roberto/test_rapp
status: 7
because of update
id: 01234567-89ab-cdef-fedc-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.roberto/test_rapp
status: 8
Merged Request Set:
Request:
id: 01234567-cdef-fedc-89ab-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.*/test_rapp
status: 8
If I am interpreting your output correctly, this looks like a bug in the cancelled request deletion logic, right?
The unit test is giving us a false positive because in
self.assertEqual(rset, RequestSet([msg2], RQR_UUID,
contents=ResourceReply))
it is using msg2, which is a reference to an object that got updated in the merge - at this point it also has status 8.
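In other words (a generic demonstration of the false positive, not the actual test code):

import copy

# Not the real test: just the aliasing pattern that hides the failure.
msg2 = {'id': '01234567-cdef-fedc-89ab-ba9876543210', 'status': 0}   # NEW
expected = copy.deepcopy(msg2)     # snapshot taken before the merge can mutate msg2

msg2['status'] = 8                 # merge() updates the very object the test reuses

assert [msg2] == [msg2]            # false positive: both sides now show status 8
assert [expected] != [msg2]        # the correct comparison exposes the 0 -> 8 change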
The cancel deletions are working fine, I think - it's just that the new request gets modified from NEW (0) to RELEASED (8) when reconciling against 'None'.
Excellent! Thanks for the help. Those changes in the deep structure are tricky.
The unit test does fail when making the correct comparison. I wonder if some of my other test cases have similar bugs.
The fix above is simple; you can apply it to your own branch, or test with an updated version from my master branch.
I had failed to clearly distinguish an important difference between schedulers, which never create new requests, and requesters, which do. Missing requests are significant to the scheduler, but should be ignored by the requester.
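Roughly, the distinction amounts to something like the following sketch; the function and attribute names are illustrative, not the actual package code:

NEW = 0   # status code for a brand-new request, per the logs above

def handle_missing(request, is_requester):
    """Sketch: decide what to do with a locally-held request that the
    incoming message does not mention (names here are illustrative)."""
    if is_requester and request.status == NEW:
        # The requester created this request and the scheduler simply has
        # not acknowledged it yet: ignore the omission and keep it as-is.
        return
    # For a scheduler (or for any non-NEW request), a missing entry is
    # meaningful, so reconcile against None and let the request be released.
    request._reconcile(None)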
Thanks for the help, Daniel. Sorry this problem cost you so much time.
The topic spamming problem with my new example_scheduler test still exists.
The requester and scheduler have gotten out of sync and are sending each other alternating requests:
scheduler callback:
id: e84706dc-b8e7-45b9-b9f8-5d122dce10ba
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 2
scheduler callback:
id: 2a8236eb-769b-41a1-8c0d-26dfdd030212
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 8
id: e84706dc-b8e7-45b9-b9f8-5d122dce10ba
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 2
scheduler callback:
id: e84706dc-b8e7-45b9-b9f8-5d122dce10ba
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 2
I am opening a new issue for that problem, and plan to work on it tomorrow.
Seems to work; going to close this issue, and we can continue in the new issue above.
I seem to be running into trouble in ResourceRequests' _reconcile method.
Usage scenario:
Here it looks like it can get into trouble.
Inside merge, the updates include the releasing -> released updates from the scheduler, but nothing yet from the new request (that will come in the next feedback message). When iterating over the requester's rset, it will call rq._reconcile(None) on the new request sitting in the rset. Inside rq._reconcile, this relegates the new request to RELEASED.
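Paraphrasing the path as I understand it (the requests dict and updates.get() here are just my guess at the internals, not the real source):

# Rough paraphrase of the problem path, not the actual merge() implementation.
def merge(self, updates):
    for rq_id, rq in self.requests.items():
        update = updates.get(rq_id)   # None for the brand-new request, since
                                      # the scheduler has not seen it yet
        rq._reconcile(update)         # with update=None this walks the NEW
                                      # request straight to RELEASED (0 -> 8)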
If I space the cancel and the new request far enough apart (e.g. in a separate thread with enough sleep for the SchedulerRequests to go from the requester to the scheduler and back again), then I don't run into this problem (not a solution suggestion, just verification of the behaviour).