The basic problem seems to be that local requester request changes pile up in unexpected places outside the requester feedback function, which can leave the local and received rsets disagreeing when they are compared to each other in the feedback function.
I think you are right about the cause of this problem. As you say, a good solution is not obvious. Deleting obsolete requests has been a problem all along. It's basically a design bug in my current implementation.
I'll try to build a test case to reproduce the error, and see what I can figure out.
My fork of the package has some recent, substantial changes to the state transition code. You could try that to see if it helps, but I don't think it fixes this.
My next task was to redesign the merge and reconcile mechanism, anyway. If I can reproduce this failure in a test case, that will be a big help.
Looking closer at the code, validate() should return False for a NEW -> RELEASED transition, and reconcile() should not replace the new request.
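To make that rule concrete, here is a rough sketch of the kind of transition check I mean; the names, status values and table contents are made up for illustration and are not the package's actual internals:

# Illustrative only: names and table are assumptions, not rocon_scheduler_requests code.
NEW = 0        # matches "status: 0" in the logs below
RELEASED = 8   # matches "status: 8" in the logs below

# Hypothetical set of allowed (old_status, new_status) pairs; the point is
# simply that (NEW, RELEASED) must not be in it.
VALID_TRANSITIONS = {
    (NEW, NEW),
    (RELEASED, RELEASED),
    # ... the other legal transitions would be listed here ...
}

def validate(old_status, new_status):
    """Return True only if old_status -> new_status is a legal transition."""
    return (old_status, new_status) in VALID_TRANSITIONS

assert not validate(NEW, RELEASED)   # the transition seen in this bug report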
I need a good test case, so I can dig deeper. Sometimes these problems occur on the opposite side of the protocol than one would expect.
I just created a merge() test case that reproduces my understanding of this situation.
Since the test case works with my current package, either I inadvertently fixed the bug somehow or, more likely, I don't understand the problem yet.
@stonier: Please review the new test case to see if it actually recreates the problem you found. If so, we should try it with your hydro-devel fork to see if it works there, too.
Meanwhile, I'll work on a higher-level test case with separate requester and scheduler nodes. That is a lot more work, so I had hoped to catch it at the merge() level.
The basic problem seems to be that local requester request changes pile up in unexpected places outside the requester feedback function, which can leave the local and received rsets disagreeing when they are compared to each other in the feedback function.
Where are the request changes piling up?
How many threads are there? Are things being done outside the main rospy callback thread?
Hrm, that's a good unit test case. I would have thought that would have exposed the problem.
For the record, I was calling those from a separate thread that was monitoring the state and then doing the cancel/reissue after a certain timeout. I'm also occasionally seeing some other odd behaviour that I don't fully understand yet when making various calls from inside the requester feedback function; since I don't yet understand the implications of what I'm doing, perhaps I'm calling things the wrong way. At any rate, if I can work out what works and what is dangerous, that is useful information.
Have been updating the robotics-in-concert/hydro-devel fork with your latest and greatest so we are testing on the same base... :)
I'll dig deeper this morning and update in a couple of hours.
The rocon_scheduler_requests implementation is definitely not thread-safe if called from separately-created Python threads.
My understanding of the rospy threading model (which may be wrong) is that all topic and timer callbacks run in the main Python thread where rospy.init_node() and rospy.spin() are called.
If rospy.init_node() is called in some other thread, there may be difficulties with signal handling.
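For reference, the usual rospy workaround when init_node cannot run in the main thread is its disable_signals option; a minimal sketch (the node name is just a placeholder):

import threading
import rospy

def ros_main():
    # Python only allows signal handlers to be registered from the main
    # thread, so init_node must opt out of them when called elsewhere.
    rospy.init_node('example_requester', disable_signals=True)
    rospy.spin()

threading.Thread(target=ros_main).start()
# With signals disabled, the application must trigger shutdown itself,
# e.g. rospy.signal_shutdown('done') when it is finished.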
I tried to create a test case for the example scheduler that would make requests similar to the problem reported here.
It does not work.
After step 2, the requests seem to get tangled up in repetitive canceled messages, which never get deleted.
Maybe that's a bug in the test case, but I committed the code so @stonier can look at it, too. The bug could also be in the example_scheduler.py node, which is relatively new and not well-tested.
I'll continue debugging it for a while longer tonight.
I have to catch up on what you've written - it might be related to something I sometimes see in the scheduler: my scheduler's update function seems to get caught in a recursive loop and called over and over (1-2 ms spacing).
I did some parallel debugging to catch the merge problem - here's some output of a single merge operation which goes wrong.
It shows the reconciling of that new request against None, and the resultant released status of that request.
[WARN] [WallTime: 1387417766.804604] Merging Operations:
Update:
id: 949c450c-1d3a-4362-a562-11b5b89150b6
priority: 10
resources:
linux.*.ros.pc.dudettece37c1b32ff243528db562afae3fcab1/rocon_apps/talker
linux.*.ros.pc.unallocated/rocon_apps/listener
linux.*.ros.pc.dude02cd58dcb8314d3db6ccb136e57a3979/rocon_apps/listener
status: 8
Update:
id: 8d7bc6a9-3d81-4e50-96c9-323c9e49c533
priority: 0
resources:
linux.*.ros.pc.*/rocon_apps/listener
status: 8
Deleting:
id: 949c450c-1d3a-4362-a562-11b5b89150b6
priority: 10
resources:
linux.*.ros.pc.dudettece37c1b32ff243528db562afae3fcab1/rocon_apps/talker
linux.*.ros.pc.unallocated/rocon_apps/listener
linux.*.ros.pc.dude02cd58dcb8314d3db6ccb136e57a3979/rocon_apps/listener
status: 7
because of update
id: 949c450c-1d3a-4362-a562-11b5b89150b6
priority: 10
resources:
linux.*.ros.pc.dudettece37c1b32ff243528db562afae3fcab1/rocon_apps/talker
linux.*.ros.pc.unallocated/rocon_apps/listener
linux.*.ros.pc.dude02cd58dcb8314d3db6ccb136e57a3979/rocon_apps/listener
status: 8
Deleting:
id: 8d7bc6a9-3d81-4e50-96c9-323c9e49c533
priority: 0
resources:
linux.*.ros.pc.*/rocon_apps/listener
status: 7
because of update
id: 8d7bc6a9-3d81-4e50-96c9-323c9e49c533
priority: 0
resources:
linux.*.ros.pc.*/rocon_apps/listener
status: 8
Reconciling:
id: d00d61b6-6951-428a-9bc3-e6346f7e1486
priority: 10
resources:
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.dudette/rocon_apps/talker
status: 0
with
None
[ERROR] [WallTime: 1387417766.805344] Merged Request Set:
Request:
id: d00d61b6-6951-428a-9bc3-e6346f7e1486
priority: 10
resources:
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.*/rocon_apps/listener
linux.*.ros.pc.dudette/rocon_apps/talker
status: 8
I need to compare this with the test unit case you have.
The test case in 25d61b49658d68c049af902a0d522fc8ad4e8400 should be doing the same thing - if the behaviour above is correct, then the following line should fail, because your eq operator takes the status flag change into account:
self.assertEqual(rset, RequestSet([msg2], RQR_UUID,
                 contents=ResourceReply))
I'll run your test case and verify.
I used to see similar topic spamming at about that rate when the rset comparison operators had bugs.
Maybe the scheduler callback always thinks the new request is different from the previous one.
Maybe the requester feedback handler is always modifying its request in a way that causes the updated message to ping-pong between two different values.
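One cheap way to check that theory would be to guard the resend on whether the set actually changed; a rough sketch of a method one might drop into the requester for diagnosis (prev_rset and this guard are introduced here purely for illustration, they are not part of the existing code):

import copy
import rospy

def feedback(self, rset):
    # Hypothetical diagnostic: skip the reply if nothing changed since the
    # previous callback (relies on the RequestSet eq operator and assumes
    # the set can be deep-copied).
    if getattr(self, 'prev_rset', None) == rset:
        rospy.logdebug('request set unchanged since last feedback call')
        return
    self.prev_rset = copy.deepcopy(rset)
    # ... normal feedback handling continues here ...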
The same output style from the unit test case shows the merged set with the updated status, confirming the error (status changed from 0 -> 8). For some reason the asserts aren't picking this up.
**************** test_canceled_merge_plus_new_request ****************
Merging Operations:
Update:
id: 01234567-89ab-cdef-fedc-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.roberto/test_rapp
status: 8
Reconciling:
id: 01234567-cdef-fedc-89ab-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.*/test_rapp
status: 0
with
None
Deleting:
id: 01234567-89ab-cdef-fedc-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.roberto/test_rapp
status: 7
because of update
id: 01234567-89ab-cdef-fedc-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.roberto/test_rapp
status: 8
Merged Request Set:
Request:
id: 01234567-cdef-fedc-89ab-ba9876543210
priority: 0
resources:
linux.precise.ros.segbot.*/test_rapp
status: 8
If I am interpreting your output correctly, this looks like a bug in the cancelled request deletion logic, right?
The unit test is giving us a false positive because in
self.assertEqual(rset, RequestSet([msg2], RQR_UUID,
contents=ResourceReply))
it is using msg2, which is a reference to an object that got updated in the merge - at this point it also has status 8.
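In other words (a generic demonstration of the false positive, not the actual test code):

import copy

# Not the real test: just the aliasing pattern that hides the failure.
msg2 = {'id': '01234567-cdef-fedc-89ab-ba9876543210', 'status': 0}   # NEW
expected = copy.deepcopy(msg2)     # snapshot taken before the merge can mutate msg2

msg2['status'] = 8                 # merge() updates the very object the test reuses

assert [msg2] == [msg2]            # false positive: both sides now show status 8
assert [expected] != [msg2]        # the correct comparison exposes the 0 -> 8 change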
The cancel deletions are working fine, I think - it's just that the new request gets modified from NEW (0) to RELEASED (8) when reconciling against 'None'.
Excellent! Thanks for the help. Those changes in the deep structure are tricky.
The unit test does fail when making the correct comparison. I wonder if some of my other test cases have similar bugs.
The fix above is simple; you can apply it to your own branch, or test with an updated version from my master branch.
I had failed to clearly distinguish an important difference between schedulers, which never create new requests, and requesters, which do. Missing requests are significant to the scheduler, but should be ignored by the requester.
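Roughly, the distinction amounts to something like the following sketch; the function and attribute names are illustrative, not the actual package code:

NEW = 0   # status code for a brand-new request, per the logs above

def handle_missing(request, is_requester):
    """Sketch: decide what to do with a locally-held request that the
    incoming message does not mention (names here are illustrative)."""
    if is_requester and request.status == NEW:
        # The requester created this request and the scheduler simply has
        # not acknowledged it yet: ignore the omission and keep it as-is.
        return
    # For a scheduler (or for any non-NEW request), a missing entry is
    # meaningful, so reconcile against None and let the request be released.
    request._reconcile(None)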
Thanks for the help, Daniel. Sorry this problem cost you so much time.
The topic spamming problem with my new example_scheduler test still exists.
The requester and scheduler have gotten out of sync and are sending each other alternating requests:
scheduler callback:
id: e84706dc-b8e7-45b9-b9f8-5d122dce10ba
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 2
scheduler callback:
id: 2a8236eb-769b-41a1-8c0d-26dfdd030212
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 8
id: e84706dc-b8e7-45b9-b9f8-5d122dce10ba
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 2
scheduler callback:
id: e84706dc-b8e7-45b9-b9f8-5d122dce10ba
priority: 0
resources:
linux.precise.ros.turtlebot.marvin/example_rapp
status: 2
I am opening a new issue for that problem, and plan to work on it tomorrow.
Seems to work; going to close this issue, and we can continue in the new issue above.
I seem to be running into trouble in ResourceRequests' _reconcile method.
Usage scenario:
Here it looks like it can get into trouble.
Inside merge, the updates include the releasing -> released updates from the scheduler, but nothing yet from the new request (that will come in the next feedback message). When iterating over the requester's rset, it will call rq._reconcile(None) on the new request sitting in the rset. Inside rq._reconcile, this relegates the new request to RELEASED.
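Paraphrasing the path as I understand it (the requests dict and updates.get() here are just my guess at the internals, not the real source):

# Rough paraphrase of the problem path, not the actual merge() implementation.
def merge(self, updates):
    for rq_id, rq in self.requests.items():
        update = updates.get(rq_id)   # None for the brand-new request, since
                                      # the scheduler has not seen it yet
        rq._reconcile(update)         # with update=None this walks the NEW
                                      # request straight to RELEASED (0 -> 8)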
If I space the cancel and the new request far enough apart (e.g. in a separate thread with enough sleep for the SchedulerRequests to go from the requester to the scheduler and back again), then I don't run into this problem (not a solution suggestion, just verification of the behaviour).