utexas-bwi / rocon_scheduler_requests

Interfaces for managing rocon scheduler requests
http://wiki.ros.org/rocon_scheduler_requests

Make this module thread-safe? #24

Closed: jack-oquin closed this issue 10 years ago

jack-oquin commented 10 years ago

I am not sure this is a good idea, and I am reasonably certain a scheduler can easily be written using only the main rospy thread.

But much of the ROCON code I've looked at seems heavily multi-threaded. I don't yet understand why, but changing it looks difficult.

Would adding mutex locking outside this module be effective? I don't know, so I am opening this issue to make sure we have a good answer.

@stonier, @piyushk: what do you think?

piyushk commented 10 years ago

I'm afraid I'm not up to speed on the code in this repo right now. Will do so next month after the IROS deadline.

In my opinion, ROCON code was written using threads because it was conceptually easier to do so, but certainly some time was spent trying to get the locking right. Threads were a problem during the wireless robustness work, since Python TCP connections hang when the link is lost, and somewhat ironically I ended up adding more threads just to get the existing threads to exit gracefully.

If we do expect users to write schedulers, I would recommend avoiding threads and using the main rospy thread; a template could be provided for this. The core ROCON code is fairly complex, and that complexity should be avoided in code that users are expected to write. Since I only work on ROCON code a few days every month, it takes me longer than I'd like to admit to wrap my head around it each time. I could see the average user being completely lost if asked to write a similarly heavily threaded piece of code.

I would recommend keeping the code as simple as possible.

stonier commented 10 years ago

I think I might be missing something here... Jack wrote in another issue:

> Many applications will do all their work from various rospy timer or topic callbacks. I believe none of them run concurrently; they share the rospy.spin() thread.

As far as I know, nothing happens in rospy.spin(); it just blocks. As for topic callbacks, each topic's subscription gets its own thread, so you should be careful what you do across different subscribers' callbacks. Refer to this ros answers post by Ken; I've also done a cursory scan of rospy's Topic-related classes to confirm this. I've not yet looked at where and how timer callbacks are triggered.
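That per-subscription threading can be illustrated without rospy at all. The sketch below is a toy model (all class and method names are made up, not rospy's actual implementation): each subscription drains its own queue on a dedicated thread, so callbacks for different topics can run concurrently and any state they share needs protection.

```python
import queue
import threading

class MiniSubscriber:
    """Toy model of per-subscription dispatch (NOT rospy's real API):
    each subscription gets its own dispatch thread."""

    def __init__(self, callback):
        self.callback = callback
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._spin, daemon=True)
        self._thread.start()

    def publish(self, msg):
        self._queue.put(msg)

    def close(self):
        self._queue.put(None)      # sentinel: stop the dispatch thread
        self._thread.join()

    def _spin(self):
        while True:
            msg = self._queue.get()
            if msg is None:
                return
            self.callback(msg)     # runs on this subscription's thread

# record which thread each callback ran on
seen = {}
sub_a = MiniSubscriber(lambda msg: seen.setdefault("a", threading.current_thread().name))
sub_b = MiniSubscriber(lambda msg: seen.setdefault("b", threading.current_thread().name))
sub_a.publish("hello")
sub_b.publish("world")
sub_a.close()
sub_b.close()
```

The two callbacks end up on different threads, which is exactly why writes to a shared structure from two subscribers' callbacks are unsafe without a lock.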

I'm happy to be educated about this if you have suggestions - I was a total python newbie when starting the gateway, haven't looked at rospy in all its gory detail (it's even harder to follow than roscpp) and am still far from what I'd call an expert. ;)

I'm wondering if you're talking about the gateway itself, Piyush. Rocon outside the gateway hardly makes any use of threads - there are fewer than a handful of worker threads, used to avoid long service callbacks or for monitoring watchdogs. That particular beast (the gateway) rapidly grew beyond our original design. It might be worth drawing up a redesign to help us finish off the wireless work. I'll get my head back into it this week - I'd like to move past that particular milestone.

stonier commented 10 years ago

+1 for keeping complexity down on user-side requester and scheduler development. Working on that, though, can come after we see it all working and have a few examples to understand where the complexity lies.

piyushk commented 10 years ago

I can confirm Daniel's comment on how rospy.spin() works. Everything in rospy is handled independently in its own thread - that includes timers as well as subscriber callbacks.

jack-oquin commented 10 years ago

Good thing you two are available to straighten me out. I never actually looked at the rospy code, just assumed it worked similarly to roscpp.

So, why have I never noticed a need for locks in normal rospy nodes? Inadequate testing, maybe? Is it just because of the Python Global Interpreter Lock? If I understand the GIL correctly, the Python interpreter only runs code in one OS thread at a time, but it periodically checks for signals and may switch to another Python thread, for example when a message arrives.

That sounds like a serious problem. I'll investigate further, but I think we do need locking in this package.
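The hazard is easy to reproduce with plain threading, no rospy involved. A minimal sketch (the barrier is just there to force the worst-case interleaving deterministically): the GIL serializes individual bytecodes, but a read-modify-write sequence is several bytecodes, and another thread can run in between.

```python
import threading

counter = 0
barrier = threading.Barrier(10)

def unsafe_increment():
    # read-modify-write with the steps pulled apart; the GIL does not
    # make this atomic
    global counter
    local = counter            # every thread reads before any thread writes...
    barrier.wait()
    counter = local + 1        # ...so all ten writes store the same value

threads = [threading.Thread(target=unsafe_increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 1, not 10: nine increments were lost

safe_counter = 0
lock = threading.Lock()

def safe_increment():
    global safe_counter
    with lock:                 # the lock makes the read-modify-write atomic
        local = safe_counter
        safe_counter = local + 1

threads = [threading.Thread(target=safe_increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(safe_counter)  # 10
```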

piyushk commented 10 years ago

@stonier Also, I would be more than happy to help with reducing the gateway complexity. Specifically, I want to cut back on the number of parallel threads if possible, and make it easier to recover from TCP hangs. I will have a lot of time in Feb and March to work on rocon code.

@jack-oquin As far as I know, normal rospy nodes should use locks as necessary. I typically don't use locks unless there is a write-write conflict. This may cause some small inconsistencies in my code, but if I am subscribed to something like odometry, then the errors are typically insignificant.

stonier commented 10 years ago

@piyushk I'll try to draw up what the gateway looks like (as every good computer scientist should do before they program!) this week. I think that will be an easier starting point for sketching plans than getting lost in the jungle of code.

@jack-oquin Didn't know about the GIL. That's a really big performance bottleneck on a multi-core machine. But then we already have a good alternative (c++) and enough reasons not to use python for anything serious with respect to control performance ;)

Another interesting conundrum I unearthed this week while scanning some of Ken's code is in rospy's topics.py. Do a search for the text 'avoid' in that file and you'll find comments where he explicitly states that copying a reference lets him avoid using a lock. I'd love an explanation on that.

jack-oquin commented 10 years ago

> @jack-oquin Didn't know about the GIL. That's a really big performance bottleneck on a multi-core machine. But then we already have a good alternative (c++) and enough reasons not to use python for anything serious with respect to control performance ;)

That is one of the arguments the Python developers make for why the GIL is not a show-stopper.

Back in my Unix days, we had a similar issue with the Big Kernel Lock. It simplified kernel code considerably, but forced the kernel to run essentially single-threaded. We removed it from AIX version 3 in 1990, the first major commercial Unix kernel to do so. That was not easy, and multi-core machines were not common in those days.

> Another interesting conundrum I unearthed this week while scanning some of Ken's code is in rospy's topics.py. Do a search for the text 'avoid' in that file and you'll find comments where he explicitly states that copying a reference lets him avoid using a lock. I'd love an explanation on that.

Ken is a really expert Python programmer.

I believe assignment is an atomic operation in the Python Virtual Machine, so no lock is required if you operate on a copy of the data and then update it with an assignment statement. Of course, without making a deep copy, changes to nested objects may still affect the results. So the logic is actually quite complex.

stonier commented 10 years ago

> I believe assignment is an atomic operation in the Python Virtual Machine, so no lock is required if you operate on a copy of the data and then update it with an assignment statement. Of course, without making a deep copy, changes to nested objects may still affect the results. So the logic is actually quite complex.

I do the copy thing quite often as well. There are some instances I don't understand, though - he's not making a copy of the data, just a copy of the reference, e.g.

def has_connection(self, endpoint_id):
    # save reference to avoid lock
    conn = self.connections
    for c in conn:
        if c.endpoint_id == endpoint_id:
            return True
    return False

Actually, I just worked out why this works. He only ever reassigns the reference for the whole self.connections list (never inserts or deletes items) in other parts of the code. That means the list the saved reference points to is guaranteed not to change, even if self.connections itself is later rebound to a new list.
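The writer side of that idiom can be sketched as follows (a hypothetical class, not Ken's actual code): the writer builds a fresh list and rebinds the attribute in one atomic assignment, so a reader that saved the old reference sees either the complete old list or the complete new one, never a half-mutated one.

```python
import threading

class ConnectionCache:
    """Sketch of the reassign-whole-list idiom (hypothetical names)."""

    def __init__(self):
        self.connections = []            # readers grab a reference, no lock
        self._write_lock = threading.Lock()  # serializes writers only

    def add_connection(self, conn):
        # never mutate self.connections in place: build a fresh list
        # and rebind the attribute in one atomic assignment
        with self._write_lock:
            updated = list(self.connections)
            updated.append(conn)
            self.connections = updated

    def has_connection(self, conn):
        snapshot = self.connections      # save reference to avoid lock
        return conn in snapshot

cache = ConnectionCache()
cache.add_connection("gateway_1")
```

The write lock only coordinates writers with each other; readers never need it, which is exactly the property the comments in topics.py are exploiting.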

So nothing useful here...just some black magic from Ken! We have to be appropriately careful and use locks. The question is where.

jack-oquin commented 10 years ago

Since we won't see much concurrency, anyway, due to the GIL, I prefer coarse-grained locking for rocon_scheduler_requests. Most of it can be done fairly simply with a lock in each Scheduler or Requester instance.

Client code using only those methods should be safe. The lock would be held during callbacks.

I am more concerned about schedulers operating on queued active requests. I don't want to introduce more locks at that level, if I can avoid it.
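A minimal sketch of that coarse-grained scheme (names are illustrative, not the actual rocon_scheduler_requests API): one big lock per instance, held for the duration of every internal callback, and exposed so client code can take the same lock around its own accesses to the shared request set. An RLock is used so a callback can safely re-enter the instance's own methods.

```python
import threading

class BigLockScheduler:
    """Illustrative coarse-grained locking; NOT the real package API."""

    def __init__(self, callback):
        self.lock = threading.RLock()    # public: client code shares this lock
        self.requests = {}               # shared request set
        self._callback = callback

    def _on_message(self, request_id, status):
        # all internal message handling runs with the big lock held,
        # including the user-supplied callback
        with self.lock:
            self.requests[request_id] = status
            self._callback(self.requests)

events = []
sched = BigLockScheduler(lambda reqs: events.append(dict(reqs)))
sched._on_message("req-1", "granted")

# client code touching the shared request set takes the same lock
with sched.lock:
    snapshot = dict(sched.requests)
```

Since the GIL already prevents true parallelism in pure-Python code, a single lock like this costs little and keeps the invariants easy to reason about.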

jack-oquin commented 10 years ago

The previous commit introduces coarse-grained locking for both schedulers and requesters:

These locks are used within the Scheduler and Requester classes. They are also documented and exposed to client code, which should use them when updating shared request set objects.

I considered adding finer-grained locking in the various RequestSet objects, but that would be much more complex. The high-level locks would still be needed, and the finer-grained ones might interact badly with them.

jack-oquin commented 10 years ago

The big lock approach works fine in my experimental scheduler node. I believe the current solution is adequate.

I am closing this issue. If it turns out we need finer-grained locking, we can create a new one.