bartlettroscoe opened this issue 8 years ago
Developers at Tech-X have created a branch with updates to help make RCP thread-safe. It is on the branch 'atomic_rcp' in the Trilinos fork git@github.com:bddavid/Trilinos.git. The file RCP_Summary_02012016.pdf describes the changes and gives some timing studies.
@bddavid, I will pull this branch and then start looking it over and running my own tests. In particular, I need to create some unit tests in the Teuchos package that show the various threading use cases. I will put these in my branch also. I will create a pull-request to allow for a more detailed code review.
The early document ThreadSafeRCP.atomics.pdf gives an example main()
function that demonstrates the failure of thread-safety of the RCP software.
I fixed the debug build of the code and it passed all of the TeuchosCore unit tests. Next I will add some unit tests to check various use cases for thread safety. I think that the code is not thread safe in a debug-mode build when creating new RCP<T>
objects in different threads but the unit tests will show this.
I have outlined below in a little more detail what I have done so far and what needs to be done next. The next task is to write multi-threaded unit tests for various use cases.
Detailed Notes:
I built the TeuchosCore subpackage on the branch thread-safe-rcp-229
(see PR #230) as of commit:
commit 1b1326a0edd4694f78daa050e89c17f8afba865f
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Fri Mar 18 22:50:35 2016 -0400
Fix debug build (Trilinos #229)
The code did not build for -DTrilinos_ENABLE_DEBUG=ON. With this, the
TeuchosCore NIGHTLY test suite passes.
M packages/teuchos/core/src/Teuchos_RCPNode.hpp
and ran the NIGHTLY test suite with:
$ ./checkin-test-fissile4.sh --st-extra-builds= --no-enable-fwd-packages \
--enable-package=TeuchosCore --test-categories=NIGHTLY --local-do-all
and it passed all the tests with:
Build test results:
-------------------
0) MPI_DEBUG => passed: passed=52,notpassed=0 (2.48 min)
1) SERIAL_RELEASE => passed: passed=47,notpassed=0 (1.19 min)
This tests that single-thread behavior is maintained, but there are no automated multi-thread tests at all.
We need to write some multi-threaded unit tests for the Teuchos MM classes to test the thread-safety of all behavior, especially in debug-mode builds, and we need to come up with a good strategy for writing and running these tests. Really, we should run all of the unit tests with multiple threads to make sure that all of this functionality works, but I am afraid that many of these unit tests will need to be specially designed in order to create strong tests for multiple threads. Basically, any memory that can be accessed by more than one thread needs tests where multiple threads read and write to that memory at the same time, and there are several such objects shared between threads that are stored in the base RCPNode object.
There are several test cases that need to be carefully tested. Some of the obvious tests include:
1) Increment and decrement the RC in multiple threads and make sure the RC is correct at the end (this is the test case that Tech-X showed at the end of the document ThreadSafeRCP.atomics.pdf).
2) Allocate new independent RCP objects in multiple threads and deallocate the object in those threads. (This will test if debug-mode RCP node tracing is thread safe, it is likely not right now.)
3) Create singleton to create new shared RCP object to call from multiple threads and then remove reference in multiple threads and make sure the shared object gets deleted cleanly.
4) Create singleton to create new shared RCP object with specialized deallocation policy to call from multiple threads and then remove reference in multiple threads and make sure shared object gets deleted cleanly.
5) Create new object with raw pointer in main thread, give to singleton to create new owning shared RCP object to call from multiple threads and then remove reference in multiple threads but set ownership to false in one thread and make sure that the object does not get deleted after all the threads finish.
6) Create singleton to create new shared RCP object with extra data to delete a dependent object that gets called in destructor then have reference count go to zero in each of the threads.
7) Create a unit test that shows usage of a dangling WEAK RCP in one thread when the underlying object is deleted in other thread with a STRONG RCP object.
The issue is that many of these multi-threaded tests will likely fail given what is there right now. What is most likely to fail are the unit tests where different RCPNode objects are created and destroyed in multiple threads when node tracing is turned on.
And we will also need to add automated tests like these for the Teuchos classes ArrayRCP, ArrayView, Ptr, and Tuple. That could be a lot of work.
@bartlettroscoe I've built https://github.com/bartlettroscoe/Trilinos/tree/thread-safe-rcp-229 and could help with adding unit tests. For starters I can add the Tech-X test case you refer to above.
Would it make sense to create new tickets for the different unit tests you propose?
For starters I can add the Tech-X test case you refer to above.
@d-meiser, that would be fantastic! I created a skeleton executable for doing this. See ???. If you look at the other XXX_UnitTests.cpp
files in that directory, you should see how to add unit tests. Just grab threads as with the Kitware example. Try not to use more than 4 threads for now, or these tests will not run by default on the Trilinos test machines. I can help with the details of running the tests.
Would it make sense to create new tickets for the different unit tests you propose?
I think we can use this ticket. I don't think we need that level of granularity. We are not doing SCRUM, so we don't need to have the story broken up artificially to fit into a "sprint". This is a big story, but I think that all the Teuchos MM classes need to be thread-safe, including in debug mode, in order to provide real value. We may break off some Task tickets if there are a lot of independent streams of dialog.
I pull --rebase'd your branch and see the test skeleton. I'll use that to create the unit tests.
Should I open pull requests to your branch so you can update this PR?
Starting work on the suggested multithread tests and had a few questions:
2) Allocate new independent RCP objects in multiple threads and deallocate the object in those threads. (This will test if debug-mode RCP node tracing is thread safe, it is likely not right now.)
I set up this test and got errors on the node tracing as you indicated would happen - so only multithreaded debug caused the node tracing to be mangled. My first idea is to add a single static mutex around the static addNewRCPNode() and removeRCPNode() functions - I'm not sure yet if it would be better/necessary to make this more efficient and reorganize the internals. Then I found a more subtle issue: node_->delete_obj() can be called and occasionally, before removeRCPNode() can execute, another thread can create a new node at the same memory address and incorrectly think it was already added to the trace.
To fix this I moved (for debug only) the node_->delete_obj() call into removeRCPNode() so it is protected by the mutex. This means delete_obj() and removeRCPNode() must happen as a pair now. I'm not exactly happy about that organization of delete_obj() since it doesn't cleanly parallel release(), so I will think on this, but I felt some feedback might be best before I continue further on that test.
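Conceptually, the fix looks like this (a minimal sketch with made-up names, not the actual Teuchos_RCPNode code):

```cpp
// Sketch of the debug-mode fix: one static mutex guards the tracing set,
// and the object is deleted inside the same critical section that removes
// its tracing entry, so another thread cannot allocate a new object at the
// freed address and collide with a stale entry.
#include <cassert>
#include <mutex>
#include <set>

static std::mutex traceMutex;          // stand-in for the tracing lock
static std::set<void*> tracedNodes;    // stand-in for the node-tracing map

void addNewNode(void* obj) {
  std::lock_guard<std::mutex> lock(traceMutex);
  tracedNodes.insert(obj);
}

template <typename T>
void removeNodeAndDelete(T* obj) {
  std::lock_guard<std::mutex> lock(traceMutex);
  tracedNodes.erase(obj);
  delete obj;  // deletion inside the lock: erase + delete happen as a pair
}

bool traceDemo() {
  int* p = new int(42);
  addNewNode(p);
  removeNodeAndDelete(p);
  std::lock_guard<std::mutex> lock(traceMutex);
  return tracedNodes.empty();
}

int main() {
  assert(traceDemo());
  return 0;
}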
3) Create singleton to create new shared RCP object to call from multiple threads and then remove reference in multiple threads and make sure the shared object gets deleted cleanly.
I think we are talking about having the singleton not 'own' the data so if all threads are done, memory should be deallocated automatically at that time, not when the program ends, but wasn't sure. I started with a setup where I have the threads constantly making requests and deleting so the counter of the shared RCP frequently goes to 0 (deallocates) or goes from 0 to 1 (allocates fresh). Most useful would be any tips on the expected state of that ptr when no thread is using it, to be sure I understand your idea - my first thought was it should end the test as a single weak count (strong = 0, weak = 1). Related to issue 7, I'm not clear yet on exactly what a weak ptr should be allowed to do when all strong counts go away. Also not sure if that was your intention here anyway.
4) Create singleton to create new shared RCP object with specialized deallocation policy to call from multiple threads and then remove reference in multiple threads and make sure shared object gets deleted cleanly.
I saw some examples in the non-multithreaded tests for this, so I expect once I have #3 done, this should follow smoothly.
5) Create new object with raw pointer in main thread, give to singleton to create new owning shared RCP object to call from multiple threads and then remove reference in multiple threads but set ownership to false in one thread and make sure that the object does not get deleted after all the threads finish.
If you can clarify what you mean by setting ownership to false, that would be great. On first inspection it seemed the has_ownership flag was designed to be set in the constructor only - at least from the RCP standpoint. So it was not clear that there was a 'good' way to change ownership of an RCP after construction - I will look into this further.
Also, you indicated the data would be allocated in main, then passed to the singleton. In case 3, for example, I was initially thinking the singleton would handle the allocation. There may be subtle reasons you said it this way which I am missing - or perhaps, if the main idea is to make sure the data is preserved when one ptr gives up ownership, it may not matter for this one.
6) Create singleton to create new shared RCP object with extra data to delete a dependent object that gets called in destructor then have reference count go to zero in each of the threads.
I think the prior unit tests as examples and any insight you have on #3 will give me a fairly clear path on this one.
7) Create a unit test that shows usage of a dangling WEAK RCP in one thread when the underlying object is deleted in other thread with a STRONG RCP object.
I tried having the main thread create an RCP, send a weak ptr of that RCP to a second thread, and then the first strong ptr goes out of scope. Now the second thread can be upset as the ptr will become invalid at some unexpected time. Does this sound like the right direction for this test? Same as #3, just starting to get a feel for what the weak ptr can do (or should never do) when the strong count is 0.
Thanks!
hi Ross - one additional thing I have been doing is building some parallel cases using std::shared_ptr and std::weak_ptr. So for example, I see I can generate an intermittent problem with threads for the #3 singleton case with RCP which does not occur for a parallel case using shared_ptr/weak_ptr - which I will be investigating now. Does it make sense to describe my goal here as: We would like RCP to work for the same situations as shared_ptr and weak_ptr. I am becoming aware of some distinctions (such as RCP holds strong/weak as a pair) but if that idea is generally correct it can help guide me. If there are things shared_ptr/weak_ptr can do in a multithreaded context that we don't want to worry about for RCP, that information would be valuable for me to define the scope of this work. Many thanks.
Michel, I am responding to your larger comment but I will address this question first ...
I had not thought about how strong and weak RCPs would relate in behavior to std::shared_ptr and std::weak_ptr. I suspect that the way RCP is currently set up will need some tweaking so that it has reasonable behavior in a multi-threaded application. The key issue is that with std::weak_ptr, you can't directly access the shared object; you have to create a strong owning std::shared_ptr first. The problem with the current implementation of the Teuchos::RCP classes is that if you access the shared object through a weak RCP, it will do a debug-mode (atomic) assert that the shared object has not been deleted yet, and then return a reference. The problem is that the object could then be deleted by the other thread before the client is finished using it. Therefore, I think in a multi-threaded env, it may not be safe to access the object directly through a weak RCP object (at least not without some major and expensive magic).
For now, don't worry about the behavior of weak and strong RCP objects in a multi-threaded program. After we get basic thread safety working with all strong RCP objects, I can go back and think about how to address weak and strong RCPs in a multi-threaded env. It may be that you can't allow direct access of an object through a weak RCP in a multi-threaded program (if you try, then it will throw an exception).
Thanks for thinking about this ahead of time!
hi - That makes sense. One thing I am struggling with is to determine exactly the scope (what things we need to solve and what things we don't). I think your responses on the specific tests will be most helpful so I clearly understand the purpose of the test and as we move along I'll have a better intuition on what is needed.
(Edited to make more concise) I suppose looking ahead we would need the equivalent of weak_ptr.lock(), which allows conversion to a shared_ptr, or returns null if the object is not available. That would need thread protection and would also hurt performance.
However it sounds like that is an aside from our main goal and for the initial tests I will definitely follow your lead. Thanks.
Starting work on the suggested multithread tests
Good deal!
2) Allocate new independent RCP objects in multiple threads and deallocate the object in those threads. (This will test if debug-mode RCP node tracing is thread safe, it is likely not right now.)
I set up this test and got errors on the node tracing as you indicated would happen - so only multithreaded debug caused the node tracing to be mangled. My first idea is to add a single static mutex around the static addNewRCPNode() and removeRCPNode() functions - I'm not sure yet if it would be better/necessary to make this more efficient and reorganize the internals.
A mutex is fine in debug-mode only code. We don't care so much about the runtime performance of debug-mode code. We would have to be very sloppy to make the Teuchos MM classes run as slowly in debug-mode as using a tool like valgrind. And since these functions only get called when the underlying object is created or destroyed, the overhead of a mutex should not be so bad compared to the dynamic memory allocation/deallocation work.
Then found there was a more subtle issue where node_->delete_obj() can be called and occasionally, before removeRCPNode() can execute, another thread can create a new node in the same memory space and incorrectly think it was already added to the trace.
That is interesting. What use cases does that happen for? What unit tests fail?
To fix this I moved (for debug only) the node_->delete_obj() call into the removeRCPNode() so it is protected by the mutex. This means delete_obj() and removeRCPNode() must happen as a pair now. I’m not exactly happy about that organization of delete_obj() since it doest cleanly parallel release, so will think on this, but felt some feedback might be best before I continue further on that test.
We can look at how to refactor this more cleanly once we get all of the tests working. Initially, I don't mind if it is a little messy.
3) Create singleton to create new shared RCP object to call from multiple threads and then remove reference in multiple threads and make sure the shared object gets deleted cleanly.
I think we are talking about having the singleton not ‘own’ the data so if all threads are done, memory should be deallocated automatically at that time, not when the program ends, but wasn’t sure.
I think for this to be a good test, the singleton would need to give up ownership after all of the clients were given an RCP to the shared object. What we want this test to do is make sure that a shared object gets deleted correctly when multiple threads all race to delete it at the same time (i.e. remove their reference counts, not actually call 'delete').
I started with a setup where I have the threads constantly making requests and deleting so the counter of the shared RCP frequently goes to 0 (deallocates) or goes from 0 to 1 (allocates fresh). Most useful would be any tips on the expected state of that ptr when no thread is using it, to be sure I understand your idea - my first thought was it should end the test as a single weak count (strong = 0, weak = 1).
Another way to accomplish this without a singleton might be to just create the object with a strong RCP in the main thread, and then give the RCP to a bunch of slave threads and then the main thread would set its copy to null then wait on the slave threads which will remove their references. I think that would create the race to remove references and a race to delete between threads. Does that make sense?
Related to issue 7, I’m not clear yet on exactly what a weak ptr should be allowed to do when all strong counts go away. Also not sure if that was your intention here anyways.
In normal code, if the strong count goes to 0, then the managed object would be deleted. The semantics of weak RCPs are not well defined in threaded code (see my previous comment).
4) Create singleton to create new shared RCP object with specialized deallocation policy to call from multiple threads and then remove reference in multiple threads and make sure shared object gets deleted cleanly.
I saw some examples in the non-multithreaded tests for this so I expect once i have #3 done, this should follow smoothly.
Yes, this is really the same as case-3, just with a custom deallocation policy.
5) Create new object with raw pointer in main thread, give to singleton to create new owning shared RCP object to call from multiple threads and then remove reference in multiple threads but set ownership to false in one thread and make sure that the object does not get deleted after all the threads finish.
If you can clarify what you mean by setting ownership to false, that would be great. On first inspection it seemed the has_ownership flag was designed to be set in the constructor only - at least from the RCP standpoint. So it was not clear that there was a 'good' way to change ownership of an RCP after construction - I will look into this further.
Teuchos::RCP has the ability to allow any client to remove the RCP's ownership (i.e. its responsibility to delete the shared managed object) by calling the member function release().
This was added 15 years ago or so to address some really messed up applications that had really ugly memory management. This allows you to change your mind about how the object gets deleted after the first RCP object is created (you can't do that with std::shared_ptr). But you pay the price of the storage of an extra bool in the RCPNode object to allow that.
Also, you indicated the data would be allocated in main, then passed to the singleton. In case 3, for example, I was initially thinking the singleton would handle the allocation. There may be subtle reasons you said it this way which I am missing - or perhaps, if the main idea is to make sure the data is preserved when one ptr gives up ownership, it may not matter for this one.
It does not matter how the initial allocation is done. We just want to test the race condition to set ptr=null in each thread and make sure that the object gets deleted correctly, even in debug-mode with tracing.
6) Create singleton to create new shared RCP object with extra data to delete a dependent object that gets called in destructor then have reference count go to zero in each of the threads.
I think the prior unit tests as examples and any insight you have on #3 will give me a fairly clear path on this one.
Similar to case-3 and case-4, except let an "extra object" do the deletion. Support for "extra data" was added about 14 years ago in order to address some very ugly use cases with badly structured C++ code that existed at the time.
7) Create a unit test that shows usage of a dangling WEAK RCP in one thread when the underlying object is deleted in other thread with a STRONG RCP object.
I tried having the main thread create an RCP, send a weak ptr of that RCP to a second thread, and then the first strong ptr goes out of scope. Now the second thread can be upset as the ptr will become invalid at some unexpected time. Does this sound like the right direction for this test? Same as #3, just starting to get a feel for what the weak ptr can do (or should never do) when the strong count is 0.
Because of the uncertainty about the behavior of weak RCPs (as described above), I think we can hold off on this unit test for threaded code.
One thing I am struggling with is to determine exactly the scope (what things we need to solve and what things we don't).
Since we are conducting this work in an agile way, the scope does not need to be fixed. We will just do the highest priority work and then see if there is time and funding to do more.
Thanks - I am going to be working on this now and will respond today once I can try your ideas. It sounds like the 2nd test case for node tracing is on track and I think these notes can guide me well.
You asked about the node->delete_obj() and which test cases it fails for. To clarify, my interpretation was that this was a multithreading issue only - so initially I only looked at that new test I made which allocates and deletes nodes constantly in multithreading/debug mode. I assume the other tests I will be adding would also trigger this issue in debug mode. I think the fix resolves it but have not extensively tested yet and will keep you updated. Please let me know if that does not address your question.
hi - To answer your question more explicitly: all tests seemed fine in debug with or without the new changes, except this one new test I created. The new test can have random memory problems, so it only seems OK with the new fixes (the mutex, and moving delete_obj() into the mutex lock). So I am concluding this particular fix is only relevant to Debug + Multithreading situations - and I think that makes sense for this situation.
So I am concluding this particular fix is only relevant to Debug + Multithreading situations - and I think that makes sense for this situation.
Yes, of course. Thanks for pointing that out.
hi - here's an update.
Working on issue #3. I set it up as you suggested - a single RCP is passed to several threads, then set to null in the main thread (while the others are still running). When the threads stop, they can cause a race condition and skip deleting the final object. It's rare, and in my current setup I often loop the entire test thousands of times before it hits - so I can try to optimize it. But I suspect we may have several rare race conditions to deal with in the delete path.
The most common one I hit, and am investigating first, is probably mostly happening in the unbind() function. Two threads race: initially the strong count is 2, they both decrement together so the count is 0, they both trigger unbindOne(), then both increment (count is 2), then they enter unbindOne() and skip the delete, which looks for count 1. It seems we can also have a double delete in some cases, but I am still figuring it out.
I think deincr_count() can race internally, and maybe it would be better not to have it return anything so we always check the count as a separate inline. But I think generally there is a deeper problem here which will require some control scheme to make sure the delete process is controlled by a single owner. I am still working to figure out the issues and will try to make at least one solution, if not the best, to convince myself I've found all the problems.
Also a philosophical question: For a system like this would it ever be desirable to ignore a rare crash or conflict issue in favor of efficiency? I can imagine that for normal use multithreaded cross over would occur in a way to make these errors rare and perhaps exceptionally rare.
Exceptionally rare is not a very nice argument if you want to run for a month on a million cores. Even if it would only fail every 100 million thread-hours, that means you get this thing crashing every 4 days. So I believe this has to be absolutely thread-safe, no shortcuts allowed. We also want it efficient in release builds.
I agree with @crtrott. Failure of our software in the field reflects exceptionally badly on the lab and the project; worse, because it cannot always be detected, it creates fire-drill situations which pull in many people to hunt for the potential bugs. In my view, everything we do should be prioritized to provide robust software before performance. Being able to depend on a piece of software is frequently the most important thing users tell us they look for when deciding what can/cannot be used with their code.
Yes, the thread-safe reference counting has to be 100% correct and robust and can never fail. If that requires some refactoring or even breaking some backward compatibility, then that is what needs to happen. For example, as mentioned above, that may mean that we can't allow direct object access through a WEAK RCP object. However, I can think of an idiom for usage of a possible WEAK RCP that will be fairly efficient and still allow for transparency as to whether an RCP is WEAK or STRONG (as opposed to std::shared_ptr and std::weak_ptr, which are more rigid). (Basically, you could create an ActiveRCP stack-based class that would create a WEAK RCP when the stored RCP was WEAK but would directly use the stored RCP when it was STRONG.)
But if we break backward compatibility, then we can't make this the default until the Trilinos 13.0 release. There are a variety of ways to do that (ifdef at configure time, or template argument to RCP objects, etc.). But on this branch we just need to make the Teuchos MM classes 100% thread safe then we will refactor this to address these other concerns. Once we have passing strong tests, such refactoring should be pretty straightforward.
I'm completely on board and want to get a broader perspective on how the addition of multithreaded safety changes the design of these projects. I expect it's desirable to keep direct sharing of memory as high-level as possible, thinking from a user perspective. I'm curious whether turning off a thread safety option to trade performance for risk would ever come up in any scenario (perhaps in a testing scenario for a research project). But it's just philosophical, and from a practical point of view I consider my task to find perfect logical solutions to these problems. Ross, I'll need to reflect on your statement about weak/strong some more, but all these points are very useful. Thanks.
Also a philosophical question: For a system like this would it ever be desirable to ignore a rare crash or conflict issue in favor of efficiency? I can imagine that for normal use multithreaded cross over would occur in a way to make these errors rare and perhaps exceptionally rare.
The strategy would be to make it correct first (without using a mutex everywhere), and then refactor to make it faster once we have all the strong unit tests in place. I think it would be okay initially to put in a mutex when the RCP decided it wanted to delete the shared object. That mutex would be amortized against the dynamic memory management and most programs should not be constantly creating and deleting little objects so that overhead might be acceptable (but timing will tell). What we can't afford is a mutex for every reference count update (which was the initial proposal).
I'm curious whether turning off a thread safety option to trade performance for risk would ever come up in any scenario
I think there might be cases. To support that, we might have an optional template argument that determines if the MM objects are thread-safe or not. The default (at Trilinos 13.0) should be thread-safe, but we could allow some client code to create non-threadsafe MM objects if they know their usage is not going to be involved in multi-threading. That will complicate the design, testing, and maintenance of the Teuchos MM classes, but it might be worth it. We will have to see.
But it's just philosophical and from a practical point of view, I consider my task to find perfect logical solutions to these problems.
That is a good goal but often it takes many iterations to get there. The first iteration should be just to make the reference counting 100% correct and robust with very strong unit testing. Then in later iterations we can refactor to make faster and clean up the code. With the strong unit tests for thread-safety in place, these refactorings will be very safe to perform.
Why can't we just base something off the reference counting scheme of the new view implementation (which doesn't use a singleton scheme with centralized reference count objects)? That scheme is fairly fast and well tested.
Why can't we just base something off the reference counting scheme of the new view implementation (which doesn't use a singleton scheme with centralized reference count objects)? That scheme is fairly fast and well tested.
Does that scheme have node tracing implemented with debug-mode checking for dangling references? That is a big part of helping developers and users debug problems with their usage of the MM classes. In an optimized build, there is no node tracing (and therefore no checking for invalid usage). That is the main value of Teuchos RCP over std::shared_ptr. For example, there are no tools in std::shared_ptr to help you track down circular references between objects.
Why can't we just base something off the reference counting scheme of the new view implementation (which doesn't use a singleton scheme with centralized reference count objects)?
We could do that, as long as we don't actually make Teuchos have a required dependency on Kokkos. Other projects that do not currently require C++11 depend on Teuchos and use Teuchos::RCP, etc.
Also a philosophical question: For a system like this would it ever be desirable to ignore a rare crash or conflict issue in favor of efficiency? I can imagine that for normal use multithreaded cross over would occur in a way to make these errors rare and perhaps exceptionally rare.
Can you compute a probability, and the uncertainty in the probability? Remember that this is a function of
If you can't, then don't say "normal use" or "rare."
Making progress and resolved test 3. Now that I understand the flow better this can be resolved (for strong count only or weak count only) with some trivial restructuring. Still have an issue with mixed weak/strong counts which I hope to be working on soon.
I have a more general question about unit tests and acceptable timing. I've been trying to optimize these tests so they can trigger the race conditions more efficiently. Spin locking all the threads and releasing them simultaneously helps a lot but I'm still looking at several seconds to reliably trigger this (on my mac). I was wondering if spin locking all the test threads should be definitely avoided for any reason?
As has been pointed out, this will be architecture dependent as well. It may be difficult to make the test work in a suitable time frame. For now I am not going to worry about this too much and progress with getting things in place, but I am interested in how we may prefer to handle tests if they require longer run times to trigger a fail event.
Thanks.
Making progress and resolved test 3. Now that I understand the flow better this can be resolved (for strong count only or weak count only) with some trivial restructuring.
Excellent!
I have a more general question about unit tests and acceptable timing. I've been trying to optimize these tests so they can trigger the race conditions more efficiently. Spin locking all the threads and releasing them simultaneously helps a lot but I'm still looking at several seconds to reliably trigger this (on my mac). I was wondering if spin locking all the test threads should be definitely avoided for any reason?
Don't worry about how long they take right now, as long as it is not excessive (i.e. a minute or more). We have different test categories that we can put on tests so we may only run the more expensive tests in NIGHTLY testing (see the CATEGORIES NIGHTLY).
As has been pointed out, this will be architecture dependent as well. It may be difficult to make the test work in a suitable time frame. For now I am not going to worry about this too much and progress with getting things in place, but I am interested in how we may prefer to handle tests if they require longer run times to trigger a fail event.
If we need to, we can tweak the parameters for the most detailed testing based on the architecture (CMake can tell us something about this) or we can just run the most detailed testing on a few known nightly testing machines to avoid guessing.
Update on RCP Unit Tests:
If we expect different threads to call set_extra_data() on the same RCP it should be fairly easy to set up a unit test to show this will break and probably not difficult to fix this if it's not performance critical. Please let me know your thoughts on that.
So the summary is that the first set of unit tests is implemented, fixes are in place for all but the weak ptr, and I will work on parallel-style tests for the related classes now.
hi Ross - I have a few ideas about how we can solve threading issues with weak RCP that I would like to run by you. I'll continue to work on tests for strong RCP only, but just thinking ahead to how the structure may end up:
I would summarize the problem like this:
- We have two independent atomic counters (strong and weak), both of which can trigger data deletion (strong can delete the object; strong or weak can delete the node).
- We need to avoid race conditions between the two counters, but we don't want to cause any performance problems (a mutex wrapping both, for example).
- The delete process currently requires proper ordering: first the object, then the node.
1st proposal: Change the counters to strong and total (instead of strong and weak). I think this is a fairly easy change and would solve problems determining when total becomes 0 as a unique event. Weak is now calculated (total - strong) but this may be preferable because weak count is not directly tied to data (total is the important value).
This still leaves the problem of race conditions causing the deletion of the node and the deletion of the object to overlap. In the current form we must delete the object first, then the node, but I think enforcing that order must create some type of performance penalty.
2nd proposal: A solution would be to decouple the deletion of object and node so they can happen simultaneously and independently. This would require saving local copies of relevant ptrs before checking for a strong decrement, then using those immediately if the counter decrements to 0. But I think the following scheme will not impact performance:
Proposed unbind() method for strong RCP (weak RCP only needs to do step 3):

```
(1) save a local copy of the relevant ptrs (ptr_, the extra data map, dealloc_, and debug stuff)
(2) if (--strong_count == 0) delete_obj(using the local ptrs);  // deleting object
(3) if (--total_count == 0) delete node;                        // deleting node
```
I'd have to change dealloc to a ptr to make copying the temporary local value efficient - I think that might be ok but need to dig into that code more.
The approach of step 1 could potentially be too messy and I need to think about the best way to organize it - but it may also be very efficient, allowing weak/strong race conditions to be resolved without creating any priority roadblocks.
For weak-to-strong conversion, it would be based only on changing one counter (strong) since total is not changed. So we want an atomic operation equivalent to:

```
if (strong == 0)
  return false;  // this is like weak_ptr returning null when we try to convert to shared_ptr
++strong;        // success - we can now switch our state from weak to strong
return true;
```
I do not think there is a straightforward way to implement this operation using any single available atomic operation, but I am learning more about them. Perhaps we could first get `int count = strong_count;` and, if count is 0, the conversion is considered failed; if it is not 0, try `if (strong_count.compare_exchange_strong(count, count + 1)) { success! }` to see if we successfully incremented a non-zero value, meaning we had definite success. That could be looped several times, as weak-to-strong conversion might not ever be performance critical, but it would still not be guaranteed to succeed if the strong count was being changed rapidly ... I am interested in learning the most elegant solution to this problem.
So I will continue to think about this but am interested in your feedback on this general approach and any red flags regarding special scenarios I have not thought about yet.
If we did like this scheme, I think it will be fairly easy to implement and try out. For now continuing with strong RCP only.
For RCP and ArrayRCP, for example, some unit tests might be considered redundant if they test the same underlying mechanism.
Would we prefer to have all the parallel tests implemented for each class? Or would we rather only add unit tests that stress test new mechanisms?
I don't want to muck up the test pipeline with duplication but to be prudent, perhaps having a full parallel set of tests for each would be better.
For RCP and ArrayRCP, for example, some unit tests might be considered redundant if they test the same underlying mechanism.
Yes, that is true. Because these classes share the same underlying reference-counting machinery, we really only need detailed multi-threaded unit tests with Teuchos::RCP and then just some basic multi-threaded tests with Teuchos::ArrayRCP just to make sure we did not miss anything major.
Would we prefer to have all the parallel tests implemented for each class? Or would we rather only add unit tests that stress test new mechanisms?
I don't think we need to duplicate all of the unit tests in the multi-threaded case. We only need multi-threaded tests for the use cases that are most likely to show bugs.
I don't want to muck up the test pipeline with duplication but to be prudent, perhaps having a full parallel set of tests for each would be better.
It would be a lot of work to make all of the existing unit tests work correctly in the multi-threaded case or to at least make them test something meaningful with multiple-threads over the single-threaded case.
I will respond to your comments next. Sorry for the delay.
Update on RCP Unit Tests:
...
6 - RCP set_extra_data. This is also fine but only if we assume that the set_extra_data is defined when the RCP is allocated. I think this may be the intended use?
You typically set extra data after the initial RCP and RCPNode is created. This is usually used to tack on extra objects that need to be freed when the main managed object is freed (either before or after the managed object is freed).
If we expect different threads to call set_extra_data() on the same RCP it should be fairly easy to set up a unit test to show this will break and probably not difficult to fix this if it's not performance critical. Please let me know your thoughts on that.
That is a good point. The first time someone adds extra data, the pointer to the std::map is null. But after the first bit of extra data is added, the std::map object gets created. It seems that it is safe to read from a std::map in multiple threads but not to write to one. Since set_extra_data() should be called only very rarely, I think it is acceptable to use a mutex to protect this.
7 - Dangling weak ptrs - I set up a simple test to show a weak and strong ptr to the same RCP can cause trouble if they are deleted in different threads. However you indicated we may want to wait on this so I will proceed with the other classes for the moment, but am thinking about some ideas for how to best resolve this.
I think it is okay to address basic reference-counting issues with runtime weak and strong RCP objects. The issue that I want to put off, for now, is directly accessing the shared object from a weak RCP. That, I am afraid, is going to require breaking backward compatibility and perhaps adding a new Teuchos::WeakRCP class.
So the summary is that the first set of unit tests is implemented, fixes are in place for all but the weak ptr, and I will work on parallel-style tests for the related classes now.
Great, sounds good!
@MicheldeMessieres, sorry for the long delay in responding ...
hi Ross - I have a few ideas about how we can solve threading issues with weak RCP that I would like to run by you. I'll continue to work on tests for strong RCP only, but just thinking ahead to how the structure may end up:
I would summarize the problem like this:
- We have two independent atomic counters (strong and weak), both of which can trigger data deletion (strong can delete the object; strong or weak can delete the node).
- We need to avoid race conditions between the two counters, but we don't want to cause any performance problems (a mutex wrapping both, for example).
- The delete process currently requires proper ordering: first the object, then the node.
1st proposal: Change the counters to strong and total (instead of strong and weak). I think this is a fairly easy change and would solve problems determining when total becomes 0 as a unique event. Weak is now calculated (total - strong) but this may be preferable because weak count is not directly tied to data (total is the important value).
That is a good idea. So to increment or decrement the weak count, you just adjust the total counter. Nice!
This still leaves the problem of race conditions causing the deletion of the node and the deletion of the object to overlap. In the current form we must delete the object first, then the node, but I think enforcing that order must create some type of performance penalty.
The tricky part is that in order to delete the object, you may have to get data out of the node (the Dealloc object or the extra data). So in that sense, it has to be ordered.
2nd proposal: A solution would be to decouple the deletion of object and node so they can happen simultaneously and independently. This would require saving local copies of relevant ptrs before checking for a strong decrement, then using those immediately if the counter decrements to 0. But I think the following scheme will not impact performance:
Proposed unbind() method for strong RCP (weak RCP only needs to do step 3):

```
(1) save a local copy of the relevant ptrs (ptr_, the extra data map, dealloc_, and debug stuff)
(2) if (--strong_count == 0) delete_obj(using the local ptrs);  // deleting object
(3) if (--total_count == 0) delete node;                        // deleting node
```
I'd have to change dealloc to a ptr to make copying the temporary local value efficient - I think that might be ok but need to dig into that code more.
I don't think we can manipulate the Dealloc object with a pointer. We need to just copy it by value.
I think it is okay to make the final deletion (of either the object or the node) a little more expensive, perhaps using a mutex since it does not happen very often (just once per RCPNode object). I suspect the runtime performance for that should not be too bad.
The approach of step 1 could potentially be too messy and I need to think about the best way to organize it - but it may also be very efficient, allowing weak/strong race conditions to be resolved without creating any priority roadblocks.
For weak-to-strong conversion, it would be based only on changing one counter (strong) since total is not changed. So we want an atomic operation equivalent to:

```
if (strong == 0)
  return false;  // this is like weak_ptr returning null when we try to convert to shared_ptr
++strong;        // success - we can now switch our state from weak to strong
return true;
```
I do not think there is a straightforward way to implement this operation using any single available atomic operation, but I am learning more about them. Perhaps we could first get `int count = strong_count;` and, if count is 0, the conversion is considered failed; if it is not 0, try `if (strong_count.compare_exchange_strong(count, count + 1)) { success! }` to see if we successfully incremented a non-zero value, meaning we had definite success. That could be looped several times, as weak-to-strong conversion might not ever be performance critical, but it would still not be guaranteed to succeed if the strong count was being changed rapidly ... I am interested in learning the most elegant solution to this problem.
So I will continue to think about this but am interested in your feedback on this general approach and any red flags regarding special scenarios I have not thought about yet.
I am not sure what the best approach is to solve this since I don't personally have much experience with threading or with C++ threading. But the one requirement that we would like to have is that simply incrementing or decrementing the strong or weak count should involve only atomics and not mutexes. But I think it is okay, at least initially, to use a mutex when it is clear that the object and/or the node are going to be deleted. But we don't want to involve a mutex in the check to see if the strong or weak/total count has gone to zero.
For example, if you see that strong==0, then you acquire a mutex, step in, and then check to see whether the object has already been deleted by another thread (in which case the object ptr will be null). If the object ptr is non-null, then you delete it, set it to null, and release the mutex. However, if the object ptr is null, then you just release the mutex and return. I think that with most object or node deletions, that would involve just one mutex lock on one thread. But if there was an active race, then more mutex locks. I am not sure exactly how to implement that with C++11 but something like that should be possible.
If we did like this scheme, I think it will be fairly easy to implement and try out.
Some experimentation is not bad. Just use side branches for these side experiments. That way, if we like one of them, we can just merge. And if we don't, we just never merge.
For now continuing with strong RCP only.
Sounds good @MicheldeMessieres. Thanks again for taking this on!
Thanks Ross,
To clarify, were you thinking of a static mutex? I thought this might be too slow for release. Maybe we don't expect much simultaneous traffic.
Or a member mutex on the node? That can handle deletion of the object but is problematic for proper deletion of the node itself.
If you think a static mutex is ok, then that should simplify several problems.
To clarify, were you thinking of a static mutex? I thought this might be too slow for release. Maybe we don't expect much simultaneous traffic.
Or a member mutex on the node? That can handle deletion of the object but is problematic for proper deletion of the node itself.
If you think a static mutex is ok, then that should simplify several problems.
I think it is worth starting out with a static mutex on the deletion only and seeing what the time hit is. Since this should only be called once or twice at most per managed object, perhaps it will not be too bad? The issue is that we can't afford to use a mutex or anything that expensive on the counter increment and decrement operations. We know that will have a significant impact on performance in some cases. For some managed objects, it is not out of the question for the reference counts to be incremented and decremented hundreds of times. We could add instrumentation to see how much the reference counts are incremented (actually, gprof could do that) in a larger program, but this will be program specific.
Would it be possible to use an atomic pointer for the managed object?
http://stackoverflow.com/questions/26787086/atomic-pointers-in-c-and-passing-objects-between-threads
http://en.cppreference.com/w/cpp/atomic/atomic
That might help to avoid a mutex for the deletion of the managed object.
Actually, it would be interesting to study how std::shared_ptr or boost::shared_ptr handles these situations. How does it avoid using mutexes on deletion? I just wish I knew more about threaded programming with C++11 and, more importantly, common idioms for threaded programming with C++11.
hi Ross,
I think I've got the solution now - no mutex or additional parameters.
First we change atomics to strong and total count (not strong and weak) as we discussed.
Then strong unbinds like this:

```
if (--strong == 0)
  delete object;
if (--total == 0)
  delete node;
```

And weak unbinds like this:

```
if (--total == 0)
  delete node;
```
I had convinced myself there was a race condition where node deletion could happen at the same time as object deletion for mixed weak/strong. But I think I was wrong - it's impossible for the node deletion to run while the object deletion is in progress if we do things in that order, because strong preserves 1 for total as well. Of course, I may realize I am wrong in 10 minutes... but I think it's a pure solution.
So I'm excited to try that out on Monday and will let you know if weak/strong race is all cured.
I anticipate one minor issue: We reincrement before deleting so that a throw event will leave the count as if it was not deleted. I assume this is a debug only thing related to tracing but wasn't clear on the necessity of doing that and didn't investigate yet. That could break the weak to strong conversion where the weak could incorrectly think strong was available. But if it's just needed for debug we could simply add a flag to RCPNode which says bInvalid and then the weak would check that before converting.
I will check into underlying mechanism for the shared_ptr or boost. Now that I am better understanding the direct relationship between what RCP is doing and what those objects do, it does seem that could be most helpful.
I know there are some different methods like linked lists or shared counter object (as we are doing). I think shared_ptr is based just on the decrement of the counters so could be equivalent to the proposed idea above. I'll see if I can find out how the mechanic works for converting weak to strong. There are probably all kinds of subtle compiler-dependent issues which I'm unaware of.
I'd like to try something like this for weak to strong conversion:
```
int check = strong;
if (check == 0) {
  // we definitely failed
} else if (atomic_compare_exchange_strong(strong, check, check + 1)) {
  // we definitely succeeded
} else {
  // we can't be sure and must loop this code again
}
```
If you don't object I can prototype the new RCPWeak we discussed. I think the idea you have is that it would be identical to RCP but have no direct ptr_ shared access. So no operators for access etc. Basic use would be to call RCP = RCPWeak.create_strong() which may fail.
If I create a unit test for conversion using create_strong() where the last strong may or may not exist, it might be useful to validate the basic plan for counters is not going to have a tragic flaw later.
I think I've got the solution now - no mutex or additional parameters.
Excellent!
First we change atomics to strong and total count (not strong and weak) as we discussed.
The one disadvantage of changing the weak count to a total count is that now it will take two increments and decrements to adjust the strong count. That will impact performance. Also, will that make the increment of strong no longer atomic? I mean, you will have to increment strong and then total (or in the reverse order).
Then strong unbinds like this:

```
if (--strong == 0)
  delete object;
if (--total == 0)
  delete node;
```

And weak unbinds like this:

```
if (--total == 0)
  delete node;
```
I had convinced myself there was a race condition where node deletion could happen at the same time as object deletion for mixed weak/strong. But I think I was wrong - it's impossible for the node deletion to run while the object deletion is in progress if we do things in that order, because strong preserves 1 for total as well. Of course, I may realize I am wrong in 10 minutes... but I think it's a pure solution.
So I'm excited to try that out on Monday and will let you know if weak/strong race is all cured.
Sounds like worth trying.
I anticipate one minor issue: We reincrement before deleting so that a throw event will leave the count as if it was not deleted. I assume this is a debug only thing related to tracing but wasn't clear on the necessity of doing that and didn't investigate yet. That could break the weak to strong conversion where the weak could incorrectly think strong was available. But if it's just needed for debug we could simply add a flag to RCPNode which says bInvalid and then the weak would check that before converting.
I am thinking that we should likely require that exceptions not be thrown inside of destructors. It is bad form to throw inside of any destructor and, typically, the decrement and delete occur inside of a destructor. I know that there is a unit test that shows that RCP can gracefully handle exceptions inside of a destructor, but supporting that use case gracefully is likely not worth complicating the thread-safe implementation.
I will check into underlying mechanism for the shared_ptr or boost. Now that I am better understanding the direct relationship between what RCP is doing and what those objects do, it does seem that could be most helpful.
When I looked years ago, it was using spin locks in several places. My plan for making the Teuchos MM classes thread-safe was to study what boost::shared_ptr did and then pick the C++11 features that largely duplicated that. Remember that boost::shared_ptr was written before the C++11 standard came out so they had to use platform-specific primitives.
I know there are some different methods like linked lists or shared counter object (as we are doing). I think shared_ptr is based just on the decrement of the counters so could be equivalent to the proposed idea above. I'll see if I can find out how the mechanic works for converting weak to strong. There are probably all kinds of subtle compiler-dependent issues which I'm unaware of.
The boost::shared_ptr implementation I looked at was definitely using a shared node. My guess is that boost::shared_ptr was likely copied into the std implementation in GCC, etc.
I'd like to try something like this for weak to strong conversion:
```
int check = strong;
if (check == 0) {
  // we definitely failed
} else if (atomic_compare_exchange_strong(strong, check, check + 1)) {
  // we definitely succeeded
} else {
  // we can't be sure and must loop this code again
}
```
If you don't object I can prototype the new RCPWeak we discussed. I think the idea you have is that it would be identical to RCP but have no direct ptr_ shared access. So no operators for access etc. Basic use would be to call RCP = RCPWeak.create_strong() which may fail.
It is okay to prototype this. But can we call this WeakRCP instead of RCPWeak?
However, before we do anything major with weak and strong pointers, we should likely talk over Skype or something. Runtime-WEAK RCP (and ArrayRCP) objects play an important role in helping to detect dangling references in debug-mode checks and I don't want to lose that when we add thread-safe support. I have some ideas about how we can make accessing the underlying object through a runtime-WEAK RCP thread-safe in a debug-mode check and throw exceptions in case a race to delete occurs. It will be complicated but, if we can do this, it will provide a huge advantage over std::shared_ptr and related classes that can't do anything like this. Anyway, this will be something worth talking through.
If I create a unit test for conversion using create_strong() where the last strong may or may not exist, it might be useful to validate the basic plan for counters is not going to have a tragic flaw later.
Yes, we need that unit test.
hi - I'll focus on testing that the counters are working and just prototype the weak->strong conversion in the current RCP class so I don't get ahead of myself. I can get that weak/strong unit test set up just using that. I think we'll have a chance to talk Thursday or Friday when I come out, so hopefully I'll have all the basic mechanics working and then we can talk about these points in detail.
When I looked years ago, it was using spin locks in several places. My plan for making the Teuchos MM classes thread-safe was to study what boost::shared_ptr did and then pick the C++11 features that largely duplicated that. Remember that boost::shared_ptr was written before the C++11 standard came out so they had to use platform-specific primitives.
Last time we looked, boost::shared_ptr used platform-specific primitives for atomic integer increment / decrement, where available.
Note that Boost uses a separate type (weak_ptr) for weak references, and does not have RCP's debug mode tracking.
It sounds like we want to follow boost as closely as possible without losing our tracing options so I am looking at the boost implementation to get guidance.
It seems they do keep a strong and weak count stored together, but the cool trick they use is that weak is bumped +1 whenever strong is not 0.
Here is some boost code for the reference counting node base class:
```cpp
class sp_counted_base {
  atomic_int_least32_t use_count_;   // #shared
  atomic_int_least32_t weak_count_;  // #weak + (#shared != 0)
};
```
This seems much better than my idea of switching to strong and total because it addresses the concern Ross had about strong requiring two atomic increments - I was not appreciating how much performance pain that will cause. So in this scheme we only have double increments when we transition to/from 0 which I expect is not a performance concern.
So I have to look into details but I imagine the flow works as follows:
We will hold two atomic counters: `strong` and `weakPlus` (they just call it `weak_count`, but it has the +1).

Create a new strong (the combination is not atomic, but that is ok because we are constructing):

```
strong = 1;
weakPlus = 1;
```

Create a strong from another strong: `++strong;`

Create a weak from a strong: `++weakPlus;`

Delete a strong:

```
if (--strong == 0) {
  delete object;
  if (--weakPlus == 0)
    delete node;
}
```

Delete a weak:

```
if (--weakPlus == 0)
  delete node;
```

Create a strong from a weak: `++strong` (but only if strong is successfully atomically incremented from a non-zero value).
Looking at the boost implementation for the conversion, it seems they do something similar to the prior discussion using an infinite loop and atomic_compare_exchange_weak. I will look into this.
It seems they do keep a strong and weak count stored together, but the cool trick they use is that weak is bumped +1 whenever strong is not 0.
Interesting. Let me know how it goes.
That weak +1 counting system works great. Makes a very nice logic flow.
I got tripped up sorting out some issues with threads and exception handling but I think it's all resolved now. Starting to appreciate some of the subtleties and look forward to discussing that further with you.
That weak +1 counting system works great. Makes a very nice logic flow.
I got tripped up sorting out some issues with threads and exception handling but I think it's all resolved now.
Wow, that is fantastic!
Starting to appreciate some of the subtleties and look forward to discussing that further with you.
Sure thing. When you are ready, please go ahead and do a pull request against the Trilinos 'master' branch and then I can start looking things over and continue the conversation. We can reference that PR in this Issue ticket. If you could rebase your commits on top of Trilinos 'master' and amend them to reference this Issue number #229, then that would be great. When we eventually do merge this into the main development branch (which will be 'develop' instead of 'master' soon), we will use a merge commit so that it will be easy to back out if anything terrible happens. But there is still lots of work to do before we are ready for that final merge.
NOTE: We reviewed the current state of the code last week. See https://github.com/Tech-XCorp/Trilinos/pull/2.
This story is to address the long-standing problem that the Teuchos memory management classes, which use reference counting, are not thread-safe.
CC: @MicheldeMessieres, @jwillenbring
Next Action Status: See tasks ...
Tasks:
- Add CMake options `Trilinos_ENABLE_THREAD_SAFE` and `Teuchos_ENABLE_THREAD_SAFE` (the latter is given the default of the former's value).
- Make sure `Teuchos_ENABLE_THREAD_SAFE=OFF` in this case. Run the full Trilinos test suite with `-DTrilinos_ENABLE_CXX11=OFF`.
- Set `NUM_TOTAL_CORES_USED` for the multi-threaded tests (see [TRIBITS_ADD_TEST()](https://tribits.org/doc/TribitsDevelopersGuide.html#formal-arguments-tribits-add-test)).
- Keep the `BASIC` test suite fast: make the longer-running threading tests `NIGHTLY`.
- Test with `-DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON` (`Trilinos_ENABLE_THREAD_SAFE` on and off).
- Test with `-DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=OFF` (`Trilinos_ENABLE_THREAD_SAFE` on and off).
- Add `TEUCHOS_ABORT_IF(<condition>)` that will print and then call abort.