tongping opened 10 years ago
For kmeans, the problem seems to come from reaping the threads. Since only 128 semaphores can be created simultaneously, we can support at most 128 threads (one semaphore per thread). However, kmeans creates more than 128 threads in total, so we have to reap threads once they have exited successfully. There are bugs in reaping those threads, though, so the program can't proceed and gets stuck.
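To make the constraint concrete, here is a rough sketch (ThreadEntry, MAX_THREADS, allocEntry, and reapDeadThreads are placeholder names, not our actual code):

```cpp
#include <semaphore.h>
#include <cstddef>

// Placeholder illustration of the fixed-size table: one entry (and one
// semaphore) per tracked thread, capped at 128 concurrent entries.
constexpr size_t MAX_THREADS = 128;

struct ThreadEntry {
  sem_t sem;        // per-thread semaphore, the limited resource
  bool  inUse;      // entry currently assigned to a thread
  bool  hasExited;  // thread finished but has not been reaped yet
};

static ThreadEntry entries[MAX_THREADS];

// Allocation only succeeds while some entry is free. kmeans creates far
// more than 128 threads over its lifetime, so exited threads must be
// reaped (their entries recycled) or allocation eventually fails forever.
ThreadEntry* allocEntry() {
  for (auto& e : entries) {
    if (!e.inUse) {
      e.inUse = true;
      e.hasExited = false;
      return &e;
    }
  }
  return nullptr;  // table exhausted: dead threads must be reaped first
}

void reapDeadThreads() {
  for (auto& e : entries) {
    if (e.inUse && e.hasExited) {
      e.inUse = false;  // recycle the entry (and its semaphore slot)
    }
  }
}
```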
I have reproduced this problem again with a very simple test case, not kmeans.
We may want to change the thread-reaping policy. We should not wait until we run out of entries to hold a new thread. Instead, we could try to reap threads when there is only one actually-active thread and the number of threads is very close to the total number of threads we can support.
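Roughly, the check could look like the following, continuing the placeholder sketch above (the margin of 8 entries is just an example value):

```cpp
// Placeholder check, run at thread-creation time: reap early, while only
// one thread is actually active and the table is nearly full, instead of
// waiting for complete exhaustion.
bool shouldReapNow(size_t usedCount, size_t activeCount) {
  const size_t nearCapacity = MAX_THREADS - 8;  // assumed safety margin
  return activeCount == 1 && usedCount >= nearCapacity;
}
```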
kmeans creates thousands of threads, but only a few are active at a time. Maybe when we run out of threads we should force the end of an epoch and reap threads that way.
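Something like this, as a rough sketch on top of the placeholder table above (forceEpochEnd() is a stand-in for however we end an epoch, not an existing call):

```cpp
void forceEpochEnd();  // stand-in: close the current epoch and quiesce threads

// Placeholder hook at thread-creation time: if every entry is taken, force
// the current epoch to end, reap the threads that have already exited, and
// then retry the allocation.
ThreadEntry* allocEntryOrForceEpoch() {
  ThreadEntry* e = allocEntry();
  if (e == nullptr) {
    forceEpochEnd();
    reapDeadThreads();
    e = allocEntry();  // should succeed if any thread was reaped
  }
  return e;
}
```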
A better long-term solution would be to support an unbounded number of active threads by allocating the thread-tracking structures on-demand.
What you describe is the mechanism we are using now: we currently wait until the thread entries are exhausted before reaping. However, this creates a problem: some of the newly created threads are actually running while others are already dead, and this complicates the handling of those new threads. I bet our current problem comes from there.
As for the long-term solution, there is not much benefit in supporting an unlimited number of active threads, and it would certainly complicate the handling. Maybe we should not do this.
Now the problem is clear. When the main thread tries to reap the dead threads, it forgets to clean up the existing events on the main thread. As a result, _sync.prepareRollback() won't up/signal the main thread, so the main thread waits on the semaphore forever and can't proceed.
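To illustrate the fix as a sketch (ThreadSync, pendingEvents, and prepareRollbackSketch are placeholders for our real structures; only _sync.prepareRollback() is the actual call):

```cpp
#include <semaphore.h>
#include <vector>

// Placeholder per-thread sync record: the semaphore a thread blocks on plus
// the events it has recorded in the current epoch.
struct ThreadSync {
  sem_t wait_sem;
  std::vector<int> pendingEvents;  // stand-in for recorded sync events
};

// Sketch of rollback preparation: based on each thread's recorded events it
// decides whether to post ("up") that thread's semaphore. If the main thread
// still carries stale events from before the reaping, this step never wakes
// it, and the main thread waits on wait_sem forever.
void prepareRollbackSketch(std::vector<ThreadSync*>& threads) {
  for (ThreadSync* t : threads) {
    if (t->pendingEvents.empty()) {
      sem_post(&t->wait_sem);  // wake the thread so it can roll back
    }
  }
}

// The fix as described: when the main thread reaps dead threads, it must also
// clear its own stale events so that rollback preparation signals it.
void reapDeadThreads(ThreadSync& mainThread) {
  // ... recycle entries of exited threads ...
  mainThread.pendingEvents.clear();
}
```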
It took a long time to figure out this problem. We should use gdb so that problems like this can be found more easily.
It turns out that we didn't handle this correctly in the prepareRollback() functions.
After fixing that problem, we hit another one. We were adding a macro before handling the system call, like the following:
```cpp
// Start the new epoch for current thread
syscalls::getInstance().handleEpochBegin();
```
However, this macro is not used anywhere else, and these records should actually be handled in any case. As part of the fix, we removed the macro and renamed "handleEpochBegin()" to "epochBegin()".
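Roughly what the fixed call site looks like now (handleSyscall and the stand-in syscalls class below are only for illustration; the real class is in our code):

```cpp
// Minimal stand-in for the syscalls singleton, just so the call site below
// is self-contained; the real class lives in the project.
class syscalls {
public:
  static syscalls& getInstance() { static syscalls s; return s; }
  void epochBegin() { /* start a new epoch for the current thread */ }
};

// Sketch of the fixed path: the epoch begins unconditionally (no guarding
// macro), since these records should be handled in any case.
long handleSyscall(long number) {
  // Start the new epoch for the current thread before handling the syscall.
  syscalls::getInstance().epochBegin();

  // ... record/handle the system call itself ...
  (void)number;   // placeholder: real handling goes here
  return 0;
}
```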
It just gets stuck there in the rollback phase.