srirajpaul opened 5 years ago
Confirmed this is reproducible with the provided example; simply running it repeatedly leads to a deadlock.
Attaching gdb doesn't reveal any particularly interesting stacks. Mostly just:
at ../../src/hclib-locality-graph.c:877
I've figured out how this deadlock happens in the current branch with Sriraj's example.
The example runs with two threads, and thread 0 (the master) falls into a busy loop during hclib_finalize inside hclib_launch.
In other words, the master thread tries to steal a task from thread 2 (created in the innermost closure) and then gets stuck spinning in the middle of hclib_finalize.
This would probably happen with some forms of nested task creation. The master thread needs to account for the fact that the current finish scope comes from 'hclib::launch': it should check whether all the other threads have reached that finish.
The following is my current proposed solution for this issue. I'll try to make a patch implementing it.
So, if a worker finds that its local work-stealing queue is empty, it increments a counter shared across threads. Workers continue stealing tasks from others until a termination signal is received from the master thread. (Alternatively, a distributed combining tree could be used here, in the way LLVM's OpenMP runtime implements tree/hierarchical synchronization.)
The master thread checks whether the counter equals the number of threads. If not, it continues work stealing while re-checking whether the counter has reached the number of threads.
Once the counter reaches that value, the master thread sets the termination signal to suspend the workers.
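A minimal sketch of this counting scheme, assuming a simplified runtime outside of hclib: the names idle_workers, done_flag, try_steal, worker_loop, and master_loop are illustrative and are not hclib APIs.

```cpp
#include <atomic>

static std::atomic<int>  idle_workers{0};   // workers whose local queues have drained
static std::atomic<bool> done_flag{false};  // termination signal set by the master

// Hypothetical hook standing in for the runtime's "steal and run one task"
// routine; stubbed out here so the sketch is self-contained.
static bool try_steal() { return false; }

static void worker_loop() {
    bool counted_idle = false;
    while (!done_flag.load(std::memory_order_acquire)) {
        if (try_steal()) {
            if (counted_idle) {                // found work again:
                idle_workers.fetch_sub(1);     // withdraw the idle report
                counted_idle = false;
            }
            continue;
        }
        if (!counted_idle) {                   // local queue empty and steal failed:
            idle_workers.fetch_add(1);         // report this worker as idle
            counted_idle = true;
        }
        // keep retrying steals until the master signals termination
    }
}

static void master_loop(int nworkers) {
    // The master keeps stealing while waiting for every worker to report idle.
    while (idle_workers.load(std::memory_order_acquire) < nworkers) {
        try_steal();
    }
    done_flag.store(true, std::memory_order_release);  // suspend the workers
}
```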
Please let me know if you have any comments on this issue.
@sbak5 We're still getting the deadlock as before when we try to run the program repeatedly.
@sbak5 I can confirm that I am also able to reproduce the deadlock. While it appeared to be fixed, it re-appeared when I reduced my runtime threads to 2. You can test this yourself by setting the environment variable HCLIB_WORKERS=2.
I'll try to reproduce the deadlock. I tested the example with 2 workers.
Let me know your compiler version as well. Davinci uses GCC 4.4 by default.
I tested this example on three machines at JLSE at Argonne with GCC 4.8.5:
- Skylake 8180, 2 sockets
- Haswell E5-2699 v3, 4 sockets
- Broadwell E5-2699 v4, 4 sockets
I ran the example 4000 times, and ran the script three times. No deadlock on any of the machines.
I'm running the example on Davinci with two workers. Both hclib and the example are compiled with GCC 5.4.0 and 6.4.0.
I ran them 8000 times with the script above and couldn't reproduce the deadlock.
I tried it on Davinci with GCC 6.4.0. We can meet tomorrow and check.
@sbak5 could you also try deleting the print statement in the test? It may affect timing and make the deadlock harder to hit.
I was able to reproduce this with a single run on my laptop. It did not take thousands of attempts to hit it.
@agrippa I finally reproduced the bug. It happens on laptops or on cluster login nodes, where resources are not dedicated to the run. It also occurs, though infrequently, in dedicated environments such as cluster compute nodes.
The main reason I've found so far is that the current hclib runtime doesn't handle nested finish invocations. If a worker enters several finish scopes in a row, the earlier finish scopes may never be resolved (their counter value cannot drop back to 1), because executing a task resets the current finish scope and the runtime doesn't keep a stack of finishes.
In the example, each task has a finish inside it, and if any of those tasks never completes, the tasks that depend on the pending ones cannot be scheduled, so a deadlock occurs.
In the current hclib implementation it's hard to keep a stack of finishes, because the start and end of a finish are not necessarily scheduled on the same worker. First, I'll try to fix this issue within the current implementation, and then think about changes to the implementation.
@sbak5 I'm a bit confused. hclib 100% supports nested finish scopes. Are you saying it doesn't? That the current support is broken? The hierarchy parent is maintained by finish->parent, and there's logic around finish->counter that detects when all wrapped tasks finish.
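For reference, a simplified illustration of that bookkeeping; this is not the actual hclib source, and apart from the parent and counter fields named above, the names are made up.

```cpp
#include <atomic>

// Simplified finish scope: a parent pointer forms the nesting hierarchy, and a
// counter tracks outstanding work under the scope.
struct finish_t {
    finish_t *parent;            // immediately enclosing finish scope, or nullptr
    std::atomic<int> counter;    // outstanding tasks + 1 for the scope itself
};

// Called when a task is spawned under the current finish scope.
void on_task_spawn(finish_t *f) {
    f->counter.fetch_add(1, std::memory_order_relaxed);
}

// Called when a task wrapped by the scope completes.
void on_task_complete(finish_t *f) {
    f->counter.fetch_sub(1, std::memory_order_acq_rel);
}

// The end of a finish can only return once every wrapped task has finished,
// i.e. the counter has dropped back to the scope's own reference.
bool finish_drained(const finish_t *f) {
    return f->counter.load(std::memory_order_acquire) == 1;
}
```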
@sbak5 Yes, it does support them. But the hierarchy only links each finish to its immediate parent: if multiple finish scopes are nested (a worker runs tasks that each open a finish, several levels deep), the worker can end up context switching into a busy loop (core_work_loop).
That loop cannot return; it just spins on the flag (hc_context->done_flags[wid]). The finish scopes entered earlier on the stack can then never complete, which leaves the tasks waiting on those finishes hanging.
This is why the legacy master doesn't have this deadlock.
In the legacy master, 'help_finish' tries to execute a task before switching to the busy loop.
At least one finish can therefore return, which acts as an escape path in the call graph.
In the current implementation, this issue can be fixed easily by changing help_finish in hclib-runtime.c so that the second argument of find_and_run_task is 1 instead of 0:
need_to_swap_ctx = find_and_run_task(ws, 1, &(finish->counter), 1, finish);
This change makes the loop try to execute a task, so one of the finish scopes can return without context switching.
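To make the effect of that change concrete, here is a simplified sketch of the waiting loop. This is not the actual hclib code; pop_or_steal_task and execute_task are illustrative stand-ins for the runtime's internals (declarations only, since the real implementations live in the runtime).

```cpp
#include <atomic>

struct task_t;                   // opaque task handle
task_t *pop_or_steal_task();     // hypothetical: local pop, then steal; nullptr if none
void    execute_task(task_t *t); // hypothetical: run the task to completion

// Waiting on a finish scope's counter. With run_tasks enabled (the 0 -> 1 change
// above), the waiting worker keeps making progress instead of only spinning or
// context switching, so finish scopes entered earlier on its stack can still drain.
void help_finish_sketch(std::atomic<int> *finish_counter, bool run_tasks) {
    while (finish_counter->load(std::memory_order_acquire) != 1) {
        if (run_tasks) {
            if (task_t *t = pop_or_steal_task()) {
                execute_task(t);
                continue;
            }
        }
        // otherwise: context switch to the scheduler and re-check the counter later
    }
}
```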
I pushed a patch to roll back my previous commit. Please test this commit on your machine with the example. I tested it on a login node of NOTS with HCLIB_WORKERS=2.
When using a sequence of async_await, finish, async_await, the program deadlocks intermittently.
A sample code that produces the error is as follows.
To run the program, use the commands:
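The original sample code and run commands are not preserved in this excerpt. As a stand-in, here is a minimal sketch of the described pattern (an async_await whose body opens a finish that contains another async_await), written against the hclib C++ API; exact signatures, such as whether hclib::launch takes a module-dependency list, differ between hclib versions, so this is not the original reproducer.

```cpp
#include "hclib_cpp.h"
#include <cstdio>

int main(int argc, char **argv) {
    // Some hclib versions take a list of module dependencies as the first
    // arguments to launch; adjust to match your checkout.
    hclib::launch([] {
        hclib::promise_t<int> *outer = new hclib::promise_t<int>();
        hclib::promise_t<int> *inner = new hclib::promise_t<int>();

        // First async_await: runs once 'outer' is satisfied.
        hclib::async_await([=] {
            // Nested finish scope inside the awaited task.
            hclib::finish([=] {
                // Second async_await: runs once 'inner' is satisfied.
                hclib::async_await([=] {
                    printf("inner task ran\n");
                }, inner->get_future());
                inner->put(1);
            });
        }, outer->get_future());

        outer->put(1);
    });
    // Per the discussion above, running repeatedly with HCLIB_WORKERS=2 makes
    // the intermittent deadlock easier to hit.
    return 0;
}
```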