Closed · goplanid closed this issue 1 month ago
Hi @goplanid, I still have several questions regarding your use case, and a small reproducer would be helpful.
This is how I understood what you want to implement:
- top-level tasks will do some compute and run a nested parallel loop
- the nested parallel loop should be statically divided among a particular number of threads
- a nested task can start only once all the required threads have arrived

In oneTBB we don't guarantee parallelism for a parallel region, so this example might hang:
```cpp
int num_threads_available = std::thread::hardware_concurrency();
tbb::task_arena arena(/* concurrency = */ num_threads_available);
arena.execute([&] {
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs),
        [&](const tbb::blocked_range<int>& r) {
            wait_threads();
        },
        /* divides work statically among num_threads_available threads */
        tbb::static_partitioner());
});
```
It will work fine in most cases but might hang if, for example, you have several loops that wait inside and each wait expects hardware_concurrency threads. In that case the total number of threads in the thread pool should be hardware_concurrency * num_loops, but by default TBB creates only hardware_concurrency - 1 worker threads (you can control this with global_control).
Should the top-level tasks be processed in parallel (1), or can they be done in stages (2)?
(1) You can try to use parallel_invoke or task_group to create the appropriate number of tasks. Each task then creates a task_arena with concurrency total_concurrency / number_of_top_level_tasks, which makes it possible to run several static loops with a barrier.
(2) You can try to use parallel_pipeline, where the nested task will be a parallel stage with the required level of parallelism.
Hi @pavelkumbrasev, thank you for providing your inputs. Your understanding is right.
Let me give more details with the example below:
I am dividing my dataset into multiple blocks/chunks, where each block is computed in parallel (using TBB). Each block calls a third-party library function that needs numjobs threads to do its work. The library function has several loops that wait inside, and each wait expects hardware_concurrency threads.
Here are my further experiments:
- The example you provided above with arena.execute() hangs in my case, as you rightly pointed out, because hardware_concurrency threads are expected.
- I tried to set the total number of threads using tbb::global_control as below, for an 8-core machine (assuming I have 8 chunks of data): tbb::global_control gc(tbb::global_control::max_allowed_parallelism, num_jobs*8);
There are issues with this path:
```cpp
tg.run([&]() {
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
});
tg.wait();
```
Your guidance will be highly appreciated. Thanks.
Hi @goplanid,
If each chunk requires hardware_concurrency threads to process and you don't want to bog down the system with oversubscription, perhaps the best performance can be achieved with a serial loop for the outer tasks and a nested parallel_for with static_partitioner.
Hi @goplanid,
To guarantee parallelism in the inner loop, you could try launching numjobs threads (e.g., with std::thread) in each outerLoopTask, with each thread performing an innerLoopTask.
You can prevent oversubscription by throttling down the oneTBB concurrency (e.g., to hardware_concurrency / numjobs).
Hi @dnmokhov
Hi @pavelkumbrasev @dnmokhov
2.a How can I get more detailed logs with oneTBB, e.g., how many threads are actually being used in each innerLoopTask? I suspect the number of threads could be an issue.
2.b Will P-1 or P outer-level threads be used with each nested task arena?
2.c What would be the overall number of threads used by TBB in this case, i.e., with the creation of nested task arenas?
Hi @goplanid,
> I am launching numjobs threads in each outerLoopTask using tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask); Is this correct? Any reason you have mentioned using std::thread above?
oneTBB parallel algorithms (e.g., parallel_for) use available worker threads and do not launch new threads, so "there is no guarantee that potentially parallel tasks actually execute in parallel, because the scheduler adjusts actual parallelism to fit available worker threads" (https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler.html).
> 2.a How can I get more detailed logs with oneTBB, e.g., how many threads are actually being used in each innerLoopTask? I suspect the number of threads could be an issue.
You can call tbb::this_task_arena::current_thread_index() in each task to log the thread it is using.
> 2.b Will P-1 or P outer-level threads be used with each nested task arena?
As mentioned above, there is no specific parallelism guarantee. The executed tasks are distributed among the available threads. When a thread completes a task, it will run the next available task, so some of the tasks can end up being run serially.
> 2.c What would be the overall number of threads used by TBB in this case, i.e., with the creation of nested task arenas?
By default, hardware_concurrency threads are used. You can query this value with default_concurrency() and change it with global_control.
Hi @dnmokhov, thank you for your inputs. Sorry for the late reply, I was on leave.
I debugged further using the above pointers and see that my inner loop is being called with 31 threads when there are 2 outer-loop threads. One of the outer-loop threads is busy waiting and is not available for use in the inner loop. I want the inner loop to be called with 32 threads (basically all threads on the machine). Is there a mechanism in oneTBB for the outer threads to yield so that they are available for the inner-loop execution?
I also tried changing the number of threads using global_control in both the outer and inner loop, but it didn't help. Placing the code here.
outer loop:
```cpp
tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
tbb::parallel_for(tbb::blocked_range<int>(0, 2), outerLoopTask(A, B, C));
```
inner loop:
```cpp
oneapi::tbb::task_arena nested;
tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 32);
nested.execute([innerLoopTask, numjobs] {
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
});
```
TBB Warning: The number of workers is currently limited to 31. The request for 32 workers is ignored. Further requests for more workers will be silently ignored until the limit changes.
Kindly correct me if I am wrong anywhere and advise. Your inputs are really appreciated.
Hi @goplanid,
The executed tasks are distributed among the available threads, so each of your 2 inner loops will be called using anywhere from 1 to 32 threads.
> I want the inner loop to be called with 32 threads (basically all threads on the machine)
To avoid bogging down the system with oversubscription, perhaps the best performance can be achieved with a serial outer loop and a nested parallel_for with static_partitioner, as suggested here: https://github.com/oneapi-src/oneTBB/issues/1316#issuecomment-1968943142.
@goplanid, is this issue still relevant?
If anyone encounters this issue in the future, please open a new issue with a link to this one.
Hi,
I have below case of nested parallelism,
Level 1 or outer loop:
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, 2), outerLoopTask(A, B, C));
```
Level 2 or inner loop:
```cpp
tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
```
What I want to do: I want to run the above code with the best possible nested solution provided by TBB. In the code above, Level 1 runs for 2 iterations, and each iteration of Level 1 runs numjobs iterations (as it is an inner loop). I have a dependency in my code such that innerLoopTask can only operate when exactly numjobs threads are used.
Steps tried: To solve this problem I looked into the work isolation page of the documentation: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/work_isolation.html
I tried to create a separate task arena for each inner loop (Level 2) using the code below, but it didn't help, as I continue to see the deadlock issue:
```cpp
oneapi::tbb::task_arena nested;
nested.execute([innerLoopTask, numjobs] {
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
});
```
I also tried the isolate function using the code below but still see the same issue:
```cpp
oneapi::tbb::this_task_arena::isolate([numjobs, innerLoopTask] {
    tbb::parallel_for(tbb::blocked_range<int>(0, numjobs), innerLoopTask);
});
```
Help needed:
Any pointers will be of great help.