llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[OpenMP] no async tasks when there is 1 thread only #63394

Open ye-luo opened 1 year ago

ye-luo commented 1 year ago

This is a libomp implementation issue not an OpenMP spec one. Asynchronous tasking behavior should be respected even when there is no parallel region and only a single implicit thread exists. I wrote this reproducer as a proxy for the issue with target tasks, which should be asynchronous but are not, because of this limitation in libomp.

code

#include <iostream>
void run()
{
  #pragma omp task
  std::cout << "running task 1" << std::endl;
  #pragma omp task
  std::cout << "running task 2" << std::endl;

  std::cout << "task 1 & 2 launched" << std::endl;
  #pragma omp taskwait
  std::cout << "taskwait completed!" << std::endl;

}

int main()
{
  std::cout << "run without parallel" << std::endl;
  run();

  std::cout << "\nrun with parallel but 1 thread" << std::endl;
  #pragma omp parallel num_threads(1)
  run();

  std::cout << "\nrun with parallel but 2 thread" << std::endl;
  #pragma omp parallel num_threads(2)
  run();
}
$ clang++ -fopenmp test_task.cpp && ./a.out 
run without parallel
running task 1
running task 2
task 1 & 2 launched # no asynchronicity, both tasks are treated as included tasks and execute immediately. 
taskwait completed!

run with parallel but 1 thread
running task 1
running task 2
task 1 & 2 launched # no asynchronicity, both tasks are treated as included tasks and execute immediately. 
taskwait completed!

run with parallel but 2 thread
task 1 & 2 launched  # both tasks are async
running task 2
running task 1
taskwait completed!
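The ordering shown in the output above can also be checked programmatically instead of by reading interleaved prints. The following is a minimal sketch, assuming the same task structure as run(); the names record_run, saw_deferred_task, and log_guard are hypothetical and not part of the original reproducer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: same task structure as run(), but it records
// events in a vector instead of printing them.
std::vector<std::string> record_run() {
  std::vector<std::string> log;

  #pragma omp task shared(log)
  {
    #pragma omp critical(log_guard)
    log.push_back("task 1");
  }
  #pragma omp task shared(log)
  {
    #pragma omp critical(log_guard)
    log.push_back("task 2");
  }

  // The creating thread has now passed both task constructs.
  #pragma omp critical(log_guard)
  log.push_back("launched");

  // All child tasks are complete after this point, so no guard is needed.
  #pragma omp taskwait
  log.push_back("taskwait completed");
  return log;
}

// True if at least one task body ran after the creating thread had passed
// both task constructs, which proves that task was deferred.
// False is inconclusive with more than one thread: another thread may
// simply have run both deferred tasks before "launched" was recorded.
bool saw_deferred_task(const std::vector<std::string>& log) {
  assert(log.size() == 4);
  return log[2] != "launched";
}
```

With the behavior reported above, record_run() called outside a parallel region, or inside a parallel region with num_threads(1), would be expected to place "launched" third, so saw_deferred_task would return false.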
llvmbot commented 1 year ago

@llvm/issue-subscribers-openmp

jdoerfert commented 1 year ago

For tasks that are not target nowait, why would we not execute them immediately in a sequential environment?

ye-luo commented 1 year ago

A target nowait task can be viewed as a special case of a detached task. If such a task is executed immediately and the thread waits for its completion, the only thread is blocked and no further tasks can be created. If instead all tasks can be created before any of them is scheduled, then other tasks can still run while a detached task has executed but is not yet fulfilled.
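The scheduling argument in this comment can be sketched with a toy single-threaded model in plain C++. This is not libomp code and does not use OpenMP. ToyThread, Policy, run_scenario, and all member names are hypothetical, and the ImmediateAndWait policy simply encodes the blocking scenario as this comment describes it, which is disputed later in the thread.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model of one thread creating tasks. NOT libomp code.
enum class Policy { ImmediateAndWait, Deferred };

struct ToyThread {
  Policy policy;
  std::deque<std::function<void()>> queue;  // deferred task bodies
  std::vector<std::string> log;             // task bodies that ran, in order
  int pending_events = 0;                   // detached tasks not yet fulfilled
  bool blocked = false;                     // thread is stuck waiting on an event

  explicit ToyThread(Policy p) : policy(p) {}

  // Ordinary task: run it now or queue it, depending on the policy.
  void create_task(const std::string& name, std::function<void()> body) {
    if (blocked) return;  // a blocked thread never reaches later task constructs
    if (policy == Policy::Deferred) {
      queue.push_back([this, name, body] { log.push_back(name); body(); });
    } else {
      log.push_back(name);
      body();
    }
  }

  // Detached-style task: it completes only once its event is fulfilled.
  void create_detached_task(const std::string& name) {
    if (blocked) return;
    ++pending_events;
    if (policy == Policy::Deferred) {
      queue.push_back([this, name] { log.push_back(name); });
    } else {
      log.push_back(name);
      // Modeled scenario: the thread waits here for completion, but the
      // task that would fulfill the event has not been created yet.
      if (pending_events > 0) blocked = true;
    }
  }

  void fulfill_event() {
    assert(pending_events > 0);
    --pending_events;
  }

  // Runs all queued tasks; returns true if every task completed.
  bool taskwait() {
    if (blocked) return false;
    while (!queue.empty()) {
      std::function<void()> next = std::move(queue.front());
      queue.pop_front();
      next();
    }
    return pending_events == 0;
  }
};

// Creates a detached task, then a second task that fulfills its event.
bool run_scenario(ToyThread& t) {
  t.create_detached_task("detached body");
  t.create_task("fulfiller", [&t] { t.fulfill_event(); });
  return t.taskwait();
}
```

In this model, run_scenario returns false under Policy::ImmediateAndWait because the fulfilling task is never created, and true under Policy::Deferred because both tasks are queued before either runs.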

jdoerfert commented 1 year ago

What I'm trying to say is that we are probably fine with the behavior of the reproducer, but we need to change how we model target tasks.

ye-luo commented 1 year ago

It made sense when OpenMP was only used for CPUs. However, I would say keeping the 1-thread case special is unhealthy and error-prone. Based on my experience writing all kinds of parallel code, it is not worth keeping a special code path if it can be folded cleanly into the generic case. I don't see a good reason why the OpenMP runtime should be an exception. Why make target tasks special? To me it is a worse option than making tasks async regardless of the number of threads.

jdoerfert commented 1 year ago

To me it is a worse option than making tasks async regardless of the number of threads.

Arguably, that is inherently costly. That said, I am unsure whether it matters in practice; honestly, I'd be surprised if it did. Long story short, I agree with your point.

@TerryLWilmarth, wdyt?

shiltian commented 1 year ago

This is a libomp implementation issue not an OpenMP spec one.

First of all, I'd not call it an issue. It is an implementation choice. All current behavior conforms to the OpenMP spec.

A target nowait task can be viewed as a special case of a detached task. If such a task is executed immediately and the thread waits for its completion, the only thread is blocked and no further tasks can be created.

This is not right. The completion of a detachable task only affects its dependent tasks. The encountering thread proceeds once the body of the detachable task has finished. If you were referring to an explicit task that depends on a detachable task, then that is true.

As for explicit tasks created in either an implicit or an inactive parallel region, whether to execute them immediately or at the end, after the body of the parallel region has finished, is debatable. Either side can easily come up with cases that perform better than the other.

For target tasks (which is your concern), we do indeed need to remodel them. That would be a request to the OpenMP language committee.

ye-luo commented 1 year ago

For target tasks (which is your concern), we do indeed need to remodel them. That would be a request to the OpenMP language committee.

The spec doesn't dictate one way or another. It is more of an implementation choice. So I don't see any spec change needed.