pmodels / bolt

Official BOLT Repository
https://www.bolt-omp.org

BOLT tasks and ABT_cond #51

Open devreal opened 4 years ago

devreal commented 4 years ago

I am trying to leverage low-level Argobots features inside BOLT tasks (BOLT 1.0rc3, built with internal Argobots). In particular, I would like to block a set of tasks on a condition variable and eventually unblock them from a different task, as in this example:

#include <abt.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int n = 10;
#pragma omp parallel
{
#pragma omp master
{
  int blocked = 0;
  ABT_mutex mtx;
  ABT_cond cond;
  ABT_mutex_create(&mtx);
  ABT_cond_create(&cond);

  for (int i = 0; i < n; ++i) {
    printf("Discovering task %d\n", i);
  #pragma omp task shared(mtx, cond, blocked)
  {
    printf("Task %d blocking\n", i);
    ABT_mutex_lock(mtx);
    blocked++;
    ABT_cond_wait(cond, mtx);
    ABT_mutex_unlock(mtx);
  }
  }

  #pragma omp task shared(cond, mtx, blocked)
  {
    printf("Broadcast task starting\n");
    while (n != blocked) {
      ABT_thread_yield();
    }
    // mutex required to ensure all tasks entered cond
    ABT_mutex_lock(mtx);
    printf("Broadcast task broadcasting\n");
    ABT_cond_broadcast(cond);
    ABT_mutex_unlock(mtx);
  }

  #pragma omp taskwait
}
}
  return 0;
}

What I see is that all tasks are created and only the first task starts executing. Output:

$ ./test_bolt_abt_cond
Discovering task 0
Discovering task 1
Discovering task 2
Discovering task 3
Discovering task 4
Discovering task 5
Discovering task 6
Discovering task 7
Discovering task 8
Discovering task 9
Task 0 blocking

Any idea why only the first task is executing? Are the other runnable tasks not passed to Argobots? Do I need to set some environment variables to make this work?

shintaro-iwasaki commented 4 years ago

Thanks for reporting an issue! I tested in my environment, and it seems that there is a bug in BOLT's tasking logic (tasks are not parallelized). I will fix it this weekend (as well as #49).

devreal commented 4 years ago

Thanks for looking into this. I'll be happy to give it a try as soon as you have a fix ready :)

devreal commented 4 years ago

I tested with current master (4e6a8a4) but the problem persists. #47 did not fix it.

shintaro-iwasaki commented 4 years ago

Originally BOLT had a few tasking bugs, which I hope have been fixed in several PRs. I also added tests to make sure OpenMP tasks and OpenMP threads are scheduled in parallel. Thank you very much for reporting this issue!

"Correct" but nonintuitive behavior

Now it works "correctly" (in my understanding), but I finally found that the current BOLT design cannot run your program to completion with the default settings, because Argobots blocking calls block the OpenMP threads that execute the blocked OpenMP tasks. I used Clang 10.0 in the following experiments, but any recent compiler should be okay. I am not sure if an old GCC (e.g., GCC 4.x) works.

First, this is how it behaves on my four-core laptop.

$ # By default, the following is equivalent to KMP_ABT_NUM_ESS=4 OMP_NUM_THREADS=4 ./a.out 
$ ./a.out 
Discovering task 0
Discovering task 1
Discovering task 2
Task 0 blocking
Discovering task 3
Task 1 blocking
Discovering task 4
Discovering task 5
Discovering task 6
Task 2 blocking
Discovering task 7
Discovering task 8
Discovering task 9
Task 3 blocking
(hang)

Because OMP_NUM_THREADS=4, only four OpenMP tasks start executing. Since all four OpenMP threads are then blocked inside those tasks, the remaining tasks are never scheduled.

There is a design issue in BOLT. At present, on Argobots blocking calls (e.g., ABT_cond_wait()), BOLT blocks the "underlying OpenMP threads" as well as the "currently running OpenMP tasks". This is because, unlike #pragma omp taskyield, the Argobots yield call (ABT_thread_yield(), which is executed in ABT_cond_wait()) does not release the mapping between OpenMP tasks and OpenMP threads in BOLT. This mapping management is needed, for example, to schedule only four tasks at a time in the above case (since there are only four OpenMP threads).

The fundamental reason is that BOLT maps both OpenMP threads and OpenMP tasks to Argobots threads (let's say ULTs). If an OpenMP task (= ULT) calls ABT_thread_yield(), a natural expectation is that the task yields control to its parent OpenMP thread, but it actually returns to the parent Argobots scheduler (!= an OpenMP thread), since the parent OpenMP thread is also a ULT scheduled by an Argobots scheduler. To manage the OpenMP thread-task mapping, #pragma omp taskyield and the internal __kmp functions explicitly handle this mapping "in BOLT".

Runtime-level solutions

(I list a few options, but none of them are available now.)

  1. Make BOLT-aware Argobots synchronization calls: Just create BOLT_cond_wait instead of ABT_cond_wait. This lowers interoperability and maintainability, so I don't like it.

  2. Map "OpenMP threads" to "Argobots schedulers": This is the fundamental solution, but it requires a significant change in Argobots. There are a few slightly different ways to implement it: (a) make ABT_thread_yield() return to a parent "ULT" instead of a parent scheduler, either by hooking ABT_thread_yield() or by changing its definition; (b) create a "scheduler's scheduler" and let it schedule lightweight schedulers; and so on. In any case, it cannot be implemented soon. However, since this awkward thread-task mapping management also degrades the OpenMP tasking performance of BOLT, it will and should be fixed in the future (although it might not be the very near future).

  3. Ignore thread-task mapping: If we allowed independent OpenMP threads that do not belong to a team but can execute certain OpenMP tasks, this issue would be solved. Unfortunately, the current OpenMP specification and implementation do not allow this.
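
As a rough user-level illustration of the idea behind option 1, a wait can be written so that its scheduling point is an OpenMP taskyield rather than an Argobots blocking call. This is only a sketch: wait_with_taskyield and release_waiters are hypothetical names, not BOLT or Argobots functions, and the flag stands in for the condition variable. Whether other tasks actually run at the taskyield depends on the OpenMP runtime, since a runtime may treat taskyield as a no-op.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of BOLT or Argobots): wait until *flag
 * becomes nonzero, offering the OpenMP runtime a task scheduling point on
 * every iteration instead of blocking inside an Argobots call.
 * Returns the number of times the task yielded.
 */
long wait_with_taskyield(atomic_int *flag)
{
    long yields = 0;
    assert(flag != NULL);
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        /* Scheduling point visible to the OpenMP runtime; a runtime is
         * allowed to do nothing here. */
        #pragma omp taskyield
        yields++;
    }
    return yields;
}

/* Counterpart of the broadcast in this sketch: release every waiter. */
void release_waiters(atomic_int *flag)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, 1, memory_order_release);
}
```

In the reproducer, each waiting task would increment blocked and then call wait_with_taskyield() on a shared flag, and the broadcast task would call release_waiters() instead of ABT_cond_broadcast(). This has not been verified against BOLT and trades blocking for polling.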

User-level solutions

Regardless of the number of Argobots schedulers (which is, in the current implementation, equal to the number of Pthreads), providing enough executors (i.e., OpenMP threads) is the easiest solution.

$ # On my laptop, equivalent to KMP_ABT_NUM_ESS=4 OMP_NUM_THREADS=11 ./a.out 
$ OMP_NUM_THREADS=11 ./a.out 
Discovering task 0
Discovering task 1
Discovering task 2
Task 0 blocking
Discovering task 3
Discovering task 4
Discovering task 5
Discovering task 6
Discovering task 7
Task 3 blocking
Task 4 blocking
Task 5 blocking
Task 6 blocking
Task 7 blocking
Task 1 blocking
Task 2 blocking
Discovering task 8
Discovering task 9
Task 8 blocking
Task 9 blocking
Broadcast task starting
Broadcast task broadcasting
$ 
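
The choice of 11 threads can be checked with a toy model. The sketch below is not BOLT code. It assumes, following the explanation above, that every task blocked in an Argobots call (and the broadcast task spinning in ABT_thread_yield()) keeps its OpenMP thread occupied until the broadcast happens. The names run_completes and min_threads are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: n_blocking tasks wait on the condition variable and one more
 * task broadcasts. No task gives its OpenMP thread back before the
 * broadcast, so each started task pins one thread.
 */
bool run_completes(int n_blocking, int n_threads)
{
    assert(n_blocking >= 0 && n_threads > 0);

    int total_tasks = n_blocking + 1; /* waiters plus the broadcast task */
    int busy_threads = 0;             /* threads pinned by a started task */
    int started = 0;                  /* tasks that found a free thread */

    /* Start tasks for as long as a free thread exists. */
    while (started < total_tasks && busy_threads < n_threads) {
        busy_threads++;
        started++;
    }

    /* The broadcast fires only if every waiter and the broadcast task
     * itself got a thread; otherwise the program hangs. */
    return started == total_tasks;
}

/* Smallest thread count for which the model completes. */
int min_threads(int n_blocking)
{
    int t = 1;
    while (!run_completes(n_blocking, t)) {
        t++;
    }
    return t;
}
```

Under this model, run_completes(10, 4) is false, which matches the hang with the default of four threads, and min_threads(10) is 11: one thread per blocking task plus one for the broadcast task.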

In practice, the threading performance of BOLT is not bad, so using OpenMP threads instead of OpenMP tasks is another option.

Anyway, thank you very much for such an insightful question! It helped us find a few bugs, add scheduling tests, and recognize this design issue in BOLT.