GNU OpenMP 浅谈（一）

免责声明：本文内容仅来自于几天之内阅读各种不太相关的文档，以及阅读并猜测源码功能。所以所说内容不保证正确性和准确性，不论对错都欢迎讨论。

Motivation and References

本文的唯一目的，就是帮助和督促自己阅读和理解libGOMP（也就是GNU OpenMP library）源码，以便整理和总结出一个好的presentation，同时对自己目前读源码的认识进行记录，以便未来如果需要修改源码的时候，能够有个可以参照。对libGOMP的评价，我就不亲自说了，直接引用来自一位修改过其源码的同学的master thesis中的写的lessons learned吧：

The biggest difficulty concerning the implementation was the loop body outlining. The outlining required a signicant amount of work in the GCC middle end, which turned out to be even more intricate than expected, as the code base is huge, the documentation poor and bugs mostly show up various compiler passes later.

网上libGOMP相关的文档，无论是中文英文都很少，而源码里面又几乎没有什么有用的注释，所以很多的地方都需要自己体悟和猜测。OpenMP的其他implementation的文档事实上也并不是太多，而能拿来借鉴和辅助理解libGOMP的就更少了。目前搜到的有用的资料记录如下：

GCC源码
- http://gcc.gnu.org/svn/gcc/branches/
OpenMP and automatic parallelization in GCC
- libGOMP官方文档
- http://www.airs.com/dnovillo/Papers/gcc2006.pdf
Towards Dynamic Threading Support for OpenMP
- 关于动态线程调度的master thesis
- 内含GOMP的不少介绍
- http://e-collection.library.ethz.ch/eserv/eth:41331/eth-41331-01.pdf
How OpenMP* is Compiled
- OpenMP实现的宏观介绍，ppt
- 内有OpenMP的概念介绍
- https://iwomp.zih.tu-dresden.de/downloads/OpenMP-compilation.pdf
OpenMP implementation and performance issues
- OpenMP的其他实现，ppt
- 内有介绍overhead来源
- http://nereida.deioc.ull.es/~pcgull/lacs-uai03/slides/MatsBrorsson.pdf
How do OpenMP Compilers Work? Part 1 & 2 | Intel® Developer Zone
- 从Intel compiler的角度来讲OMP implemenation是如何结合在compiler内部实现的
- http://software.intel.com/en-us/blogs/2010/03/04/the-thing-from-another-world-or-how-do-openmp-compilers-work-part-1/

本文中的绝大多数内容，都来自于以上内容的参考，就不一一cite了。

OpenMP Standard

由于本文的目的不在于如何用OpenMP写parallel的程序，而在于了解GNU OpenMP implementation是如何实现和进行work和线程的调度的，所以这里不会详细的记录OMP的设定，只会摘录和schedule相关的内容。

和schedule最直接相关的两个设定，就是OMP_SCHEDULE以及OMP_DYNAMIC。

OMP_SCHEDULE modies the scheduling behavior of work-sharing loops with schedule kind runtime. OMP SCHEDULE allows all scheduling variants, except for runtime itself.
OMP_DYNAMIC species whether the number of worker-threads for parallel regions are dynamically adjusted to optimize the use of system resources.

OMP_SCHEDULE有4个option：static，dynamic，guided，runtime。

static When schedule(static, chunk size) is specied, iterations are divided into chunks of size chunk size, and the chunks are statically assigned to threads in the team in a round-robin fashion in the order of the thread number. Note that the last chunk to be assigned may have a smaller number of iterations. When no chunk size is specied, the iteration space is divided into chunks which are approximately equal in size, and each thread is assigned at most one chunk.

#pragma omp for schedule(static)
for (i = 1; i <= 100; i++)

QQ20130317-69

#pragma omp for schedule(static, 4)
for (i = 1; i <= 100; i++)

QQ20130317-70

dynamic When schedule(dynamic, chunk size) is specied, the iterations are assigned to threads in chunks as the threads request them. The thread executes the chunk of iterations, then requests another chunk, until no chunks remain to be assigned. Each chunk contains chunk size iterations, except for the last chunk to be assigned, which may have fewer iterations. When no chunk size is specified, it defaults to 1.

#pragma omp for schedule(dynamic)
for (i = 1; i <= 100; i++)

QQ20130317-71

guided When schedule(guided, chunk size) is specied, the iterations are assigned to threads in chunks as the threads request them. The thread executes the chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than 1), the size of each chunk is determined in the same way with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). When no chunk size is specied, it defaults to 1.

runtime When schedule(runtime) is specified, the decision regarding scheduling is deferred until run time, and the schedule and chunk size are taken from the run-sched-var control variable.

Compiler Structure

compiler的内容实际上是非常messy的，中间有很多层的intermediate steps和intermediate representation。从一个较宏观的角度来看，OMP compiler大致通常是经历了如下图所示的过程讲source code翻译成machine code的。 8a6e911bfba96e683fd03d793cfdebe7

根据中间两层的intermediate representations作为分割，compiler可分为前端、中端和后端三个层次，如下图。之所以要设置中间表示的原因为： Separating compilers into a front-end and a back-end greatly simplifies implementation of compilers that support more than one programming language and compilation target. Instead of n x m implementations (with n being the number of programming languages and m being the number of compilation targets), one only needs to implement n front-ends and link them to m back-ends, which gives n+m compiler implementations. In this architecture, the compiled program is passed from the front-end to the back-end in an intermediate language, a language-independent representation of programs.

QQ20130317-67

在standard的compiler的基础上，为了能很好的处理OMP directives（pragma），还需要做以下的工作，包括识别和检查directives，预处理一些construct，进行进一步的优化，最后是调用runtime library已经封装好的并行功能。

OpenMP Compiler Front End
- In addition to reading in the base language (Fortran, C or C++)
  - Read (parse) OpenMP directives
  - Check them for correctness
  - Is directive in the right place? Is the information correct? Is the form of the forloop permitted? ….
  - Create an intermediate representation with OpenMP annotations for further handling
OpenMP Compiler Middle End
- Preprocess OpenMP constructs
  - Translate SECTIONs to DO/FOR constructs
  - Make implicit BARRIERs explicit
  - Apply even more correctness checks
- Apply some optimizations to code to
  - ensure it performs well
  - Merge adjacent parallel regions
  - Merge adjacent barriers
OpenMP Compiler Rest of Processing —— 这里应该就是libGOMP的主要位置
- Translate OpenMP constructs to multithreaded code
  - Sometimes simple
  - Replace certain OpenMP constructs by calls to runtime routines.
  - e.g.: barrier, atomic, flush, etc
  - Sometimes a little more complex
  - Implement parallel construct by creating a separate task that contains the code in a parallel region
  - For master thread: fork slave threads so they execute their tasks, as well as carrying out the task along with slave threads.
  - Add necessary synchronization via runtime library
  - Translate parallel and worksharing constructs and clauses e.g.: parallel, for, etc
- Also implement variable data attributes, set up storage and arrange for initialization
  - Thread’s stack might be used to hold all private data
  - Instantiate new variables to implement private, reduction, etc
  - Add assignment statements to realize firstprivate, lastprivate, etc
    GCC Compiler Structure

QQ20130317-72 和GOMP最相关的内容在上图中用黄色标示。

正如上面所归纳的，首先，在中端：As a first pass, the middle-end reduces the number of different constructs that need to be supported in code generation, by transforming the OpenMP program to a canonical form. Mapping the OpenMP constructs and directives to normal form saves the back-end from dealing with a lot of special cases and simplifies the templates for code generation. In its simplest form, the canonical form reduces the number of OpenMP constructs (see below) by at least 50%. 好像在GOMP中，并没有合并不必要的重复功能的过程。所有的OMP directives都可以在runtime library里面找到直接实现的function。同时，将directives转换为normal form的部分功能好像也是在libGOMP里面完成的。所以可以推测GOMP的实现思路和这里所讲的略有区别。

然后，The middle-end also optimizes the placement of OpenMP constructs and directives to make the OpenMP program more efficient. A common optimization is to remove superfluous barriers by merging barriers or adding nowait clauses to constructs where possible without affecting program semantics.

在中端转换和优化进行完毕之后，the back-end generates executable code from the OpenMP intermediate code representation. It first splices out the code of OpenMP constructs into new functions (so-called thunks) that are executed by the OpenMP threads at runtime. It then augments the sequential code with calls to the OpenMP runtime system to launch and control parallel execution.

6a1813197c024edad3c0b53496d9d814 由上图可见，GCC compiler事实上还有很多层。这些层不一定是中间表达，但应该是由很多次的pass之后逐渐形成的。

对应上面列举的general的OMP compiler的功能，可以在GOMP里面找到对应的部分

Implement parallel construct:
- GOMP_parallel_start & parallel region & GOMP_parallel_end
Master&worker thread:
- GOMP team & GOMP thead pool
Translate parallel and work sharing constructs and clauses
- GOMP work share
- GOMP for, single, task etc

因此libGOMP基本作用在GCC compiler的middle end结束时产生的representation中，会含有调用library的function call。而和scheduler最为相关的部分就是如何将parallel region outlining。

libGOMP

The runtime library (libgomp) is essentially a wrapper around the POSIX threads library, with some target-speciﬁc optimizations for systems that support lighter weight implementation of certain primitives. For instance, locking primitives in some Linux targets are implemented using atomic instructions and futex system calls. To support libgomp, the target must also implement thread-local storage.

Built-in Function Calls

The implementation is in gcc/libgomp and most entry points into the library are deﬁned as built-in function calls inside the compiler.

Thread creation The main entry point is GOMP_parallel_start, which takes as arguments the function to run on each thread, a pointer to the .omp_data_s structure as described earlier and the number of threads to be launched. If the speciﬁed number of threads is 0, the number of threads is computed automatically.

Once the parallel region ends, threads are docked so that they can be re-used at a later time. The master thread keeps executing the code after GOMP_parallel_start, which in this case is just another invocation to the same function that the children threads are executing. A call to GOMP_parallel_end Tears down the team and returns to the previous parallel state.

There are alternate entry points for combined parallel and work-sharing constructs that avoid one extra synchronization at the start of the work-sharing construct. The compiler tries to emit these combined calls when- ever possible (omp-low.c:determine_parallel_type).

Synchronization With few exceptions, most synchronization is just a direct mapping to the underlying POSIX routines. The exceptions are omp master and omp single:

omp master simply blocks the thread with a thread-id different than 0.

Work sharing Every scheduling variant of omp for has been implemented in the library. There are three main functions, GOMP_loop_start to initialize the loop bounds, GOMPloopnext to get the next chunk of iteration space to work on, and GOMPloop*_end to ﬁnalize the parallel loop.

Data Structures

To manage the threads, libGOMP makes use of several structs: gomp thread, gomp team state, gomp team and gomp work share.

Gomp thread is the libGOMP abstraction for a thread, and mainly consists of a pointer to the function that is currently executed, a pointer to the data-sharing struct used in this function, a semaphore used for ordering of loop iterations and a gomp team state in order to know which team the current thread is associated to.

Gomp team state is associated to every thread, and stores pointers to the gomp team and gomp work share structs along with a team id that uniquely identifies a thread within a team.

Gomp team is the abstraction of a team in libGOMP and provides an array of pointers to gomp work shares of all the work-sharing regions that are currently active within the team and a lock to ensure that access to the team happens mutually exclusive. Furthermore, gomp team provides a barrier for team synchronization, the team size, the previous gomp team state (before the master thread entered the current team) and other structures to ensure the management of the active work-sharing regions.

Gomp work share provides the abstraction of a work-sharing region, which stores the schedule kind of loops along with the next and last loop iteration variable value as well as the chunk size and the step size, by which the loop iteration variable is increased after an iteration. To ensure exclusive access by multiple threads, a lock is provided. To determine whether a work-sharing region can be deleted, gomp work share also needs to manage a variable that stores the number of threads that have left the current work-sharing region. Finally, to enable the ordered clause to be implemented further variables are provided.

Thread pool To avoid the overhead from thread creation each time a parallel region is created, libGOMP manages a thread pool. At the end of a parallel region the thread is not removed, but stored in a thread pool for later reuse. The current implementation of the thread pool has one limitation: Threads from nested teams are not store. For nested parallel regions, the threads of the inner parallel regions are terminated at the end of the inner parallel region rather than stored in the thread pool. The main reason for discarding inner-nested threads is that additional locking would be required when the thread pool is accessed. But locking is not necessary when there is only one nesting level stored in the thread pool, as the whole thread pool can be managed by the initial master thread.

QQ20130317-75 由于线程池只能回收一级parallelism的线程，所以GOMP对于nested parallelism的性能不是很好，建议使用默认的disable OMP_DYNAMIC，也就是只使用synchronous task structure。

OMP Translation Examples

OMP parallel

 #include <stdio.h>

 int main(int argc, char** argv) {
 #pragma omp parallel num_threads(NUM_THREADS)
     printf("Hello World\n");
     return 0;
 }

will be translated into

 void main_omp_func_0() {
     printf("Hello World\n");
 }

 int main(int argc, char **argv) {
     _omp_start_parallel_region(main_omp_func_0);
     main_omp_func_0();
     _omp_end_parallel_region();
     return 0;
 }

The two runtime functions _omp_start_parallel_region and _omp_end_parallel_region take care of all the administrative threading stuff that needs to be done internally. The call to _omp_start_parallel_region creates the team of threads, does some magic to satisfy OpenMP requirements (e.g. setting meaningful values for OpenMP's "internal control variables"), and hands over a pointer to the thunk for execution in the created worker threads. Some implementation might also choose to use a thread pool to speed-up the launch of a parallel region. In this case _omp_start_parallel_region wakes up sleeping threads and tells them to execute new code. Similarly, _omp_end_parallel_region tears down parallel execution by terminating the threads (or putting them to sleep in the thread pool), and cleans up internal structures to prepare the OpenMP runtime for execution of the next parallel region.

schedule dynamic QQ20130317-73 will be translated into QQ20130317-74

Overhead

不同schedule的overhead

QQ20130317-68 J. M. Bull, "Measuring Synchronization and Scheduling Overheads in OpenMP", EWOMP '99, Lund, Sep., 1999.

snowsugar / snowsugar.github.com