llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

OpenMP offload built-in profiler doesn't produce data for multiple threads #57985

Open markdewing opened 2 years ago

markdewing commented 2 years ago

The built-in profiler (described here: https://openmp.llvm.org/design/Runtimes.html#libomptarget-profile) only produces trace information for a single thread from the calling application.

The implementation uses LLVM's TimeTraceProfiler: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/TimeProfiler.h

According to the docs, each thread needs to call timeTraceProfilerInitialize and timeTraceProfilerFinishThread (for threads other than the main thread).

The difficulty with the finish-thread call is that the library doesn't know when the application threads will be done.

One possible prototype solution:

  1. In each TIMESCOPE macro, check if the profiler is initialized on that thread.
  2. If not, call timeTraceProfilerInitialize. Get the newly created instance and save it on a list.
  3. At shutdown, just before timeTraceProfilerWrite is called, run through the list (from 2) and call a modified version of timeTraceProfilerFinishThread that accepts a TimeTraceProfiler argument.
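
The three steps above can be sketched roughly as follows. This is a self-contained illustration, not the libomptarget code: `ThreadProfiler`, `ProfilerRegistry`, and `recordScope` are hypothetical stand-ins for the per-thread profiler instance, the saved list, and the TIMESCOPE macro, and no LLVM API is called. It assumes all worker threads have stopped recording before the shutdown step runs.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one thread's profiler instance.
struct ThreadProfiler {
  std::vector<std::string> Events;
};

// Hypothetical process-wide list of per-thread instances (step 2).
class ProfilerRegistry {
public:
  // Steps 1 and 2: look up the calling thread's instance and create it
  // on first use. References into std::map stay valid across inserts.
  ThreadProfiler &getForThisThread() {
    std::lock_guard<std::mutex> Lock(Mutex);
    return Instances[std::this_thread::get_id()];
  }

  // Number of threads that have recorded at least one scope.
  std::size_t threadCount() const {
    std::lock_guard<std::mutex> Lock(Mutex);
    return Instances.size();
  }

  // Step 3: at shutdown, walk the list and merge every thread's events.
  // Assumes no thread is still recording.
  std::vector<std::string> finishAll() {
    std::lock_guard<std::mutex> Lock(Mutex);
    std::vector<std::string> Merged;
    for (const auto &Entry : Instances)
      Merged.insert(Merged.end(), Entry.second.Events.begin(),
                    Entry.second.Events.end());
    Instances.clear();
    return Merged;
  }

private:
  mutable std::mutex Mutex;
  std::map<std::thread::id, ThreadProfiler> Instances;
};

// Hypothetical stand-in for what a TIMESCOPE macro would do on entry.
void recordScope(ProfilerRegistry &Registry, const std::string &Name) {
  Registry.getForThisThread().Events.push_back(Name);
}
```

Taking a lock on every scope is a simplification; a real implementation would more likely cache the instance in a thread-local pointer. Keying by `std::thread::id` also means a reused id would merge two threads' events.
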

llvmbot commented 2 years ago

@llvm/issue-subscribers-openmp

jhuber6 commented 2 years ago

Right now we're using the entry into libomptarget which is a global ctor / dtor pair. Those are only called by a single thread by the C runtime. In order to support multi-threaded applications we use OPENMP_ENABLE_LIBOMP_PROFILING=ON which enables it when spawning threads via OpenMP. It's possible we could keep a global data structure that indexes based on the global thread ID. Some entries have access to the OpenMP global thread id which we could potentially pass in, but that's not a constant.

markdewing commented 2 years ago

The particular application I'm working on uses TBB for threading and OpenMP only for offload.

jdoerfert commented 2 years ago

I guess we could check if a thread called the init function via a map if profiles are enabled. @markdewing, you want to implement that?

markdewing commented 1 year ago

I tried the normal path for multithreading by creating the threads with the OpenMP library, and it appears profiling doesn't work as expected if LLVM is compiled as static libraries. Only OpenMP offload calls on the main thread get written to the JSON file. Profiling works as expected if LLVM is compiled and linked as a shared library (LLVM_BUILD_LLVM_DYLIB and LLVM_LINK_LLVM_DYLIB are on).

I think the reason is the relevant functions get called from different shared objects, which each have their own copy of some static data. The timeTraceProfilerInitialize and timeTraceProfilerFinishThread calls for additional threads happen in kmp_runtime.cpp, which is compiled into libomp.so. The main timeTraceProfilerInitialize and timeTraceProfilerWrite happen in code compiled into libomptarget.so. The list of traces per-thread is kept in a function static variable in TimeProfiler.cpp (which is compiled into libLLVMSupport.a). The thread-finish call puts the additional per-thread data on a list that the write call doesn't see.

The solution to the original issue would also address this issue, and I can work on implementing it.

markdewing commented 1 year ago

There is a prototype that I've been using at https://github.com/markdewing/llvm-project/tree/omp_target_profile_thread which also has an environment variable to adjust the granularity.

There is a subtlety with profiling the breakdown of each __tgt_target_kernel call into data transfer and computation. The components are mappingBeforeTargetRegion, runTargetTeamRegion, and mappingAfterTargetRegion. The code in DeviceTy::runTeamRegion will run the region asynchronously as long as the appropriate RTL functions exist. The synchronization is performed at the end of the __tgt_target_kernel function. Consequently, the time recorded for runTargetTeamRegion is only the time for launch, and the time for execution gets combined with the time for mappingAfterTargetRegion. As a workaround to get some profiling data, I disabled the async path in DeviceTy::runTeamRegion.
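
The timing effect described above can be reproduced in isolation. The sketch below is not libomptarget code: it uses `std::async` as a stand-in for an asynchronous kernel launch, and `PhaseTimes` and `timeAsyncRegion` are hypothetical names. It shows that a timer around the launch sees only the launch cost, while the execution time lands in whichever phase performs the synchronization.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical result type: time attributed to each phase.
struct PhaseTimes {
  std::chrono::milliseconds Launch; // analogous to the launch scope
  std::chrono::milliseconds Sync;   // analogous to the phase that synchronizes
};

// Launches a fake "kernel" asynchronously and times launch and wait separately.
PhaseTimes timeAsyncRegion(std::chrono::milliseconds KernelTime) {
  using Clock = std::chrono::steady_clock;
  using std::chrono::duration_cast;
  using std::chrono::milliseconds;

  auto T0 = Clock::now();
  // The launch returns immediately; the work runs on another thread.
  std::future<void> Kernel = std::async(std::launch::async, [KernelTime] {
    std::this_thread::sleep_for(KernelTime);
  });
  auto T1 = Clock::now();

  // The synchronization point is where the execution time becomes visible.
  Kernel.wait();
  auto T2 = Clock::now();

  return {duration_cast<milliseconds>(T1 - T0),
          duration_cast<milliseconds>(T2 - T1)};
}
```

Calling `timeAsyncRegion(std::chrono::milliseconds(200))` would be expected to report a launch time far smaller than the wait time, which mirrors why disabling the async path gives a more useful per-phase breakdown.
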