Open markdewing opened 2 years ago
Right now we're using the entry into libomptarget, which is a global ctor / dtor pair. Those are only called by a single thread by the C runtime. To support multi-threaded applications we use `OPENMP_ENABLE_LIBOMP_PROFILING=ON`, which enables profiling when spawning threads via OpenMP. It's possible we could keep a global data structure that indexes based on the global thread ID. Some entries have access to the OpenMP global thread ID, which we could potentially pass in, but that's not a constant.
The particular application I'm working on uses TBB for threading and OpenMP only for offload.
I guess we could use a map to check whether a thread has called the init function, if profiling is enabled. @markdewing, you want to implement that?
I tried the normal path for multithreading by creating the threads with the OpenMP library, and profiling doesn't work as expected if LLVM is compiled as static libraries: only OpenMP offload calls on the main thread get written to the JSON file. Profiling works as expected if LLVM is compiled and linked as a shared library (`LLVM_BUILD_LLVM_DYLIB` and `LLVM_LINK_LLVM_DYLIB` are on).
I think the reason is that the relevant functions get called from different shared objects, each of which has its own copy of some static data. The `timeTraceProfilerInitialize` and `timeTraceProfilerFinishThread` calls for additional threads happen in kmp_runtime.cpp, which is compiled into libomp.so. The main `timeTraceProfilerInitialize` and `timeTraceProfilerWrite` calls happen in code compiled into libomptarget.so. The list of per-thread traces is kept in a function-static variable in TimeProfiler.cpp (which is compiled into libLLVMSupport.a). The thread-finish call puts the additional per-thread data on a list that the write call doesn't see.
The solution to the original issue would also address this issue, and I can work on implementing it.
There is a prototype that I've been using at https://github.com/markdewing/llvm-project/tree/omp_target_profile_thread. It also has an environment variable to adjust the granularity.
There is a subtlety in profiling the breakdown of each `__tgt_target_kernel` into data transfer and computation. The components are `mappingBeforeTargetRegion`, `runTargetTeamRegion`, and `mappingAfterTargetRegion`. The code in `DeviceTy::runTeamRegion` will run the region asynchronously as long as the appropriate RTL functions exist, and the synchronization is performed at the end of the `__tgt_target_kernel` function. Consequently, the time recorded for `runTargetTeamRegion` is only the launch time, and the execution time gets combined with the time for `mappingAfterTargetRegion`. As a workaround to get some profiling data, I disabled the async path in `DeviceTy::runTeamRegion`.
The built-in profiler (described here: https://openmp.llvm.org/design/Runtimes.html#libomptarget-profile) only produces trace information for a single thread from the calling application.
The implementation uses LLVM's TimeTraceProfiler: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Support/TimeProfiler.h
According to the docs, each thread needs to call `timeTraceProfilerInitialize`, and threads other than the main thread need to call `timeTraceProfilerFinishThread`. The difficulty with the finish-thread call is that the library doesn't know when the application threads will be done.
One possible prototype solution:
1. Have each thread call `timeTraceProfilerInitialize`.
2. Get the newly created instance and save it on a list.
3. When `timeTraceProfilerWrite` is called, run through the list (from 2) and call a modified version of `timeTraceProfilerFinishThread` that accepts a `timeTraceProfiler` argument.