kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

CPU profiling points #4552

Open kkm000 opened 3 years ago

kkm000 commented 3 years ago

In CUDA code, we have NVTX calls wrapped in macros. There is no counterpart in CPU code, unfortunately. I'm flying blind.

I want to add more or less profiler-independent macros at strategic locations. The general design idea is like this.

I'm going to base it conceptually on the Intel ITT API, because VTune is a very powerful profiler: it is supported on all platforms (Windows, Mac, Linux, Android, even SoC), available for free, and integrates very well into VS and (never tried) Eclipse. It has a standalone GUI and a Web server with support for profiling on a local or remote system (remote w.r.t. the Web server location, over ssh), can be operated from the command line, imports Linux perf and ftrace and Windows ETW traces, and supports comparison against stored baselines. It uses a common file format on all platforms, so that e.g. profile data can be collected on a headless Linux server and moved to Windows for analysis in Visual Studio. Command-line analysis is also available.

At the base level, a profiler does not need special marker calls, but the API helps enhance the analysis greatly (cf. CUDA NVTX). Debugging and perfing such a dynamic system as CUDA decoder on the CPU side certainly needs a hand from the developer.

My idea is to wrap ITT calls into generalized macros, which can be defined for other profiling APIs. A global define -DKALDI_PROFILE would enable them; otherwise they would expand to nothing. To specialize the calls for ITT, -DHAVE_INTEL_ITT should be defined (these are orthogonal, except that -DKALDI_PROFILE alone would warn, once, in the .cc file, that no concrete implementation is available).

I would define the macros in utils/kaldi-profile.{h,cc}. The header defines the macros; the .cc contains implementations, if needed. The header should be included into files that have profiling calls.
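As a sketch, the header could dispatch on those defines roughly like this (the file layout and macro names follow the proposal above; __itt_pause/__itt_resume are real ITT calls, the wiring itself is hypothetical):

```cpp
// Hypothetical sketch of utils/kaldi-profile.h; only the pause/resume
// pair is shown. KALDI_PROFILE enables the macros, HAVE_INTEL_ITT
// selects the ITT backend; with neither, everything expands to nothing.
#if defined(KALDI_PROFILE) && defined(HAVE_INTEL_ITT)
#include <ittnotify.h>
#define KALDI_PROF_PAUSE() __itt_pause()
#define KALDI_PROF_RESUME() __itt_resume()
#else
// With KALDI_PROFILE defined but no backend, kaldi-profile.cc would
// emit a one-time warning that no concrete implementation is available.
#define KALDI_PROF_PAUSE() ((void)0)
#define KALDI_PROF_RESUME() ((void)0)
#endif
```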

There are several nested bracketing calls, and others which mark singular events. Some concepts, such as ITT Domains, which associate the events with the whole application, need not be exposed; kaldi-profile.cc would just initialize a single domain and associate all events with it.

ITT defines the following groups of profiler hint calls.

Whenever feasible, RAII counterparts to profiling events should be defined, so that the begin/end of a region of interest (ROI) is marked by the lifetime of an object.

  1. Collection control. Collection can be paused and resumed; these are the most elementary controls.
KALDI_PROF_PAUSE();
KALDI_PROF_RESUME();
  2. Thread naming. This has been a huge sticking point for me, as all threads in the CUDA decoder are unnamed, and there are 3 types of them. Nothing fancy here: a thread just gets a name visible in the VTune UI, htop and gdb. A bonus offered by ITT, which may not be offered by other profilers: a thread can be set to ignored, and won't be shown. Linux limits the names to 15 characters.
KALDI_PROF_THREAD_NAME("CuDecH2HCopy");
KALDI_PROF_THREAD_IGNORE();
  3. Tasks. A task is an activity that does not span threads. Tasks can be nested, but not overlap. RAII is possible. Naturally, a trivially defined RAII object is optimized out by the compiler. Implementation:
KALDI_PROF_TASK_IN(DeterminizeLattice);
 . . .
KALDI_PROF_TASK_OUT(DeterminizeLattice);

But RAII style is more natural here:

KALDI_PROF_TASK(DeterminizeLattice) {
  . . .
}

Note the absence of quotes around the names; they are stringized and interned to IDs for performance.
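A minimal sketch of how the RAII form could expand (the ScopedTask name and the if-with-initializer trick, which needs C++17, are illustrative; a real ITT backend would call __itt_task_begin/__itt_task_end on an interned __itt_string_handle, and here plain counters stand in for those calls):

```cpp
// Counters standing in for the profiler's task begin/end records.
int g_tasks_open = 0, g_tasks_total = 0;

struct ScopedTask {
  explicit ScopedTask(const char *name) : name_(name) {
    ++g_tasks_open;   // __itt_task_begin would go here
    ++g_tasks_total;
  }
  ~ScopedTask() { --g_tasks_open; }  // __itt_task_end would go here
  const char *name_;
};

// Stringizes the bare identifier (hence no quotes at the call site) and
// ties the marker's lifetime to the block that follows.
#define KALDI_PROF_TASK(name) \
  if (ScopedTask kaldi_prof_task_scope{#name}; true)

void DeterminizeDemo() {
  KALDI_PROF_TASK(DeterminizeLattice) {
    // ... work; nested KALDI_PROF_TASK blocks are fine ...
  }
}
```

On first use a real backend would intern the stringized name to an ID (e.g. with a function-local static), so the per-call cost is a pointer pass.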

  4. Frames. Activities possibly spanning threads, so no RAII; just like the first task example. The name "frame" is not the best; these may be called "Activity" instead.

  5. Events. These mark named points in time. The analysis shows the time between consecutive points. ITT requires events to be declared before use.

  6. Counters. Declared global-ish objects (e.g., class- or global-scoped) that can be atomically incremented, decremented or set. Changes are recorded in the profile.

Unlike logging, the events of types 3 to 6 above support a 1-2 kHz recording rate under VTune without a significant effect on performance.
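For illustration, the counters described above could fall back to a plain atomic when no backend is present (the ProfCounter class and its methods are hypothetical; an ITT backend would use __itt_counter_create with __itt_counter_inc_delta / __itt_counter_set_value instead):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative fallback counter; name_ would label the counter's
// track in the profiler's timeline.
class ProfCounter {
 public:
  explicit ProfCounter(const char *name) : name_(name), value_(0) {}
  void Inc(uint64_t d = 1) { value_.fetch_add(d, std::memory_order_relaxed); }
  void Dec(uint64_t d = 1) { value_.fetch_sub(d, std::memory_order_relaxed); }
  void Set(uint64_t v) { value_.store(v, std::memory_order_relaxed); }
  uint64_t Value() const { return value_.load(std::memory_order_relaxed); }
 private:
  const char *name_;
  std::atomic<uint64_t> value_;
};

// E.g., a global-scoped counter for thread-pool queue depth:
// ProfCounter g_pool_queue_depth("ThreadPoolQueueDepth");
```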

  7. Hinted synchronization. While profilers can track mutexes and other sync primitives, some code does not look like a synchronization point while it actually is one. RAII style is very natural:
void BatchedThreadedNnet3CudaOnlinePipeline::WaitForLatticeCallbacks() {
  KALDI_PROF_SYNC_WAIT(BatchedThreadedNnet3CudaOnlinePipeline::WaitForLatticeCallbacks) {
    while (n_lattice_callbacks_not_done_.load() != 0) {
      Sleep(kSleepForCallBack);
    }
  }
}
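A sketch of how KALDI_PROF_SYNC_WAIT could be backed (illustrative; the ITT sync API brackets a wait with __itt_sync_prepare and __itt_sync_acquired so the profiler attributes the blocked time to the named site; here a chrono timer stands in for the profiler, and the if-with-initializer expansion needs C++17):

```cpp
#include <chrono>

// Accumulated wait time, standing in for what the profiler would record.
long long g_wait_usec = 0;

struct ScopedSyncWait {
  explicit ScopedSyncWait(const char *site)
      : site_(site), start_(std::chrono::steady_clock::now()) {
    // __itt_sync_prepare on a site-keyed handle would go here
  }
  ~ScopedSyncWait() {
    // __itt_sync_acquired would go here
    g_wait_usec += std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::steady_clock::now() - start_).count();
  }
  const char *site_;
  std::chrono::steady_clock::time_point start_;
};

// Stringizing handles qualified names like Class::Method just fine.
#define KALDI_PROF_SYNC_WAIT(site) \
  if (ScopedSyncWait kaldi_prof_wait_scope{#site}; true)
```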
  8. User allocation tracking. Should be helpful for our CuAllocator. Every call is paired with an allocation, freeing or reallocation of a resource (normally memory, but this can be stretched to anything). Every tracker is an initialized object (a member of CuAllocator) and has a begin/end call to also track time spent in the allocator, not only leaks:
class CuAllocator {
  // If KALDI_PROFILE is not defined, declares an inline member returning
  // a global no-op interface object, to avoid any runtime or
  // (hope the compiler is smart) object size impact, like
  //     const profiling::AllocTracker& alloc_tracker_ = profiling::null_tracker;
  KALDI_PROF_ALLOC_TRACKER(alloc_tracker_);

  // RAII style. Concession: split Malloc into Malloc + MallocImpl to avoid double-counting.
  inline void* MallocLocking(size_t size) {
    auto track = alloc_tracker_.TrackAlloc(size);
    std::unique_lock<std::mutex> lock(mutex_);
    return track(MallocImpl(size));
  }
  }
};
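One possible shape for the tracker itself (all names hypothetical; an ITT backend would record through the heap API: __itt_heap_function_create once, then __itt_heap_allocate_begin / __itt_heap_allocate_end around each allocation):

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative AllocTracker: TrackAlloc returns a small scope object
// that times the allocation and records the pointer/size when invoked
// on the allocator's result. A byte counter stands in for the profiler
// calls; atomics are omitted for brevity.
class AllocTracker {
 public:
  class Scope {
   public:
    Scope(AllocTracker *t, size_t size) : t_(t), size_(size) {
      // __itt_heap_allocate_begin would start the timed region here
    }
    void *operator()(void *p) {
      // __itt_heap_allocate_end would end it here, recording p and size_
      if (p != nullptr) t_->bytes_live_ += size_;
      return p;
    }
   private:
    AllocTracker *t_;
    size_t size_;
  };
  Scope TrackAlloc(size_t size) { return Scope(this, size); }
  size_t bytes_live() const { return bytes_live_; }
 private:
  size_t bytes_live_ = 0;
};
```

Usage matches the MallocLocking example above: `track(MallocImpl(size))` both times the call and accounts for the bytes.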

@danpovey , @jtrmal , LMK what you think: is it a good idea to help the profiler in general, and is the design at least OK? Some ugly templating is inevitable for zero impact when profiling is off; I'm not a big fan of it, at least pre-C++17, so I will keep it to a minimum.

danpovey commented 3 years ago

I don't know enough about profilers to understand how easy it would be (and whether it would be possible) to make the same macros work for other profilers. Do most of those things have counterparts in other profilers, and can you name some profilers that we might use? How much of a hassle is VTune to download and install? (Although I guess this matters little, since few people would be profiling.)


kkm000 commented 3 years ago

I can name gprof, the GNU profiler, and the Visual Studio Profiler, but AFAIK they do not have a hint API like NVTX/ITT. You can turn collection on/off, but that is usually done externally (with a list of functions). I'm not aware of anything Xcode uses. VTune is a very advanced tool.

The hassle of downloading and installing VTune on Debian is apt install intel-oneapi-vtune, if you have already set up a repo for oneAPI. If you have installed DKMS, it will register its drivers with it and compile them (but if you use an NVIDIA GPU, you have DKMS installed, as it also compiles the NVIDIA drivers). On Windows, this is an .msi file, just like everything on Windows.

I do not expect these hints to be used a lot. Any profiler will tell you hotspots etc. If not for the GPU, you do not need much else: when the CPU overloads, well, it just works at 100% capacity. It's the heterogeneity of GPU code that makes it awfully hard to debug and profile. This is why NVTX exists (not that we use much of it). It has ranges (analogous to ITT tasks), events with optional structured data (close to ITT events and counters), and an API to name things (threads, just like ITT, but also other CUDA-specific objects like streams, etc.).

When a task goes in or out of the thread pool, I want to know how many are there. When a batch is started on the GPU, I want to see how many lanes are filled. Most problems I'm debugging are "clogs" in the process, which cause it to fail at an explosive speed. The thread pool load I can see with a ~1s granularity in htop. That's all fine, but with a fast GPU, like a V100, I see the explosion happening at a literally nuclear speed: from the last "normal" log message to a bunch of Kaldi's "increase --num-channels" warnings to the total kaboom in under 15ms (and I'm using a logger that offloads to a thread pool, or I would not even be able to log). I need to catch the overload signal somehow, but I do not even know what metric to look at.

Besides, CuAllocator is totally opaque, so debugging memory allocations means turning it off, which changes the timing.

Going back to the profiling stuff, I'd start slowly, adding stuff as needed. I just put up the whole picture here for the sake of discussion; I won't rush to implement it all at once. Version 0.1 tends to change to the point of being unrecognizable anyway.

On the same note, the design of going through CuDevice to the allocator is also cuda-memcheck-unfriendly: any thread that allocated memory must call Instantiate(), although there is no strict need for it except checking the multithreaded_ flag, which is global anyway; this in turn allocates per-thread CuBLAS and other handles, which I then need to make sure to free to avoid false positives. So I had to hack it quite a bit to use the CUDA profiler to hunt device memory leaks. We need to simplify stuff in that department, too.

kkm000 commented 3 years ago

I'd start with thread names. Looking in the debugger at 18 threads, 16 of which are named "gpu_server" and differ only by 5-digit TIDs, and trying to remember which of the numbers did what when I hit that assertion is kinda at the limit of my feeble brainpower. I'll either add a thread-naming API, or start attaching sticky notes on the monitor.