
Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach #33


Reading Notes


Target

Predict the execution time of DNN training on a GPU the user does not have access to, using runtime measurements collected on a GPU they do have.

Challenges:

  1. Some DNN operations are implemented using different GPU kernels on different GPUs.
    1. The paper distinguishes kernel-alike operations (same kernels on every GPU) from kernel-varying operations (different kernels on different GPUs).
    2. Which kernels are used depends on the underlying software libraries invoked during training (e.g., cuDNN [62], cuBLAS [65]).
  2. Profiling is slow.
    1. Mitigation: cache measurements and only measure "important" kernels (see the sketch after this list).
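
A minimal memoization sketch of the caching idea (the names and structure here are hypothetical, not Habitat's actual code):

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache keyed by kernel name and launch configuration.
_kernel_time_cache: Dict[Tuple[str, tuple], float] = {}

def measure_kernel(name: str, config: tuple, profile: Callable[[], float]) -> float:
    """Return a cached measurement, running the expensive profiler only on a miss."""
    key = (name, config)
    if key not in _kernel_time_cache:
        _kernel_time_cache[key] = profile()  # expensive: actually measure the kernel
    return _kernel_time_cache[key]
```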

Design

Two steps

  1. Measure the execution time of a training iteration on an existing GPU, using CUPTI to record kernel-level traces (a rough iteration-level timing sketch follows this list).
  2. Scale the measured execution time of each individual operation onto a different GPU, using either wave scaling or pre-trained multilayer perceptrons (MLPs).
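
Habitat performs step 1 with CUPTI at the kernel level. As a rough, non-equivalent stand-in, one can time a whole training iteration in PyTorch with CUDA events (this yields iteration-level time only, not per-kernel traces):

```python
import torch

def time_iteration(step_fn, warmup: int = 3, repeats: int = 10) -> float:
    """Mean wall-clock time (ms) of one training iteration; requires a CUDA GPU."""
    for _ in range(warmup):
        step_fn()  # e.g., forward + backward + optimizer step
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        step_fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats
```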

Two methods

  1. Wave scaling, a technique based on a GPU's execution model.
    1. Scales a kernel's measured execution time using the ratios between (i) the number of compute units on each GPU and (ii) their memory bandwidths.
    2. $\gamma$ represents the "memory bandwidth boundedness" of the kernel: a large value means the kernel is memory bound, a small value means it is compute bound.

(Figure: the wave scaling formula.)
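
A simplified sketch of the scaling arithmetic, interpolating between the compute-unit ratio and the memory-bandwidth ratio with $\gamma$. The paper's actual formula is more detailed, so treat this as illustrative only:

```python
def wave_scale(t_origin_ms: float,
               sm_origin: int, sm_dest: int,
               bw_origin: float, bw_dest: float,
               gamma: float) -> float:
    """Predict a kernel's execution time on the destination GPU.

    gamma ~ 1: memory bound, dominated by the bandwidth ratio;
    gamma ~ 0: compute bound, dominated by the compute-unit ratio.
    """
    compute_ratio = sm_origin / sm_dest  # more compute units -> faster
    memory_ratio = bw_origin / bw_dest   # more bandwidth -> faster
    return t_origin_ms * (gamma * memory_ratio + (1.0 - gamma) * compute_ratio)

# Example with published specs: V100 (80 SMs, 900 GB/s) -> A100 (108 SMs, 1555 GB/s)
# wave_scale(2.0, 80, 108, 900.0, 1555.0, gamma=0.8)
```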

  2. Pre-trained multilayer perceptrons (MLPs), used for kernel-varying operations.
    1. The MLP's inputs are the operation's hyperparameters. For Conv2D: (i) batch size, (ii) number of input and output channels, (iii) kernel size, (iv) padding, (v) stride, and (vi) image size. A sketch of such a predictor follows below.
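
A minimal PyTorch sketch of such a predictor; the layer sizes are assumptions, and the paper's MLPs may use a different architecture and additional features:

```python
import torch
import torch.nn as nn

class Conv2DLatencyMLP(nn.Module):
    """Sketch: predict Conv2D kernel time from its hyperparameters."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # 7 inputs: batch size, input channels, output channels,
        # kernel size, padding, stride, image size.
        self.net = nn.Sequential(
            nn.Linear(7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted execution time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```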

Results

Alternative (baseline) methods:

Use the ratio between the peak floating-point operations per second (FLOPS) of the two GPUs, or the ratio between the number of CUDA cores on each GPU. Both baselines assume that a DNN training workload can exhaust all the computational resources on a GPU (a sketch of the FLOPS-ratio baseline follows below).
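
For contrast, the naive FLOPS-ratio baseline reduces to a single multiplication:

```python
def flops_ratio_scale(t_origin_ms: float,
                      peak_flops_origin: float,
                      peak_flops_dest: float) -> float:
    """Scale time by the peak FLOPS ratio; assumes the workload saturates the GPU."""
    return t_origin_ms * (peak_flops_origin / peak_flops_dest)
```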