
Computational Performance Predictions for Deep Neural Network Training: A Runtime-Based Approach #33


Reading Notes


Target

Predict the execution time of DNN training on a GPU the user does not have access to, using runtime measurements collected on a GPU they do have.

Challenges:

  1. Some DNN operations are implemented using different GPU kernels on different GPUs.
    1. The paper distinguishes kernel-alike operations (same kernels on every GPU) from kernel-varying operations (different kernels on different GPUs).
    2. Which kernels are used depends on the underlying software libraries invoked during training (e.g., cuDNN [62], cuBLAS [65]).
  2. Profiling is slow.
    1. Mitigation: cache measurements and only measure "important" kernels (see the sketch after this list).
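
A minimal memoization sketch of the caching idea (the names and structure here are hypothetical, not Habitat's actual code):

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache keyed by kernel name and launch configuration.
_kernel_time_cache: Dict[Tuple[str, tuple], float] = {}

def measure_kernel(name: str, config: tuple, profile: Callable[[], float]) -> float:
    """Return a cached measurement, running the expensive profiler only on a miss."""
    key = (name, config)
    if key not in _kernel_time_cache:
        _kernel_time_cache[key] = profile()  # expensive: actually measure the kernel
    return _kernel_time_cache[key]
```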

Design

Two steps

  1. Measure the execution time of a training iteration on an existing GPU, using CUPTI to record kernel-level traces (a rough iteration-level timing sketch follows this list).
  2. Scale the measured execution time of each individual operation onto a different GPU, using either wave scaling or pre-trained multilayer perceptrons (MLPs).
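
Habitat performs step 1 with CUPTI at the kernel level. As a rough, non-equivalent stand-in, one can time a whole training iteration in PyTorch with CUDA events (this yields iteration-level time only, not per-kernel traces):

```python
import torch

def time_iteration(step_fn, warmup: int = 3, repeats: int = 10) -> float:
    """Mean wall-clock time (ms) of one training iteration; requires a CUDA GPU."""
    for _ in range(warmup):
        step_fn()  # e.g., forward + backward + optimizer step
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(repeats):
        step_fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / repeats
```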

Two methods

  1. Wave scaling, a technique based on a GPU's execution model.
    1. Scales a kernel's measured execution time using the ratios between (i) the number of compute units on each GPU and (ii) their memory bandwidths.
    2. $\gamma$ represents the "memory bandwidth boundedness" of the kernel: a large value means the kernel is memory bound, a small value means it is compute bound.

(Figure: the wave scaling formula.)
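
A simplified sketch of the scaling arithmetic, interpolating between the compute-unit ratio and the memory-bandwidth ratio with $\gamma$. The paper's actual formula is more detailed, so treat this as illustrative only:

```python
def wave_scale(t_origin_ms: float,
               sm_origin: int, sm_dest: int,
               bw_origin: float, bw_dest: float,
               gamma: float) -> float:
    """Predict a kernel's execution time on the destination GPU.

    gamma ~ 1: memory bound, dominated by the bandwidth ratio;
    gamma ~ 0: compute bound, dominated by the compute-unit ratio.
    """
    compute_ratio = sm_origin / sm_dest  # more compute units -> faster
    memory_ratio = bw_origin / bw_dest   # more bandwidth -> faster
    return t_origin_ms * (gamma * memory_ratio + (1.0 - gamma) * compute_ratio)

# Example with published specs: V100 (80 SMs, 900 GB/s) -> A100 (108 SMs, 1555 GB/s)
# wave_scale(2.0, 80, 108, 900.0, 1555.0, gamma=0.8)
```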

  2. Pre-trained multilayer perceptrons (MLPs), used for kernel-varying operations.
    1. The MLP's inputs are the operation's hyperparameters. For Conv2D: (i) batch size, (ii) number of input and output channels, (iii) kernel size, (iv) padding, (v) stride, and (vi) image size. A sketch of such a predictor follows below.
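
A minimal PyTorch sketch of such a predictor; the layer sizes are assumptions, and the paper's MLPs may use a different architecture and additional features:

```python
import torch
import torch.nn as nn

class Conv2DLatencyMLP(nn.Module):
    """Sketch: predict Conv2D kernel time from its hyperparameters."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        # 7 inputs: batch size, input channels, output channels,
        # kernel size, padding, stride, image size.
        self.net = nn.Sequential(
            nn.Linear(7, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted execution time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```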

Results

Alternative (baseline) methods:

Use the ratio between the peak floating-point operations per second (FLOPS) of the two GPUs, or the ratio between the number of CUDA cores on each GPU. Both baselines assume that a DNN training workload can exhaust all the computational resources on a GPU (a sketch of the FLOPS-ratio baseline follows below).
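
For contrast, the naive FLOPS-ratio baseline reduces to a single multiplication:

```python
def flops_ratio_scale(t_origin_ms: float,
                      peak_flops_origin: float,
                      peak_flops_dest: float) -> float:
    """Scale time by the peak FLOPS ratio; assumes the workload saturates the GPU."""
    return t_origin_ms * (peak_flops_origin / peak_flops_dest)
```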