patflick / miopen-benchmark

benchmarking miopen
BSD 3-Clause "New" or "Revised" License

Benchmarking info from Resnet #3

Closed dagamayank closed 7 years ago

dagamayank commented 7 years ago

@patflick Hi, not sure if you are still maintaining this repo.

I had a question about how Resnet is timed, in particular what time is accounted for in the "ShortCut" layers. Per the code, there is a block of layers grouped under ShortCut, but the logs show separate times for the layers inside the ShortCut plus a time for the ShortCut itself. Example below:

Conv(3x3,pad=1,s=1)   (16,64,56,56)->(16,64,56,56)
BatchNorm()   (16,64,56,56)->(16,64,56,56)
ReLU() (16,64,56,56)->(16,64,56,56)
Conv(1x1,pad=0,s=1)   (16,64,56,56)->(16,256,56,56)
BatchNorm()   (16,256,56,56)->(16,256,56,56)
ShortcutF

Can you please explain where the time is spent in ShortCutF in this case? Is this the time for running the last layer plus synchronization overhead? Or is it something else?

/cc @soumith

patflick commented 7 years ago

Hi @dagamayank

The ShortCutAdd layer is a composite layer that wraps two separate layers (F and G) and adds their results. This pattern is used throughout ResNet. Each of the layers ShortCutF and ShortCutG can itself be a composite layer, e.g. a Sequential.

The time shown for any composite layer (e.g. Sequential, ShortCutF, ShortCutG, ShortCut) is the total time taken for that layer's .forward(...) call, which includes the recursive execution of all sublayers' .forward() calls as well as synchronization overhead. For example, the ShortCut layer's time should be slightly more than the sum of ShortCutF and ShortCutG.

For analyzing the runtime of different layers, I usually first filter out all composite layers. The summarize.sql script does this by removing all Sequential and ShortCut layers.

dagamayank commented 7 years ago

@patflick Thanks for responding and for reminding me about the summarize.sql script.

I understood, after following the code, that the ShortCut* and Sequential layers are composite layers. However, I am still not 100% sure whether the current design introduces extra overhead. For example, if we look at the internals of Resnet, the current design is the following:

The same is corroborated by the summary.tsv generated by the summarize.sql script. If I sum the SumTime of all the layers (call it x) and compare it with Sum(Model_fwd, Model_bwd) (call it y), I see that y = 1.25 * x. That means 25% overhead is coming from somewhere "outside" of the individual layers. It is hard to imagine the synchronization overhead being 25%. Are we double-counting something in the "Model" times?

patflick commented 7 years ago

The overhead measured in the Model includes multiple things:

These are executed via the toc() and log_step() functions after every single layer.

To get the model timing without the overhead, you can disable the layerwise logging during the build process with:

make LAYER_TIMING=0

This will then report timings only for the whole model's forward/backward passes. This is also the time we used to compare runtimes against NVIDIA (cuDNN, etc.).

dagamayank commented 7 years ago

Thanks @patflick. Makes sense now.