Closed dagamayank closed 7 years ago
Hi @dagamayank
The ShortCutAdd Layer is a composite layer that wraps around two separate layers (F and G), and adds the results of both. This is used a lot in ResNet. Each Layer ShortCutF
and ShortCutG
can also be a composite layer, e.g. Sequential
.
The time shown for any composite layer (e.g. Sequential
, ShortCutF
, ShortCutG
, ShortCut
), is the total time taken for the layers .forward(...)
call, which includes the recursive execution of all sublayer's .forward()
, as well as synchronization overhead. For example, the ShortCut
layer time should be a little more than the sum of ShortCutF
and ShortCutG
.
For analyzing the runtime of different layers, I usually first filtered out all composite layers. The summarize.sql
script does this by removing all Sequential
and ShortCut
layers.
@patflick Thanks for responding and reminding again about the summarize.sql
script.
I understood after following through the code that ShortCut*
and Sequential
are composite layers. However, I am still not 100% whether the current design is introducing extra overheads or not. For example if we look at the internals of Resnet, the current design is the following:
The same is corroborated by the summary.tsv
generated by the summarize.sql
script. From the script if I sum the SumTime
of all the layers (call it 'x') and compare it with the Sum(Model_fwd, Model_bwd)
(call it 'y'), I notice that y = 1.25 * x. This means there is 25% overhead coming from somewhere "outside" of the individual layers. It is hard to imagine the synchronization overhead to be 25%. Are we double-counting something in the "Model" times?
The overhead measured in the Model includes multiple things:
these are executed via the toc()
and log_step()
functions after every single layer.
To get the model timing without the overhead, you can disable the layerwise logging during the build process with:
make LAYER_TIMING=0
This will then report the timings only for the whole model forward/backward passes. This is also the time we used to compare the runtimes against NVIDIA (CuDNN etc).
Thanks @patflick. Makes sense now.
@patflick Hi, not sure if you are still maintaining this repo.
I had a question about how Resnet is being timed, particularly what time is accounted for in the "ShortCut" layers. There is a block of layers grouped in ShortCut as per the code but the logs show separate times for layers in the ShortCut + time for ShortCut itself. Example below:
Can you please explain where is the time spent in ShortCutF in this case? Is this the time for running the last layer + synchronization overheads? Or is it something else?
/cc @soumith