We can inspect the script using the Python cProfile profiler:
python -m cProfile -o myscript.cprof myscript.py
This resulted in the following report:
This report can be opened using the `pyprof2calltree -k -i torch_train.cprof` command. For this, you need to install the `kcachegrind` Debian package.
From this report, we can see that most of the time is spent running the Adam optimiser and Network forward functions.
Below are the results of running the torch.utils.bottleneck utility.
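For reference, the bottleneck utility is invoked on the training script as follows (the script name is illustrative):

```
python -m torch.utils.bottleneck train.py
```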
According to Takizawa et al. 2009, training on the GPU only improves speed if your hidden layers contain more than 128 neurons. However, that article was published in 2009 and its example was written in C++. Also, judging from this issue and the paper of Brito et al. 2016, it should not matter much for our small network. However, if we look at the torch.utils.bottleneck report, we can see that the script might be faster if we put both the network and the replay buffer on the GPU.
For the current network, training on the GPU is slower:
This might be because we need to copy the return values of the `learn()` function from the GPU to the CPU to compute the diagnostics. In the Spinning Up version, this is done during logging and thus not at every SGD step.
The results of the GPU bottleneck report (`nvprof --print-gpu-trace python train.py`) are:
The results of the system-call bottleneck report (`strace -fcT -e trace=open,close,read python training.py`) are:
Torch uses the `float32` dtype as the default. In TensorFlow, the dtype of the placeholders is also defined to be `float32`. Further, any `float64` values that are returned from the environment (because NumPy uses `float64`) are cast to `float32` when the transition dictionary is created in the replay buffer:
transition = {'s': s, 'a': a, 'd': d, 'raw_d': raw_d, 'r': np.array([r]), 'terminal': np.array([terminal]), 's_': s_}
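For illustration, a minimal sketch of how this cast could be made explicit when the transition is created (the helper name is hypothetical; the repository may perform the cast elsewhere):

```python
import numpy as np

def make_transition(s, a, r, terminal, s_, d, raw_d):
    """Hypothetical helper: store every array as float32 so the float64 values
    coming from the NumPy-based environment match the float32 default dtype
    of the networks."""
    return {
        's': np.asarray(s, dtype=np.float32),
        'a': np.asarray(a, dtype=np.float32),
        'd': d,
        'raw_d': raw_d,
        'r': np.array([r], dtype=np.float32),
        'terminal': np.array([terminal], dtype=np.float32),
        's_': np.asarray(s_, dtype=np.float32),
    }
```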
As seen in #17, the `LAC_TF2_CLEANED_EAGER_SPEEDUP` version is currently the fastest TF version. As a result, we use this version for the comparison with the PyTorch version. For the environment, we choose the oscillator as it is relatively light. During the comparison, we will do five rollouts of `1.1e4` steps.
As Torch and TF use different random number generators, it is not possible to make both experiments entirely identical and deterministic. The only thing we can make equal is the initial environment state. It is therefore important to remember that differences in the initial weights and policy noise cause some of the speed differences. I will try to use several random seeds to average out this difference.
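For illustration, a minimal sketch of how the seeds could be set in both frameworks (the variable names and the Gym-style `env.seed` call are assumptions):

```python
import random
import numpy as np
import torch
import tensorflow as tf

RANDOM_SEED = 0

# Seed every random number generator involved. Note that even with identical
# seeds, PyTorch and TensorFlow draw different random numbers, so the initial
# weights and exploration noise still differ between the two implementations.
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# env.seed(RANDOM_SEED)  # Gym-style environments: makes the initial state reproducible.
```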
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 45.05048489570618 |
2 | 43.75153851509094 |
3 | 43.895511865615845 |
4 | 45.38190507888794 |
5 | 43.76770758628845 |
mean | 44.36942958831787 |
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 84.154057264328 |
2 | 84.08330941200256 |
3 | 84.85945796966553 |
4 | 83.90588426589966 |
5 | 83.34130358695984 |
mean | 84.06880249977112 |
From this, it looks like Torch is about 2 times slower than TensorFlow, but we have to keep in mind that we used the `tf.function` wrapper in the TensorFlow version. This wrapper compiles the Python code into a static graph before executing it. When we disable this compilation, the TensorFlow version becomes significantly slower than the PyTorch version:
run | Execution time |
---|---|
1 | 1249.4554846286774 |
2 | 1248.4554846286774 |
3 | 1235.324506999488 |
mean | 1244.4118254189477 |
This leads to an unfair comparison. To have a fair comparison, we first have to optimize the Torch code using the equivalent torchscript wrapper. This feature is added in #22.
From the report above, we can see that although PyTorch is significantly faster than TF2.0 in pure eager mode, it is slower than TF2.0 when the `tf.function` wrapper is used. This is as expected, as this wrapper compiles the enclosed Python code into a static graph. In this comment, I will try to investigate whether it is possible to speed up the PyTorch version using the equivalent TorchScript wrapper. I will further try to see if there are other optimisations I can make to the PyTorch code.
Let's look at possible speed bottlenecks in the PyTorch code by inspecting the Spyder profiler reports. We do this by letting one agent perform `1.1e4` steps in the oscillator environment while disabling the GPU.
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 52.047799587249756 |
2 | 51.3368079662323 |
3 | 50.19379186630249 |
4 | 50.3440146446228 |
5 | 50.664284467697144 |
mean | 50.9173397064209 |
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 83.62926721572876 |
2 | 82.74273562431335 |
3 | 81.42218828201294 |
4 | 82.77582883834839 |
5 | 82.26928067207336 |
mean | 82.56786012649536 |
Let's compare the TF2.0 and Pytorch reports.
Execution time: 51.663371562957764
From this, we can see that the PyTorch code is on average 1.62 times slower than the optimised TF2.0 code. While making this comparison, it is important to note that in the TF2.0 version the following methods were wrapped with the `tf.function` wrapper and will therefore show up under the `__call__` method:

- `choose_action`
- `learn`
- `target_init`
- `update_target`
- `LyapunovCritic.call` (forward and backward pass)
- `GaussianActor.call` (forward and backward pass)

First of all, we see that the `step` function is faster in the PyTorch code. This seems to be because the `amax` function takes less time in the PyTorch version. This is unexpected, as the same Python, NumPy and environment versions are used. It could be an error in the profiler.
Here the results between PyTorch and TensorFlow are similar. This is as expected as the code was not changed.
For these methods, TensorFlow is faster as the `tf.function` wrapper is used. We can further see that most of the speed in the PyTorch version is lost during the forward and backward passes. In these functions, the following methods take the most time:

- `log_prob`: 3.67 s
- `rsample`: 1.72 s
- `linear`: 10.28 s
- `relu`: 2.65 s

After that, it is the `update_target` method that takes the most time.
To speed up the Pytorch version we can first check if there are changes we can make to the python code of the bottlenecks. After that, we can try to use the torchscript wrapper to create a static graph.
I looked on GitHub, and there is currently no open bug regarding the speed of the `log_prob` function. I also did not find ways to improve the speed of this method.
There are currently two issues that discuss slowness in the `rsample` method:
They are not related to our use case but might lead to a speed improvement of the `rsample` method in the future. For now, however, I could not find a way to speed up this method in eager mode.
There are no open issues about speed problems with the `nn.Linear` and `nn.ReLU` modules. I also did not find a way to speed these up without using the TorchScript wrapper.
Let's try to speed up the bottleneck functions using the TorchScript wrapper. I will do this in the `LAC_CLEANED_TORCH_SPEEDUP` folder.
The TorchScript module in the stable PyTorch release does not yet support the `@unused` and `@ignore` decorators when working with normal Python classes. As a result, when using PyTorch 1.6 I cannot speed up the `choose_action`, `update_target` and `learn` methods without rewriting them as separate functions (see this issue). This feature will, however, be included in PyTorch 1.7. In PyTorch 1.7 we still have the problem that we cannot yet convert the `GaussianActor` to a TorchScript (see below) and thus cannot convert the `choose_action`, `learn` and `update_target` functions, as all functions inside a TorchScript should be TorchScript compatible.
The TorchScript module can be used to speed up the `LyapunovCritic.call` method. Doing this, however, does not change the execution time by a lot (it improves by 1-2 seconds).
The TorchScript module does not yet work with the sampling operation we use in our Gaussian actor (see https://github.com/pytorch/pytorch/issues/29843 and https://github.com/pytorch/pytorch/issues/18094). We can therefore not fully translate the forward method of the Gaussian actor to TorchScript. We can, however, apply the [TorchScript wrapper](https://pytorch.org/docs/stable/jit.html) to the `nn.Linear` layers.
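A minimal sketch of what this partial scripting could look like (the class layout, layer sizes and critic head are assumptions, not the repository code):

```python
import torch
import torch.nn as nn

class LyapunovCritic(nn.Module):
    """Simplified stand-in for the real critic; sizes and head are assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        out = self.net(torch.cat([obs, act], dim=-1))
        return torch.square(out).sum(dim=-1, keepdim=True)

# The whole critic module can be compiled to TorchScript...
critic = torch.jit.script(LyapunovCritic(obs_dim=3, act_dim=1))

# ...while for the Gaussian actor only the individual linear layers can be
# scripted, because the distribution sampling is not TorchScript compatible yet.
scripted_layer = torch.jit.script(nn.Linear(3, 64))
```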
run | Execution time |
---|---|
1 | 83.89904952049255 |
2 | 85.36705589294434 |
3 | 86.45388126373291 |
4 | 85.51795291900635 |
5 | 85.36600470542908 |
mean | 85.32078886032104 |
It appears that changing the forward function of the `LyapunovCritic` and the linear layers of the `GaussianActor` into TorchScripts does not improve the performance. As explained in this forum post, this is not strange, as these modules already use the C++ backend and therefore do not benefit from the optimisations done in the TorchScript module, like fusing operations. As a result, only applying this wrapper to the full forward method of the `GaussianActor`, or even the full `learn` method, might yield improved performance.
Using the GPU instead of the CPU with our small networks does not improve or decrease the speed much.
Sometimes enabling the cuDNN auto-tune
algorithm speeds up the training (see this forum post). In our case, this did not help.
As a final step lets compare the speed of the Forward/backward pass and the distributions sample actions. For this, we can create some small dummy timing scripts. These scripts can be found in the sandbox/speed_comparison
folder.
Let's compare the speed of the sampling action in the Gaussian actor between PyTorch and TensorFlow. For this, we use the `timeit_sample_speed_compare.py` script. In this script, we take `5e5` samples/rsamples from the normal distribution.
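A rough sketch of what such a timing script could look like (the real `timeit_sample_speed_compare.py` may differ; TensorFlow Probability and the batch size are assumptions):

```python
import timeit

import torch
import tensorflow as tf
import tensorflow_probability as tfp

N = int(5e5)  # number of (r)samples

# PyTorch: reparameterised samples from a diagonal Gaussian.
torch_dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
torch_time = timeit.timeit(lambda: torch_dist.rsample(), number=N)

# TensorFlow (eager): samples from the equivalent tfp distribution.
tf_dist = tfp.distributions.Normal(loc=tf.zeros(3), scale=tf.ones(3))
tf_time = timeit.timeit(lambda: tf_dist.sample(), number=N)

print(f"PyTorch rsample: {torch_time:.2f} s, TensorFlow sample: {tf_time:.2f} s")
```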
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.269880842999555 | 15.277080029998615 |
2 | 46.73242047799795 | 15.733279996999045 |
3 | 47.41759105299934 | 15.471710162000818 |
mean | 47.13996412466562 | 15.494023396332826 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU). In PyTorch, this is done using the following command:
torch.set_default_tensor_type('torch.cuda.FloatTensor') # Enable global GPU
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 43.95948519900048 | 31.157700912997825 |
2 | 45.31296913999904 | 30.35345944400251 |
3 | 44.46277614200153 | 29.964271681001264 |
mean | 44.57841016033368 | 30.491810679333867 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag. In PyTorch, this is done using the following command:
torch.backends.cudnn.benchmark = True
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.35301048200199 | 29.58910515900061 |
2 | 44.9747937630018 | 29.630526167995413 |
3 | 45.64101090400072 | 29.92008007500408 |
mean | 44.98960504966817 | 29.713237134000034 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag. In PyTorch, this is done using the following command:
torch.backends.cudnn.fastest = True
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.429974582002615 | 31.8192962969988 |
2 | 44.92464860399923 | 30.395992316996853 |
3 | 44.754384061001474 | 30.31036730899359 |
mean | 44.70300241566777 | 30.84188530766308 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 43.431389666000086 | 15.98217932800003 |
2 | 43.28244902200004 | 15.648057652000034 |
3 | 41.674334185000134 | 15.587252373000183 |
mean | 42.79605762433342 | 15.739163117666749 |
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 49.16316856100093 | 17.141713114000595 |
2 | 49.98035491799965 | 17.08107351599756 |
3 | 50.76262897200286 | 17.62022704699848 |
mean | 49.96871748366781 | 17.281004558998878 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.19858287099851 | 32.428321216997574 |
2 | 47.84914048400242 | 32.55513772999984 |
3 | 48.16133704599997 | 32.596067703998415 |
mean | 48.069686800333635 | 32.52650888366528 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.682932835006795 | 32.69532821299799 |
2 | 48.566335363997496 | 33.00475709999591 |
3 | 49.04837610800314 | 33.51344140100264 |
mean | 48.43254810233581 | 33.07117557133218 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.537661023998226 | 33.475586044995 |
2 | 48.195051800998044 | 32.84605344099691 |
3 | 47.35034869299852 | 32.90194606300065 |
mean | 47.6943538393316 | 33.07452851633085 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.47512355899994 | 17.723089199000015 |
2 | 47.105311723999876 | 17.64742597999998 |
3 | 48.08153719200027 | 17.38713361100008 |
mean | 47.55399082500003 | 17.585882930000025 |
From these results, we can see that PyTorch is significantly faster than TensorFlow at sampling actions. Further, we see that using the GPU doesn't affect the execution time of TensorFlow, while for PyTorch the execution time gets longer. The execution time, however, stays significantly lower for the PyTorch version even when the GPU is enabled. This execution time does not improve when enabling the cuDNN benchmark and fastest flags.
Now, in addition to this sampling operation, let's also calculate the log_probabilities of the sampled action.
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.36903019700185 | 47.872389495998505 |
2 | 51.30757960200208 | 46.29049172399755 |
3 | 49.60291533399868 | 47.005423079997854 |
mean | 50.42650837766754 | 47.0561014333313 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.91270494000128 | 100.27575907999926 |
2 | 47.13373410799977 | 98.49456620199999 |
3 | 46.99059675100216 | 99.80839579399981 |
mean | 47.67901193300107 | 99.52624035866636 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.4585376219984 | 101.22743981500389 |
2 | 45.78058913299992 | 97.0079226279995 |
3 | 47.77440226700128 | 99.20072890699521 |
mean | 47.3378430073332 | 99.14536378333287 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.1281833979956 | 101.55883601200185 |
2 | 47.37521640000341 | 100.41289296699688 |
3 | 47.830623365000065 | 99.11197711800196 |
mean | 47.77800772099969 | 100.3612353656669 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.37839867199909 | 48.91452547500012 |
2 | 47.48065972999757 | 46.51999894999972 |
3 | 46.97186975300065 | 46.97186975300065 |
mean | 47.27697605166577 | 47.468798059333494 |
From this, we can see that calculating the log_probability from a sampled action is slightly faster in Pytorch when it is done on the CPU. The execution time increases in the Pytorch when the calculations are done on the GPU. For TensorFlow, the execution time decreases when the calculations are performed on the GPU. Additionally, this result also suggests that the stand-alone log_probability calculation is slower in Pytorch than in TensorFlow as the sampling operation is faster in Pytorch, but the total time is the same.
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 45.78058913299992 | 38.37326285200015 |
2 | 44.57832863599924 | 38.93101751900031 |
3 | 45.58136013500007 | 38.56264149399976 |
mean | 45.31342596799974 | 38.62230728833341 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.07486991299993 | 88.60940667599971 |
2 | 47.214407829000265 | 86.43189353300022 |
3 | 47.78040611400047 | 85.64942697800052 |
mean | 47.68989461866689 | 86.89690906233348 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 45.30441331600014 | 41.29998277499999 |
2 | 45.31698288100051 | 38.515963581000506 |
3 | 44.346509671000604 | 39.96837626400065 |
mean | 44.98930195600042 | 39.92810754000038 |
From the above results, we can see that calculating the log probability is slightly faster in PyTorch when using CPU mode. When GPU mode is used, PyTorch becomes slower. This result is expected given the results above. It also shows that the contribution of the sampling action to the execution time is smaller than that of the log_probability calculation.
Now, in addition to calculating the log_probability of the sampled action, let's also squash the computed action while applying a correction for this squashing to the calculated log_probabilities.
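For reference, a stand-alone sketch of the squashing step that is being timed here, following the common SAC-style tanh correction (a simplified illustration, not the exact repository code):

```python
import torch

def squashed_sample(dist: torch.distributions.Normal, epsilon: float = 1e-6):
    """Sample an action, squash it with tanh and correct the log probability."""
    pre_tanh = dist.rsample()                      # reparameterised sample
    action = torch.tanh(pre_tanh)                  # squash into (-1, 1)
    log_prob = dist.log_prob(pre_tanh).sum(dim=-1)
    # Change-of-variables correction: log pi(a) = log mu(u) - sum_i log(1 - tanh(u_i)^2)
    log_prob -= torch.log(1.0 - action.pow(2) + epsilon).sum(dim=-1)
    return action, log_prob

dist = torch.distributions.Normal(torch.zeros(2, 3), torch.ones(2, 3))
action, log_prob = squashed_sample(dist)
```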
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.94963947499855 | 79.64743485200233 |
2 | 50.60395075799897 | 78.01757510300013 |
3 | 50.43615571299961 | 80.14881448499727 |
mean | 50.66324864866571 | 79.27127481333324 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.378254114999436 | 151.5695584310015 |
2 | 48.73778436000066 | 153.44167755200033 |
3 | 48.58900336899751 | 150.91326868199758 |
mean | 49.2350139479992 | 151.97483488833313 |
GPU GLOBAL cudnn.benchmark
Let's enable the benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 49.12833458800014 | 152.07146545599971 |
2 | 50.353105712005345 | 151.1550173670039 |
3 | 48.28671533699526 | 154.40869215199928 |
mean | 49.256051879000246 | 152.54505832500095 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.219657355999516 | 156.77407888899324 |
2 | 47.961540133997914 | 152.9459018090056 |
3 | 47.866569441997854 | 153.34349197799747 |
mean | 48.0159223106651 | 154.35449089199878 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.151476297000045 | 80.33919906399933 |
2 | 47.25795930499953 | 77.96675552500164 |
3 | 47.56391130200063 | 79.42569091699988 |
mean | 47.657782301333405 | 79.24388183533362 |
The results above show that the Pytorch version becomes slower than the TensorFlow version when we add the Squash operation. Further, it again shows that enabling the GPU increases the execution time for the Pytorch version while it doesn't change anything for the TensorFlow version.
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.14887383600035 | 70.58557372399991 |
2 | 47.06251954599975 | 70.10898442500002 |
3 | 44.60933404100069 | 67.97248776999913 |
mean | 45.27357580766693 | 69.5556819729997 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.64849690200026 | 138.98152366400063 |
2 | 49.2268215790009 | 137.38257430200065 |
3 | 47.322883758999524 | 139.86679378100052 |
mean | 48.06606741333356 | 138.74363058233394 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.66489131900016 | 69.29213760200037 |
2 | 49.14248128899999 | 68.04381407999972 |
3 | 48.16409693700007 | 67.40893307699844 |
mean | 48.32382318166674 | 68.24829491966618 |
From the results above, we see that the squashing operation is indeed the bottleneck that causes the PyTorch version to be slower than the TensorFlow version. We therefore have to look at whether we can speed up this squashing in the PyTorch version.
Now let's compare a forward pass through the linear layers of the networks.
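A rough sketch of how this forward-pass timing could look (layer sizes, batch size and iteration count are assumptions):

```python
import timeit

import torch
import torch.nn as nn
import tensorflow as tf

N = int(5e5)
obs_torch = torch.zeros(1, 3)
obs_tf = tf.zeros((1, 3))

# Two small, comparable MLPs (sizes are assumptions).
torch_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
tf_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(64, activation="relu"),
])

torch_time = timeit.timeit(lambda: torch_net(obs_torch), number=N)
tf_time = timeit.timeit(lambda: tf_net(obs_tf), number=N)
print(f"PyTorch forward: {torch_time:.2f} s, TensorFlow forward: {tf_time:.2f} s")
```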
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 375.18870995100224 | 320.4068507410011 |
2 | 369.548168835001 | 311.17305828499957 |
3 | 370.5892158940005 | 313.39369041700047 |
mean | 371.77536489333458 | 314.9911998143337 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 229.82201903599343 | 223.34556611099833 |
2 | 224.82652258500457 | 223.9562453960025 |
3 | 227.6470035270031 | 222.9704135100037 |
mean | 227.43184838266703 | 223.4240750056682 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 238.45478896200075 | 224.40128824699786 |
2 | 235.52634897700045 | 224.19951804900484 |
3 | 230.21658990500146 | 225.0826409250003 |
mean | 234.73257594800089 | 224.56114907366768 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 231.63010256599955 | 219.30785131300217 |
2 | 223.39881358000275 | 225.78243027799908 |
3 | 228.4898673600037 | 228.4898673600037 |
mean | 227.839594502002 | 224.52671631700164 |
GPU customised (Without writing back to CPU)
Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 234.0862072489981 | 223.88553328599664 |
2 | 231.2203368430055 | 223.88553328599664 |
3 | 234.37278411599982 | 224.01825483499852 |
mean | 233.22644273600113 | 224.01825483499852 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (With writing back to CPU)
Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.
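Roughly, the difference between the two GPU variants comes down to whether the output tensor is copied back to host memory after every pass (a simplified illustration; the real timing script may differ):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()).to(device)
obs = torch.zeros(1, 3, device=device)

# Variant 1: keep the result on the GPU (no device-to-host copy).
out_gpu = net(obs)

# Variant 2: copy the result back to the CPU after every forward pass. The
# .cpu() call forces a synchronisation and a device-to-host transfer, which is
# what makes this variant slower.
out_cpu = net(obs).cpu()
```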
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 225.9497257460025 | 276.2206137460016 |
2 | 231.38425711599848 | 273.4203483720048 |
3 | 230.72190967400093 | 274.2106625689994 |
mean | 229.35196417866732 | 274.61720822900196 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 205.74492095799724 | 313.0814373590001 |
2 | 223.43292250000013 | 333.1711495269992 |
3 | 230.10195052199924 | 324.7172065039995 |
mean | 219.75993132666554 | 323.6565977966663 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
From these results, we see that the forward pass is faster in the Pytorch version. The only case where the Tensorflow version is faster is when we write the data back to the CPU with every pass. This is as expected as this operation is computationally expensive, and the Tensorflow version keeps the data on the GPU.
Now let's compare a forward pass through the whole network, i.e. including the sampling action, the log probability calculation and the squashing operation.
CPU GLOBAL
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 411.22191030799877 | 536.3658260090015 |
2 | 412.4232401260015 | 513.8089201249968 |
3 | 414.8655814080048 | 525.9684370820032 |
mean | 412.8369106140017 | 525.3810610720005 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 336.8346674320055 | 432.64605384600145 |
2 | 337.65133418299956 | 432.2832743270046 |
3 | 335.77879138099524 | 428.30122067600314 |
mean | 336.7549309986668 | 431.0768496163364 |
Enabling benchmark and fastest does not change the speed.
GPU customised (Without writing back to CPU)
Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 336.0141714460042 | 414.5644553680031 |
2 | 339.7367062159901 | 418.4468067179987 |
3 | 343.35170658299467 | 427.59492996100744 |
mean | 339.7008614149963 | 420.20206401566975 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (With writing back to CPU)
Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 330.5518290850014 | 487.8034812370024 |
2 | 341.3577948439997 | 487.35418161000007 |
3 | 343.21883232100026 | 496.9929247519999 |
mean | 338.3761520833338 | 490.7168625330008 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 335.62528117199963 | 512.721542063 |
2 | 337.9906162549996 | 515.8002393119996 |
3 | 341.41880026500075 | 526.5174871430008 |
mean | 338.34489923066667 | 518.3464228393335 |
From the results above, we see that for the full forward pass Tensorflow is significantly faster. As explained above, this is probably caused by the slower execution speed of the squash operation.
From the results above we can conclude the following:

- The squashing operation is the main reason the full PyTorch forward pass is slower than the TensorFlow one.
- A possible speed-up might come from applying the TorchScript wrapper to the full `learn` method.

Let's examine what could cause the squashing operation to be slower in the Pytorch version.
I double-checked, and the PyTorch and TensorFlow solutions are equivalent. The only difference can be found in the way the two frameworks calculate the log_probability of the sampled actions that come from the squashed Gaussian.
In PyTorch, these log probabilities are calculated by computing them from the non-squashed distribution and then applying a correction for the tanh squashing.
Tensorflow uses the bijectors and TransformedDistribution modules to achieve the same.
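For illustration, a minimal sketch of the TensorFlow approach using a tanh bijector with `TransformedDistribution` (the repository defines its own `SquashBijector`, so this only approximates the actual code):

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Base (non-squashed) diagonal Gaussian.
base_dist = tfp.distributions.MultivariateNormalDiag(
    loc=tf.zeros(3), scale_diag=tf.ones(3)
)

# Squashing the distribution with a tanh bijector; TransformedDistribution
# applies the log-determinant correction to the log probability automatically.
squashed_dist = tfp.distributions.TransformedDistribution(
    distribution=base_dist, bijector=tfp.bijectors.Tanh()
)

action = squashed_dist.sample()
log_prob = squashed_dist.log_prob(action)
```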
As seen above, the two frameworks use the same formulas for calculating the squashed log probabilities.
There are two possible candidates that could cause the extra execution time in the PyTorch version:
Let's first get a baseline by measuring the log_prob execution time. In these tests, PyTorch uses the CPU while TensorFlow uses the default device (GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 130.1151521205902 | 96.50015997886658 |
2 | 133.6686074733734 | 96.55137491226196 |
3 | 129.1776819229126 | 95.73350787162781 |
mean | 130.98714717229207 | 96.26168092091878 |
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 130.4717721939087 | 107.03060293197632 |
2 | 130.64691376686096 | 105.87263202667236 |
3 | 134.26840901374817 | 105.66786861419678 |
mean | 131.79569832483926 | 106.19036785761516 |
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 133.60435914993286 | 162.03667068481445 |
2 | 130.88181281089783 | 158.7844672203064 |
3 | 129.7918701171875 | 160.77405977249146 |
mean | 131.42601402600607 | 160.5317325592041 |
From these results, we see that the biggest increase in execution time occurs when we squash the distribution. To speed up the PyTorch version, we therefore have to see whether we can make this operation more efficient. This speed difference is strange, as the formula that is used for this is the same in the PyTorch and TensorFlow versions:
Tensorflow version https://github.com/rickstaa/LAC_TF2_TORCH_REWRITE/blob/73ad7d206a7d34438f9cf2883e2dee2c3edcded7/LAC_TF2_CLEANED_EAGER_SPEEDUP/squash_bijector.py#L26
As a result, the speed difference can be caused by:

- The `-=` operation or the `.sum()` method.
- The `tf.function` wrapper.
- Using `x += 1` instead of `x = x + 1`.
Let's perform one last investigation on how fast each version is (with the final code) before transferring to the MLC repository. To do this, I perform `0.8e4` steps in the oscillator environment.
CPU:
1: 37.805219650268555 s (8000 steps - seed 0)
2: 36.3125 s (8000 steps - seed 65453)
3: 36.72363305091858 s (8000 steps - seed 3453)
mean: 37 s
GPU:
The old version is using TensorFlow 1.13.0, which only supports Cuda 10.0. Cuda 10.0 is, however not supported on the os I'm using. I, therefore, had to test this on my old pc.
1: 44.91222882270813 s (8000 steps - seed 0)
2: 45.24729943275452 s (8000 steps - seed 65453)
3: 44.684839487075806 s (8000 steps - seed 3453)
mean: 45 s
CPU:
Seeds:
1: 65.01357102394104 s (8000 steps - seed 0)
2: 68.49877715110779 s (8000 steps - seed 65453)
3: 67.3913779258728 s (8000 steps - seed 3453)
mean: 67s
GPU:
Seeds:
1: 65.51777911186218 s (8000 steps - seed 0)
2: 65.25916171073914 s (8000 steps - seed 65453)
3: 65.9152410030365 s (8000 steps - seed 3453)
mean: 65 s
CPU:
1: 40.394707918167114 s (8000 steps - seed 0)
2: 52.60241150856018 s (8000 steps - seed 65453)
3: 71.55074095726013 s (8000 steps - seed 3453)
mean: 55 s
GPU:
1: 65.01357102394104 s (8000 steps - seed 0)
2: 66.7179274559021 s (8000 steps - seed 65453)
3: 74.95780944824219 s (8000 steps - seed 3453)
mean: 69 s
When we do a longer run of 1e5
steps, the difference in speed becomes more clear:
- LAC_TORCH (GPU): 845.13680768013 s
- LAC_TORCH (CPU): 924.6164543628693 s
- LAC_TF2: 573.4957985877991 s
We see that training takes nearly 1.47 times as long when using PyTorch as when using TF2.
When we look at the Spyder reports, we see that the difference is mainly caused by the `step` and `learn`/`call` functions. This is, however, very strange, as the `step` function is the same in both versions. I therefore think the profiler was unable to correctly measure the time of the individual components in the TensorFlow case (which compiles the code that is wrapped by the `tf.function` decorator). Let's try one more time using the function trace.
From this, we can see that the CPU speed really depends on the random seed we use. The GPU speed is more stable. Further, we see that overall the old version is a little bit faster than the new version. It also looks like the TF2 version is a little bit faster than the Torch version. This is not strange, as TensorFlow is using compiled code (via the `tf.function` decorator), whereas PyTorch is using Python code of which the low-level components are written in C++. If we disable the `tf.function` decorator, the TensorFlow code becomes 10 times slower than PyTorch. We can, however, not be 100% certain, as we did not use more seeds. The Torch version on GPU is, however, faster than the TensorFlow version when it is using the GPU. This is as expected, as for the PyTorch version I only put the big networks on the GPU, whereas the TensorFlow version tries to do more things on the GPU. From the reports above, we can also see that the biggest speed difference is found in the forward and backward passes.
A way to speed this up would be to use the TorchScript wrapper. This is currently not yet possible, as TorchScript does not support the distributions that are used in the Gaussian actor (see https://github.com/pytorch/pytorch/issues/18094 and https://github.com/pytorch/pytorch/issues/29843).
Describe the bug
When we run the same experiment in Torch and TF1, the Torch version takes significantly longer.
To Reproduce
Steps to reproduce the behaviour:

1. Run the `train.py` file in the `LAC_CLEANED_TORCH` folder.
2. Run the `train.py` file in the `LAC_TF1_ORIGINAL` folder.

Expected behaviour
Similar speeds.
Screenshots
For training 4 rollouts, we get the following results. Left is Torch, right is TensorFlow:
Desktop (please complete the following information):
Possible debug steps