rickstaa / LAC-TF2-TORCH-translation

Temporary repository to debug what goes wrong during the translation of the LAC algorithm from TF1 to Torch.

LAC Torch version is slower than TF1 #18

Closed rickstaa closed 3 years ago

rickstaa commented 4 years ago

Describe the bug When we run the same experiment in Torch and TF1, the Torch version takes significantly longer.

To Reproduce Steps to reproduce the behaviour:

  1. Run the train.py file in the LAC_CLEANED_TORCH folder.
  2. Run the train.py file in the LAC_TF1_ORIGINAL folder.

Expected behaviour Similar speeds.

Screenshots For training 4 rollouts we get the following results. Left is Torch, right is TensorFlow:

image


Possible debug steps

rickstaa commented 4 years ago

Profiler results

cProfile report

We can inspect the script using the Python cProfile module:

python -m cProfile -o myscript.cprof myscript.py

This resulted in the following report:

torch_train.zip

This report can be opened using the pyprof2calltree -k -i torch_train.cprof command. For this, you need to install the kcachegrind Debian package.

image

From this report, we can see that most of the time is spent running the Adam optimiser and Network forward functions.

Torch.utils.bottleneck report

Below are the results of running the torch.utils.bottleneck utility.
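For reference, the utility can be invoked directly on the training script (the script path is assumed here):

python -m torch.utils.bottleneck train.py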

image image image

Possible improvements

Operations on GPU

According to Takizawa et al. 2009, this only improves speed if your hidden layers contain more than 128 neurons. However, that article was published in 2009 and its example was written in C++. Also, judging from this issue and the paper of Brito et al. 2016, it should not matter much for our small network. However, if we look at the torch.utils.bottleneck report, we can see that the script might be faster if we put both the network and the replay buffer on the GPU.
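As a reference, a minimal sketch of what putting the network and a replay-buffer batch on the GPU looks like in PyTorch (the names and sizes are illustrative, not the actual LAC modules):

import numpy as np
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the network parameters to the GPU.
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 3)).to(device)

# Convert the sampled (float64) replay-buffer batch to float32 tensors on the
# same device so that no CPU<->GPU copies happen inside the learn() step.
s_batch = np.random.randn(256, 8)
s_tensor = torch.as_tensor(s_batch, dtype=torch.float32, device=device)

out = net(s_tensor)  # forward pass fully on the GPU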

Results

For the current network training on the GPU is slower:

image

image

This is likely because we need to copy the return values of the learn() function from the GPU to the CPU to compute the diagnostics. In the Spinning Up version, this is done during logging and thus not at every SGD step.
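A minimal sketch of this idea, keeping the diagnostics on the GPU during training and copying them to the CPU only once at logging time (illustrative names, not the actual learn() implementation):

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.Linear(8, 1).to(device)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

losses = []
for _ in range(100):
    x = torch.randn(32, 8, device=device)
    loss = net(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.detach())  # stays on the GPU; no device sync every SGD step

mean_loss = torch.stack(losses).mean().item()  # single GPU->CPU copy at log time
print(mean_loss)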

GPU bottleneck report

The results of the GPU bottleneck report (nvprof --print-gpu-trace python train.py) are:

System calls bottleneck report

The results of the system calls bottleneck report (strace -fcT -e trace=open,close,read python training.py) are:

rickstaa commented 4 years ago

Check floating point precision

Torch uses float32 as its default dtype; in TensorFlow, the placeholders are also defined as float32. Further, any float64 values returned from the environment (NumPy uses float64 by default) are cast to float32 when the transition dictionary is created in the replay buffer:

transition = {'s': s, 'a': a, 'd': d,'raw_d':raw_d, 'r': np.array([r]), 'terminal': np.array([terminal]), 's_': s_}
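A small sketch of making this float32 assumption explicit on the Torch side (illustrative; the actual replay-buffer code may handle the cast elsewhere):

import numpy as np
import torch

torch.set_default_dtype(torch.float32)  # Torch default, stated here for clarity

s = np.zeros(8)  # NumPy returns float64 by default
s_tensor = torch.as_tensor(s, dtype=torch.float32)  # cast once when building the batch
print(s.dtype, s_tensor.dtype)  # float64 -> torch.float32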
rickstaa commented 4 years ago

LAC_CLEANED_TORCH vs LAC_TF2_CLEANED_EAGER_SPEEDUP speed comparison

As seen in #17, LAC_TF2_CLEANED_EAGER_SPEEDUP is currently the fastest TF version. As a result, we use this version for the comparison with the Torch version. For the environment, we choose the oscillator, as it is relatively simple. During the comparison, we perform five rollouts of 1.1e4 steps each.

Deterministic results

As Torch and TF use different random number generators, it is not possible to make both runs entirely identical and deterministic. The only thing we can make identical is the initial environment state. It is therefore important to remember that part of the speed difference is caused by the initial weights and the policy noise. I will try different random seeds to mitigate this difference.
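For reference, a sketch of seeding everything that can be seeded (illustrative; the actual train.py scripts may handle this differently):

import random

import numpy as np
import tensorflow as tf
import torch

RANDOM_SEED = 0
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
# env.seed(RANDOM_SEED)  # plus the initial environment state (gym-style API)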

LAC_TF2_CLEANED_EAGER_SPEEDUP (using tf.function wrapper)

RANDOM_SEED = 0

run Execution time
1 45.05048489570618
2 43.75153851509094
3 43.895511865615845
4 45.38190507888794
5 43.76770758628845
mean 44.36942958831787

LAC_CLEANED_TORCH (Eager mode)

RANDOM_SEED = 0

run Execution time
1 84.154057264328
2 84.08330941200256
3 84.85945796966553
4 83.90588426589966
5 83.34130358695984
mean 84.06880249977112

Conclusion

From this, it looks like Torch is about two times slower than TensorFlow, but we have to keep in mind that we used the tf.function wrapper in the TensorFlow version. This wrapper compiles the Python code into a static graph before executing it. When we disable this compilation, the TensorFlow version becomes significantly slower than the PyTorch version:

run Execution time
1 1249.4554846286774
2 1248.4554846286774
3 1235.324506999488
mean 1244.4118254189477

This leads to an unfair comparison. To have a fair comparison, we first have to optimize the Torch code using the equivalent torchscript wrapper. This feature is added in #22.
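For context, a minimal illustration of what the tf.function wrapper does (not the LAC code):

import tensorflow as tf

@tf.function  # traces the Python function into a static graph on the first call
def squared_sum(x):
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal((256, 64))
print(squared_sum(x))  # subsequent calls reuse the compiled graph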

rickstaa commented 4 years ago

Torch version speed debug

As can be seen from the report above, although PyTorch is significantly faster than TF2.0 in pure eager mode, it is slower than TF2.0 when the tf.function wrapper is used. This is as expected, as this wrapper compiles the enclosed Python code to a static graph. In this comment, I will investigate whether it is possible to speed up the PyTorch version using the equivalent TorchScript wrapper. I will further try to see if there are other optimisations I can apply to the PyTorch code.

Current profiling results

Let's look at possible speed bottlenecks in the PyTorch code by inspecting the Spyder profiler reports. We do this by letting one agent perform 1.1e4 steps in the oscillator environment, with the GPU disabled.

Execution speed overview

LAC_TF2_CLEANED_EAGER_SPEEDUP (using tf.function wrapper)

RANDOM_SEED = 0

run Execution time
1 52.047799587249756
2 51.3368079662323
3 50.19379186630249
4 50.3440146446228
5 50.664284467697144
mean 50.9173397064209

LAC_CLEANED_TORCH (Eager mode)

RANDOM_SEED = 0

run Execution time
1 83.62926721572876
2 82.74273562431335
3 81.42218828201294
4 82.77582883834839
5 82.26928067207336
mean 82.56786012649536

Profiling comparison

Let's compare the TF2.0 and Pytorch reports.

LAC_TF2_CLEANED_EAGER_SPEEDUP (using tf.function wrapper)

Execution time: 51.663371562957764

image

image

image

image

LAC_CLEANED_TORCH (Eager mode)

image

image

image

image

image

image

image

image

Analysis

From this, we can see that the Pytorch code is on average 1.62 times slower than the optimised TF2.0 code. While making this comparison, it is important to note that in the TF2.0 version the following methods were wrapped with the tf.function wrapper and will therefore show up under the __call__ method:

Step method

First of all, we see that the step function is faster in the PyTorch code. This seems to be because the amax function takes less time in the PyTorch version. This is unexpected, as the same Python, NumPy, and environment code is used; it could be a profiler artefact.

Sample and Store method

Here the results between PyTorch and TensorFlow are similar. This is as expected as the code was not changed.

Learn and choice_action methods

For these methods, TensorFlow is faster as the tf.function wrapper is used. We can further see that most of the time in the PyTorch version is spent in the forward and backward pass. In these functions, the following methods take the most time:

After that, it is the update_target method that takes the most time.

Speed up PyTorch bottlenecks

To speed up the Pytorch version we can first check if there are changes we can make to the python code of the bottlenecks. After that, we can try to use the torchscript wrapper to create a static graph.

Python code improvements

Log_prob

I looked on GitHub, and there is currently no open issue regarding the speed of the log_prob function. I also did not find ways to improve the speed of this method.

Rsample

There are currently two open issues that discuss slowness in the rsample method:

They are not related to our use case but might lead to a speed improvement of the rsample method in the future. For now, however, I could not find a way to speed up this method in eager mode.

Linear and Relu

There are no open issues about speed problems with the nn.Linear and nn.ReLU modules. I also did not find a way to speed these up without using the TorchScript wrapper.

Pytorch related improvements (Apply Torchscript wrapper)

Let's try to speed up the bottleneck functions using the torchscript wrapper. I will do this in the LAC_CLEANED_TORCH_SPEEDUP folder.

Resources

Choose_action and update_target

The TorchScript module in the stable PyTorch release does not yet support the @unused and @ignore decorators when working with normal Python classes. As a result, when using PyTorch 1.6, I cannot speed up the choose_action, update_target and learn methods without rewriting them as separate functions (see this issue). This feature will, however, be included in PyTorch 1.7. In PyTorch 1.7 we still have the problem that we cannot yet convert the GaussianActor to a TorchScript (see below) and thus cannot convert the choose_action, learn and update_target functions, as all functions inside a TorchScript should be TorchScript compatible.
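For reference, a small sketch of how the @torch.jit.ignore decorator is used on an nn.Module (illustrative; as noted above, this does not help for plain Python classes in PyTorch 1.6):

import torch
import torch.nn as nn

class SmallCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(8, 64)
        self.l2 = nn.Linear(64, 1)

    def forward(self, x):
        return self.l2(torch.relu(self.l1(x)))

    @torch.jit.ignore
    def debug_summary(self):
        # Arbitrary Python that TorchScript cannot compile; stays callable in eager mode.
        print(self)

scripted = torch.jit.script(SmallCritic())
print(scripted(torch.randn(4, 8)))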

Speed up LyapunovCritic.call method (Forward and Backward pass)

The TorchScript module can be used to speed up the LyapunovCritic.call method. Doing this, however, changes the execution time only slightly (an improvement of 1-2 seconds).

Speed up GaussianActor.call method (Forward and Backward pass)

The TorchScript module does not yet work with the sampling operation we use in our Gaussian actor (see https://github.com/pytorch/pytorch/issues/29843 and https://github.com/pytorch/pytorch/issues/18094). We can therefore not fully translate the forward method of the Gaussian actor to a TorchScript. We can, however, apply the TorchScript wrapper (https://pytorch.org/docs/stable/jit.html) to the nn.Linear layers.

Effect of using TorchScript

run Execution time
1 83.89904952049255
2 85.36705589294434
3 86.45388126373291
4 85.51795291900635
5 85.36600470542908
mean 85.32078886032104

It appears that changing the forward function of the LyapunovCritic and the linear layers of the GaussianActor into TorchScripts does not improve the performance. As explained in this forum post, this is not strange, as these features already use the C++ backend and therefore do not benefit from the optimisations done in the TorchScript module, like fusing operations. As a result, only applying this wrapper to the full forward method of the GaussianActor, or even to the full learn method, might yield improved performance.

Use GPU

Using the GPU instead of the CPU with our small networks does not improve or decrease the speed much.

Use torch.backends.cudnn.benchmark = True

Sometimes enabling the cuDNN auto-tune algorithm speeds up the training (see this forum post). In our case, this did not help.

Compare Forward/Backward and sample between PyTorch and TensorFlow

As a final step, let's compare the speed of the forward/backward pass and of sampling actions from the distributions between PyTorch and TensorFlow. For this, we create some small dummy timing scripts, which can be found in the sandbox/speed_comparison folder.

Compare sampling execution speed

Let's compare the speed of sampling actions from the Gaussian distribution between PyTorch and TensorFlow. For this, we use the timeit_sample_speed_compare.py script, in which we take 5e5 samples/rsamples from the normal distribution.
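A sketch of what such a timing comparison can look like (the actual timeit_sample_speed_compare.py script may differ); it draws 5e5 (r)samples from a Gaussian in each framework:

import timeit

import tensorflow as tf
import tensorflow_probability as tfp
import torch

N = int(5e5)

torch_dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
tf_dist = tfp.distributions.Normal(loc=tf.zeros(3), scale=tf.ones(3))

torch_time = timeit.timeit(lambda: torch_dist.rsample(), number=N)
tf_time = timeit.timeit(lambda: tf_dist.sample(), number=N)
print(f"torch: {torch_time:.2f} s, tf: {tf_time:.2f} s")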

Sample results

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 47.269880842999555 15.277080029998615
2 46.73242047799795 15.733279996999045
3 47.41759105299934 15.471710162000818
mean 47.13996412466562 15.494023396332826

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU). In PyTorch, this is done using the following command:

torch.set_default_tensor_type('torch.cuda.FloatTensor') # Enable global GPU
run TF execution time Pytorch execution time
1 43.95948519900048 31.157700912997825
2 45.31296913999904 30.35345944400251
3 44.46277614200153 29.964271681001264
mean 44.57841016033368 30.491810679333867

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag. In PyTorch, this is done using the following command:

torch.backends.cudnn.benchmark = True
run TF execution time Pytorch execution time
1 44.35301048200199 29.58910515900061
2 44.9747937630018 29.630526167995413
3 45.64101090400072 29.92008007500408
mean 44.98960504966817 29.713237134000034

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag. In PyTorch, this is done using the following command:

torch.backends.cudnn.fastest = True
run TF execution time Pytorch execution time
1 44.429974582002615 31.8192962969988
2 44.92464860399923 30.395992316996853
3 44.754384061001474 30.31036730899359
mean 44.70300241566777 30.84188530766308

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 43.431389666000086 15.98217932800003
2 43.28244902200004 15.648057652000034
3 41.674334185000134 15.587252373000183
mean 42.79605762433342 15.739163117666749

Rsample results

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 49.16316856100093 17.141713114000595
2 49.98035491799965 17.08107351599756
3 50.76262897200286 17.62022704699848
mean 49.96871748366781 17.281004558998878

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 48.19858287099851 32.428321216997574
2 47.84914048400242 32.55513772999984
3 48.16133704599997 32.596067703998415
mean 48.069686800333635 32.52650888366528

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

run TF execution time Pytorch execution time
1 47.682932835006795 32.69532821299799
2 48.566335363997496 33.00475709999591
3 49.04837610800314 33.51344140100264
mean 48.43254810233581 33.07117557133218


GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

run TF execution time Pytorch execution time
1 47.537661023998226 33.475586044995
2 48.195051800998044 32.84605344099691
3 47.35034869299852 32.90194606300065
mean 47.6943538393316 33.07452851633085

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 47.47512355899994 17.723089199000015
2 47.105311723999876 17.64742597999998
3 48.08153719200027 17.38713361100008
mean 47.55399082500003 17.585882930000025

Conclusion

From these results, we can see that PyTorch is significantly faster than TensorFlow at sampling actions. Further, we see that using the GPU doesn't affect the execution time of TensorFlow, while for PyTorch the execution time gets longer. Even with the GPU enabled, however, the PyTorch execution time stays significantly lower. Enabling the cuDNN benchmark and fastest flags does not improve the execution time.

Compare Log probability execution speed

Now, in addition to this sampling operation, let's also calculate the log_probabilities of the sampled action.

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 50.36903019700185 47.872389495998505
2 51.30757960200208 46.29049172399755
3 49.60291533399868 47.005423079997854
mean 50.42650837766754 47.0561014333313

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 48.91270494000128 100.27575907999926
2 47.13373410799977 98.49456620199999
3 46.99059675100216 99.80839579399981
mean 47.67901193300107 99.52624035866636

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

run TF execution time Pytorch execution time
1 48.4585376219984 101.22743981500389
2 45.78058913299992 97.0079226279995
3 47.77440226700128 99.20072890699521
mean 48.68052208800024 99.14536378333287

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

run TF execution time Pytorch execution time
1 48.1281833979956 101.55883601200185
2 47.37521640000341 100.41289296699688
3 47.830623365000065 99.11197711800196
mean 47.77800772099969 100.3612353656669

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 47.37839867199909 48.91452547500012
2 47.48065972999757 46.51999894999972
3 46.97186975300065 46.97186975300065
mean 47.27697605166577 47.468798059333494

Conclusion

From this, we can see that calculating the log_probability of a sampled action is slightly faster in PyTorch when it is done on the CPU. The execution time increases in PyTorch when the calculations are done on the GPU, while for TensorFlow it decreases. Additionally, this result suggests that the stand-alone log_probability calculation is slower in PyTorch than in TensorFlow, since the sampling operation is faster in PyTorch but the total time is about the same.

Compare stand-alone Log probability function

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 45.78058913299992 38.37326285200015
2 44.57832863599924 38.93101751900031
3 45.58136013500007 38.56264149399976
mean 45.31342596799974 38.62230728833341

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 48.07486991299993 88.60940667599971
2 47.214407829000265 86.43189353300022
3 47.78040611400047 85.64942697800052
mean 47.68989461866689 86.89690906233348

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 45.30441331600014 41.29998277499999
2 45.31698288100051 38.515963581000506
3 44.346509671000604 39.96837626400065
mean 44.98930195600042 39.92810754000038

Conclusion

From the above results, we can see that calculating the log probability is slightly faster in PyTorch when using CPU mode. When GPU mode is used, PyTorch becomes slower. This is expected from the results above. It also shows that the contribution of the sampling operation to the execution time is smaller than that of the log_probability calculation.

Compare Log probability + Squashing execution speed

Now, in addition to calculating the log_probability of the sampled action, let's also squash the sampled action while applying a correction for this squashing to the calculated log_probabilities.
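A minimal sketch of the operation that is timed here, sampling, tanh-squashing, and correcting the log probability for the squashing (illustrative; the details may differ from the LAC code):

import torch

EPS = 1e-6
dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))

u = dist.rsample()                 # unsquashed action
a = torch.tanh(u)                  # squashed action
log_prob = dist.log_prob(u).sum()  # log probability of the unsquashed action
log_prob -= torch.log(1.0 - a.pow(2) + EPS).sum()  # change-of-variables correction
print(a, log_prob)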

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 50.94963947499855 79.64743485200233
2 50.60395075799897 78.01757510300013
3 50.43615571299961 80.14881448499727
mean 50.66324864866571 79.27127481333324

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 50.378254114999436 151.5695584310015
2 48.73778436000066 153.44167755200033
3 48.58900336899751 150.91326868199758
mean 49.2350139479992 151.97483488833313

GPU GLOBAL cudnn.benchmark

Let's enable the benchmark flag.

run TF execution time Pytorch execution time
1 49.12833458800014 152.07146545599971
2 50.353105712005345 151.1550173670039
3 48.28671533699526 154.40869215199928
mean 49.256051879000246 152.54505832500095

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

run TF execution time Pytorch execution time
1 48.219657355999516 156.77407888899324
2 47.961540133997914 152.9459018090056
3 47.866569441997854 153.34349197799747
mean 48.0159223106651 154.35449089199878

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 48.151476297000045 80.33919906399933
2 47.25795930499953 77.96675552500164
3 47.56391130200063 79.42569091699988
mean 47.657782301333405 79.24388183533362

Conclusion

The results above show that the Pytorch version becomes slower than the TensorFlow version when we add the Squash operation. Further, it again shows that enabling the GPU increases the execution time for the Pytorch version while it doesn't change anything for the TensorFlow version.

Compare stand-alone squashing function

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 44.14887383600035 70.58557372399991
2 47.06251954599975 70.10898442500002
3 44.60933404100069 67.97248776999913
mean 45.27357580766693 69.5556819729997

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 47.64849690200026 138.98152366400063
2 49.2268215790009 137.38257430200065
3 47.322883758999524 139.86679378100052
mean 48.06606741333356 138.74363058233394

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 47.66489131900016 69.29213760200037
2 49.14248128899999 68.04381407999972
3 48.16409693700007 67.40893307699844
mean 48.32382318166674 68.24829491966618

Conclusion

From the results above we see that the squashing operation is indeed the bottleneck that causes the PyTorch version to be slower than the TensorFlow version. We therefore have to see whether we can speed up this squashing operation in the PyTorch version.

Compare linear layer forward pass time

Now let's compare a forward pass through the linear layers of the networks.

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

run TF execution time Pytorch execution time
1 375.18870995100224 320.4068507410011
2 369.548168835001 311.17305828499957
3 370.5892158940005 313.39369041700047
mean 373.655545265335 314.9911998143337

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 229.82201903599343 223.34556611099833
2 224.82652258500457 223.9562453960025
3 227.6470035270031 222.9704135100037
mean 227.43184838266703 223.4240750056682

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

run TF execution time Pytorch execution time
1 238.45478896200075 224.40128824699786
2 235.52634897700045 224.19951804900484
3 230.21658990500146 225.0826409250003
mean 234.73257594800089 224.56114907366768

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

run TF execution time Pytorch execution time
1 231.63010256599955 219.30785131300217
2 223.39881358000275 225.78243027799908
3 228.4898673600037 228.4898673600037
mean 227.839594502002 224.52671631700164

GPU customised (Without writing back to CPU)

Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.

run TF execution time Pytorch execution time
1 234.0862072489981 223.88553328599664
2 231.2203368430055 223.88553328599664
3 234.37278411599982 224.01825483499852
mean 233.22644273600113 224.01825483499852

Enabling the cuDNN benchmark or fastest flags does not change the speed.

GPU customised (With writing back to CPU)

Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.

run TF execution time Pytorch execution time
1 225.9497257460025 276.2206137460016
2 231.38425711599848 273.4203483720048
3 230.72190967400093 274.2106625689994
mean 229.35196417866732 274.61720822900196

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 205.74492095799724 313.0814373590001
2 223.43292250000013 333.1711495269992
3 230.10195052199924 324.7172065039995
mean 219.75993132666554 323.6565977966663

Enabling the cuDNN benchmark or fastest flags does not change the speed.

Conclusion

From these results, we see that the forward pass is faster in the Pytorch version. The only case where the Tensorflow version is faster is when we write the data back to the CPU with every pass. This is as expected as this operation is computationally expensive, and the Tensorflow version keeps the data on the GPU.

Compare full networks forward pass time

Now let's compare a forward pass through the whole network, meaning that the sampling, log probability, and squashing operations are included.

CPU GLOBAL

run TF execution time Pytorch execution time
1 411.22191030799877 536.3658260090015
2 412.4232401260015 513.8089201249968
3 414.8655814080048 525.9684370820032
mean 412.8369106140017 525.3810610720005

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

run TF execution time Pytorch execution time
1 336.8346674320055 432.64605384600145
2 337.65133418299956 432.2832743270046
3 335.77879138099524 428.30122067600314
mean 336.7549309986668 431.0768496163364

Enabling benchmark and fastest does not change the speed.

GPU customised (Without writing back to CPU)

Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.

run TF execution time Pytorch execution time
1 336.0141714460042 414.5644553680031
2 339.7367062159901 418.4468067179987
3 343.35170658299467 427.59492996100744
mean 339.7008614149963 420.20206401566975

Enabling the cuDNN benchmark or fastest flags does not change the speed.

GPU customised (With writing back to CPU)

Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.

run TF execution time Pytorch execution time
1 330.5518290850014 487.8034812370024
2 341.3577948439997 487.35418161000007
3 343.21883232100026 496.9929247519999
mean 338.3761520833338 490.7168625330008

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

run TF execution time Pytorch execution time
1 335.62528117199963 512.721542063
2 337.9906162549996 515.8002393119996
3 341.41880026500075 526.5174871430008
mean 338.34489923066667 518.3464228393335

Conclusion

From the results above, we see that for the full forward pass Tensorflow is significantly faster. As explained above, this is probably caused by the slower execution speed of the squash operation.

Overall conclusion

From the results above we can conclude the following:

rickstaa commented 4 years ago

Squashing operation speed improvement. Investigation

Let's examine what could cause the squashing operation to be slower in the Pytorch version.

Are the two solutions equal

I double-checked, and both the PyTorch and TensorFlow solutions are equal. The only difference is in the way the two frameworks calculate the log_probability of the sampled actions that come from the SquashedGaussian.

Pytorch

https://github.com/rickstaa/LAC_TF2_TORCH_REWRITE/blob/9a359f62a9214e828823c437697e0a845a3b3d3f/LAC_CLEANED_TORCH/gaussian_actor.py#L113-L125

In PyTorch, these log probabilities are calculated from the non-squashed distribution, after which a correction for the tanh squashing is applied.

Tensorflow

https://github.com/rickstaa/LAC_TF2_TORCH_REWRITE/blob/9a359f62a9214e828823c437697e0a845a3b3d3f/LAC_TF2_CLEANED_EAGER_SPEEDUP/squash_bijector.py#L18-L27

Tensorflow uses the bijectors and TransformedDistribution modules to achieve the same.
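A minimal TensorFlow Probability sketch of this construction (illustrative; the repository defines its own SquashBijector class instead of the built-in Tanh bijector):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

base = tfd.MultivariateNormalDiag(loc=tf.zeros(3), scale_diag=tf.ones(3))
squashed = tfd.TransformedDistribution(distribution=base, bijector=tfp.bijectors.Tanh())

a = squashed.sample()
print(a, squashed.log_prob(a))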

Where does the speed difference come from

As seen above the two Frameworks use the same formulas for calculating the squashed log probabilities.

Check individual components

There are two possible candidates which cause the extra execution time for the Pytorch version:

Log_prob calculation

Let's first get a baseline by measuring the log_prob execution time. In these tests, PyTorch uses the CPU while TensorFlow uses its default device (GPU).

run TF execution time Pytorch execution time
1 130.1151521205902 96.50015997886658
2 133.6686074733734 96.55137491226196
3 129.1776819229126 95.73350787162781
mean 130.87938650449118 96.26168092091878

Log_prob + action squash

run TF execution time Pytorch execution time
1 130.4717721939087 107.03060293197632
2 130.64691376686096 105.87263202667236
3 134.26840901374817 105.66786861419678
mean 131.79569832483926 106.19036785761516

Log_prob + action squash + distribution squash

run TF execution time Pytorch execution time
1 133.60435914993286 162.03667068481445
2 130.88181281089783 158.7844672203064
3 129.7918701171875 160.77405977249146
mean 131.42601402600607 160.5317325592041

Conclusion

From these results, we see that the biggest increase in execution time occurs when we squash the distribution. To speed up the PyTorch version we therefore have to look at whether we can make this operation more efficient. This speed difference is strange, as the formula used for this is the same in the PyTorch and TensorFlow versions:

Tensorflow version https://github.com/rickstaa/LAC_TF2_TORCH_REWRITE/blob/73ad7d206a7d34438f9cf2883e2dee2c3edcded7/LAC_TF2_CLEANED_EAGER_SPEEDUP/squash_bijector.py#L26

Pytorch version https://github.com/rickstaa/LAC_TF2_TORCH_REWRITE/blob/73ad7d206a7d34438f9cf2883e2dee2c3edcded7/LAC_CLEANED_TORCH_SPEEDUP/gaussian_actor.py#L132-L134

As a result the speed difference can be caused by:

Deeper investigation

Softplus method

Python code
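Presumably the softplus method referred to above is the numerically stable form of the tanh correction (used, for example, in OpenAI Spinning Up), which rewrites log(1 - tanh(u)^2) as 2 * (log(2) - u - softplus(-2u)). A sketch:

import numpy as np
import torch
import torch.nn.functional as F

dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
u = dist.rsample()
a = torch.tanh(u)

log_prob = dist.log_prob(u).sum()
log_prob -= (2.0 * (np.log(2.0) - u - F.softplus(-2.0 * u))).sum()  # stable tanh correction
print(a, log_prob)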

rickstaa commented 3 years ago

Last speed investigation before transferring to MLC

Let's perform one last investigation on how fast each version is (with the final code) before transferring to the MLC repository. To do this, I perform 0.8e4 steps in the oscillator environment.

Quick tests

LAC_ORIGINAL

CPU:


1: 37.805219650268555 s (8000 steps - seed 0)
2: 36.3125 s (8000 steps - seed 65453)
3: 36.72363305091858 s (8000 steps - seed 3453) 
mean: 37 s

GPU:

The old version uses TensorFlow 1.13.0, which only supports CUDA 10.0. CUDA 10.0 is, however, not supported on the OS I'm using. I therefore had to test this on my old PC.

1: 44.91222882270813 s (8000 steps - seed 0)
2: 45.24729943275452 s (8000 steps - seed 65453)
3: 44.684839487075806 s (8000 steps - seed 3453) 
mean: 45 s

LAC_TORCH

CPU:

Seeds:
1: 65.01357102394104 s (8000 steps - seed 0)
2: 68.49877715110779 s (8000 steps - seed 65453)
3: 67.3913779258728 s (8000 steps - seed 3453)
mean: 67s

GPU:

Seeds:
1: 65.51777911186218 s (8000 steps - seed 0)
2: 65.25916171073914 s (8000 steps - seed 65453)
3: 65.9152410030365 s (8000 steps - seed 3453)
mean: 65 s

LAC_TF2

CPU:

1: 40.394707918167114 s (8000 steps - seed 0)
2: 52.60241150856018 s (8000 steps - seed 65453)
3: 71.55074095726013 s (8000 steps - seed 3453)
mean: 55 s

GPU:

1: 65.01357102394104 s (8000 steps - seed 0)
2: 66.7179274559021 s (8000 steps - seed 65453)
3: 74.95780944824219 s (8000 steps - seed 3453)
mean: 69 s

Longer run

When we do a longer run of 1e5 steps, the difference in speed becomes more clear:

LAC_TORCH (GPU): 845.13680768013 s
LAC_TORCH (CPU): 924.6164543628693 s
LAC_TF2: 573.4957985877991 s

We see that training takes nearly 1.47 times as long when using PyTorch as when we use TF2.

Investigation

When we look at the spyder reports, we see that the difference is mainly caused by the step and learn/call function.

image

This is, however, very strange, as the step function is the same in both versions. I therefore think the profiler was unable to correctly measure the time of the individual components in the TensorFlow case (which compiles the code that is wrapped by the tf.function decorator). Let's try one more time using the function trace.

Conclusion

From this, we can see that the CPU speed really depends on the random seed we use. The GPU speed is more stable. Further, we see that overall the old version is a little bit faster than the new version. It further looks like the TF2 version is a little bit faster than the Torch version. This is not strange, as TensorFlow uses compiled code (via the tf.function decorator) whereas PyTorch runs Python code whose low-level components are written in C++. If we disable the tf.function decorator, the TensorFlow code becomes 10 times slower than PyTorch. We can, however, not be 100% certain, as we did not use more seeds. The Torch version, when on the GPU, is faster than the TensorFlow version when it uses the GPU. This is as expected, as for the PyTorch version I only put the big networks on the GPU, whereas the TensorFlow version tries to do more things on the GPU. From the reports above we can also see that the biggest speed difference is found in the forward and backward passes.

A way to speed this up would be to use the TorchScript wrapper. This is currently not yet possible, as the TorchScript wrapper does not support the distributions that are used in the Gaussian actor (see https://github.com/pytorch/pytorch/issues/18094 and https://github.com/pytorch/pytorch/issues/29843).