As mentioned in this issue, the PyTorch version of the LAC algorithm is about 46% slower than the TensorFlow version. The main cause of this difference appeared to be the forward and backward passes through the Gaussian actor network. In this issue, I will investigate what causes this difference.
When looking at the profiler reports, we can see that there are several components that can cause the speed difference. Let's therefore try to break the speed difference down into these individual components. As a final step, let's compare the speed of the forward/backward passes and of sampling actions from the distributions. For this, we can create some small dummy timing scripts, which can be found in the sandbox/speed_comparison folder.
Let's compare the speed of sampling an action from the Gaussian distribution between PyTorch and TensorFlow. For this, we use the timeit_sample_speed_compare.py script, in which we take 5e5 samples/rsamples from the normal distribution. The execution times below are reported in seconds.
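A minimal sketch of what such a timing script can look like (illustrative, not the exact sandbox script; it assumes `tensorflow_probability` is used for the TensorFlow distribution):

```python
import timeit

import tensorflow as tf
import tensorflow_probability as tfp
import torch

N = int(5e5)  # number of samples, as in the experiment

# PyTorch: sample() draws without gradient tracking, while rsample() uses the
# reparameterization trick so gradients can flow through the sample.
torch_dist = torch.distributions.Normal(torch.zeros(1), torch.ones(1))
torch_time = timeit.timeit(torch_dist.rsample, number=N)

# TensorFlow Probability: sample() on a Normal is already reparameterized.
tf_dist = tfp.distributions.Normal(loc=tf.zeros(1), scale=tf.ones(1))
tf_time = timeit.timeit(tf_dist.sample, number=N)

print(f"TF: {tf_time:.2f} s, PyTorch: {torch_time:.2f} s")
```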
CPU GLOBAL
Let's disable GPU for both TensorFlow and PyTorch.
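One way to do this (a sketch; the timing scripts may use a different mechanism):

```python
import os

# Hide all GPUs from both frameworks; must be set before importing them.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf  # noqa: E402
import torch  # noqa: E402  (PyTorch defaults to CPU tensors anyway)
```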
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.269880842999555 | 15.277080029998615 |
2 | 46.73242047799795 | 15.733279996999045 |
3 | 47.41759105299934 | 15.471710162000818 |
mean | 47.13996412466562 | 15.494023396332826 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU). In PyTorch, this is done using the following command:
```python
torch.set_default_tensor_type('torch.cuda.FloatTensor')  # Enable global GPU
```
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 43.95948519900048 | 31.157700912997825 |
2 | 45.31296913999904 | 30.35345944400251 |
3 | 44.46277614200153 | 29.964271681001264 |
mean | 44.57841016033368 | 30.491810679333867 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag. In PyTorch, this is done using the following command:
```python
torch.backends.cudnn.benchmark = True
```
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.35301048200199 | 29.58910515900061 |
2 | 44.9747937630018 | 29.630526167995413 |
3 | 45.64101090400072 | 29.92008007500408 |
mean | 44.98960504966817 | 29.713237134000034 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag. In PyTorch, this is done using the following command:
```python
torch.backends.cudnn.fastest = True
```
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.429974582002615 | 31.8192962969988 |
2 | 44.92464860399923 | 30.395992316996853 |
3 | 44.754384061001474 | 30.31036730899359 |
mean | 44.70300241566777 | 30.84188530766308 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 43.431389666000086 | 15.98217932800003 |
2 | 43.28244902200004 | 15.648057652000034 |
3 | 41.674334185000134 | 15.587252373000183 |
mean | 42.79605762433342 | 15.739163117666749 |
CPU GLOBAL
Let's disable GPU for both TensorFlow and PyTorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 49.16316856100093 | 17.141713114000595 |
2 | 49.98035491799965 | 17.08107351599756 |
3 | 50.76262897200286 | 17.62022704699848 |
mean | 49.96871748366781 | 17.281004558998878 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.19858287099851 | 32.428321216997574 |
2 | 47.84914048400242 | 32.55513772999984 |
3 | 48.16133704599997 | 32.596067703998415 |
mean | 48.069686800333635 | 32.52650888366528 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.682932835006795 | 32.69532821299799 |
2 | 48.566335363997496 | 33.00475709999591 |
3 | 49.04837610800314 | 33.51344140100264 |
mean | 48.43254810233581 | 33.07117557133218 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.537661023998226 | 33.475586044995 |
2 | 48.195051800998044 | 32.84605344099691 |
3 | 47.35034869299852 | 32.90194606300065 |
mean | 47.6943538393316 | 33.07452851633085 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.47512355899994 | 17.723089199000015 |
2 | 47.105311723999876 | 17.64742597999998 |
3 | 48.08153719200027 | 17.38713361100008 |
mean | 47.55399082500003 | 17.585882930000025 |
From these results, we can see that PyTorch is significantly faster than TensorFlow at sampling actions. Further, we see that using the GPU doesn't affect the execution time of TensorFlow, while for PyTorch the execution time gets longer. The execution time, however, stays significantly lower for the PyTorch version even when the GPU is enabled. Enabling the cuDNN benchmark and fastest flags does not improve this execution time, which is expected, as these flags mainly auto-tune convolution algorithms, which are not used here. Next, let's also compute the log probability of the sampled action.
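A sketch of this extra step (illustrative, not the exact script):

```python
import torch

dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
action = dist.rsample()
log_pi = dist.log_prob(action).sum(-1)  # log probability of the sampled action
```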
CPU GLOBAL
Let's disable GPU for both TensorFlow and PyTorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 45.78058913299992 | 38.37326285200015 |
2 | 44.57832863599924 | 38.93101751900031 |
3 | 45.58136013500007 | 38.56264149399976 |
mean | 45.31342596799974 | 38.62230728833341 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.07486991299993 | 88.60940667599971 |
2 | 47.214407829000265 | 86.43189353300022 |
3 | 47.78040611400047 | 85.64942697800052 |
mean | 47.68989461866689 | 86.89690906233348 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 45.30441331600014 | 41.29998277499999 |
2 | 45.31698288100051 | 38.515963581000506 |
3 | 44.346509671000604 | 39.96837626400065 |
mean | 44.98930195600042 | 39.92810754000038 |
From the above results, we can see that calculating the log probability is slightly faster in PyTorch when using CPU mode. When GPU mode is used, PyTorch becomes slower. This is expected given the results above. It also shows that sampling the action contributes less to the execution time than computing the log probability.
Now, in addition to calculating the log probability of the sampled action, let's also squash the computed action while applying a correction for this squashing to the calculated log probabilities.
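A sketch of this squashing step with the usual change-of-variables correction on the log probability (as in the SAC paper; names and sizes are illustrative):

```python
import torch

dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
u = dist.rsample()                 # unsquashed action
a = torch.tanh(u)                  # squashed action in (-1, 1)
log_pi = dist.log_prob(u).sum(-1)
# Correction: subtract the log-determinant of the tanh Jacobian,
# log(1 - tanh(u)^2), with a small epsilon for numerical stability.
log_pi -= torch.log(1 - a.pow(2) + 1e-6).sum(-1)
```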
CPU GLOBAL
Let's disable GPU for both TensorFlow and PyTorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.94963947499855 | 79.64743485200233 |
2 | 50.60395075799897 | 78.01757510300013 |
3 | 50.43615571299961 | 80.14881448499727 |
mean | 50.66324864866571 | 79.27127481333324 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.378254114999436 | 151.5695584310015 |
2 | 48.73778436000066 | 153.44167755200033 |
3 | 48.58900336899751 | 150.91326868199758 |
mean | 49.2350139479992 | 151.97483488833313 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 49.12833458800014 | 152.07146545599971 |
2 | 50.353105712005345 | 151.1550173670039 |
3 | 48.28671533699526 | 154.40869215199928 |
mean | 49.256051879000246 | 152.54505832500095 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.219657355999516 | 156.77407888899324 |
2 | 47.961540133997914 | 152.9459018090056 |
3 | 47.866569441997854 | 153.34349197799747 |
mean | 48.0159223106651 | 154.35449089199878 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.151476297000045 | 80.33919906399933 |
2 | 47.25795930499953 | 77.96675552500164 |
3 | 47.56391130200063 | 79.42569091699988 |
mean | 47.657782301333405 | 79.24388183533362 |
The results above show that the PyTorch version becomes slower than the TensorFlow version when we add the squash operation. Further, it again shows that enabling the GPU increases the execution time for the PyTorch version, while it doesn't change anything for the TensorFlow version. Now let's look at the squashing operation on its own, to isolate its contribution.
CPU GLOBAL
Let's disable GPU for both TensorFlow and PyTorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.14887383600035 | 70.58557372399991 |
2 | 47.06251954599975 | 70.10898442500002 |
3 | 44.60933404100069 | 67.97248776999913 |
mean | 45.27357580766693 | 69.5556819729997 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.64849690200026 | 138.98152366400063 |
2 | 49.2268215790009 | 137.38257430200065 |
3 | 47.322883758999524 | 139.86679378100052 |
mean | 48.06606741333356 | 138.74363058233394 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.66489131900016 | 69.29213760200037 |
2 | 49.14248128899999 | 68.04381407999972 |
3 | 48.16409693700007 | 67.40893307699844 |
mean | 48.32382318166674 | 68.24829491966618 |
From the results above, we see that the squashing operation is indeed the bottleneck that causes the PyTorch version to be slower than the TensorFlow version. We therefore have to look into whether we can speed up this squashing in the PyTorch version.
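One candidate to investigate (a sketch): the algebraically equivalent but numerically stable formulation used in, e.g., OpenAI Spinning Up's SAC implementation, which avoids computing `log(1 - tanh(u)^2)` directly:

```python
import math

import torch
import torch.nn.functional as F

dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
u = dist.rsample()
log_pi = dist.log_prob(u).sum(-1)
# Identity: log(1 - tanh(u)^2) = 2 * (log(2) - u - softplus(-2u))
log_pi -= (2 * (math.log(2) - u - F.softplus(-2 * u))).sum(-1)
a = torch.tanh(u)
```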
Now let's compare a forward pass through the linear layers of the networks.
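A sketch of the PyTorch side of such a timing script (layer sizes and batch size are illustrative):

```python
import timeit

import torch
import torch.nn as nn

# Hypothetical stand-in for the actor's hidden layers.
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
obs = torch.randn(256, 8)

forward_time = timeit.timeit(lambda: net(obs), number=int(5e5))
print(f"PyTorch forward: {forward_time:.2f} s")
```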
CPU GLOBAL
Let's disable GPU for both TensorFlow and PyTorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 375.18870995100224 | 320.4068507410011 |
2 | 369.548168835001 | 311.17305828499957 |
3 | 370.5892158940005 | 313.39369041700047 |
mean | 371.7753648933346 | 314.9911998143337 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 229.82201903599343 | 223.34556611099833 |
2 | 224.82652258500457 | 223.9562453960025 |
3 | 227.6470035270031 | 222.9704135100037 |
mean | 227.43184838266703 | 223.4240750056682 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 238.45478896200075 | 224.40128824699786 |
2 | 235.52634897700045 | 224.19951804900484 |
3 | 230.21658990500146 | 225.0826409250003 |
mean | 234.73257594800089 | 224.56114907366768 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 231.63010256599955 | 219.30785131300217 |
2 | 223.39881358000275 | 225.78243027799908 |
3 | 228.4898673600037 | 228.4898673600037 |
mean | 227.839594502002 | 224.52671631700164 |
GPU customised (Without writing back to CPU)
Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 234.0862072489981 | 223.88553328599664 |
2 | 231.2203368430055 | 223.88553328599664 |
3 | 234.37278411599982 | 224.01825483499852 |
mean | 233.22644273600113 | 223.9297738023306 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (With writing back to CPU)
Now let's use GPU while also writing the results back from the GPU to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 225.9497257460025 | 276.2206137460016 |
2 | 231.38425711599848 | 273.4203483720048 |
3 | 230.72190967400093 | 274.2106625689994 |
mean | 229.35196417866732 | 274.61720822900196 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
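For reference, a minimal sketch of the difference between the two customised variants (illustrative names and sizes):

```python
import torch

actor = torch.nn.Linear(8, 3).cuda()   # hypothetical network on the GPU
obs = torch.randn(256, 8, device="cuda")

out = actor(obs)          # without writing back: result stays on the GPU
out_cpu = actor(obs).cpu()  # with writing back: an explicit device-to-host
                            # copy that synchronizes the GPU on every pass
```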
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 205.74492095799724 | 313.0814373590001 |
2 | 223.43292250000013 | 333.1711495269992 |
3 | 230.10195052199924 | 324.7172065039995 |
mean | 219.75993132666554 | 323.6565977966663 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
From these results, we see that the forward pass is faster in the PyTorch version. The only case where the TensorFlow version is faster is when we write the data back to the CPU on every pass. This is as expected, as this operation is computationally expensive and the TensorFlow version keeps the data on the GPU.
Now let's compare a forward pass through the whole network, which means including sampling the action, computing its log probability, and squashing the action.
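A sketch of what one such full forward pass looks like in PyTorch (illustrative architecture and sizes, not the actual LAC actor):

```python
import torch
import torch.nn as nn


class GaussianActor(nn.Module):
    def __init__(self, obs_dim=8, act_dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.net(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std(h).exp())
        u = dist.rsample()                                  # sample
        a = torch.tanh(u)                                   # squash
        log_pi = dist.log_prob(u).sum(-1)                   # log probability
        log_pi -= torch.log(1 - a.pow(2) + 1e-6).sum(-1)    # squash correction
        return a, log_pi
```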
CPU GLOBAL
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 411.22191030799877 | 536.3658260090015 |
2 | 412.4232401260015 | 513.8089201249968 |
3 | 414.8655814080048 | 525.9684370820032 |
mean | 412.8369106140017 | 525.3810610720005 |
GPU GLOBAL
Let's force the GPU to be used in PyTorch and use the default TensorFlow behaviour (use the GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 336.8346674320055 | 432.64605384600145 |
2 | 337.65133418299956 | 432.2832743270046 |
3 | 335.77879138099524 | 428.30122067600314 |
mean | 336.7549309986668 | 431.0768496163364 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (Without writing back to CPU)
Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 336.0141714460042 | 414.5644553680031 |
2 | 339.7367062159901 | 418.4468067179987 |
3 | 343.35170658299467 | 427.59492996100744 |
mean | 339.7008614149963 | 420.20206401566975 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (With writing back to CPU)
Now let's use GPU while also writing the results back from the GPU to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 330.5518290850014 | 487.8034812370024 |
2 | 341.3577948439997 | 487.35418161000007 |
3 | 343.21883232100026 | 496.9929247519999 |
mean | 338.3761520833338 | 490.7168625330008 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 335.62528117199963 | 512.721542063 |
2 | 337.9906162549996 | 515.8002393119996 |
3 | 341.41880026500075 | 526.5174871430008 |
mean | 338.34489923066667 | 518.3464228393335 |
From the results above, we see that for the full forward pass TensorFlow is significantly faster. As explained above, this is probably caused by the slower execution speed of the squash operation.
From the results above, we can conclude that the remaining speed difference likely comes from the lack of something like `tf.function` (graph compilation) or from other inefficient functions in the PyTorch `learn` method.
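For context, a toy illustration of the kind of graph compilation `tf.function` provides (the decorated function is a hypothetical stand-in, not the actual LAC learn method; TorchScript would be the closest PyTorch analogue):

```python
import tensorflow as tf

# Tracing the update step into a graph removes most per-call Python overhead.
@tf.function
def train_step(x):
    return tf.tanh(x) * 2.0
```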
Closed due to inactivity.