rickstaa / torch-tf2-lac-speed-compare

Repository used to investigate Pytorch TF2.x speed differences.

Investigate the observed speed difference in the LAC algorithm between Pytorch and Tensorflow 2.x #1

Closed rickstaa closed 3 years ago

rickstaa commented 4 years ago

As mentioned in this issue, the PyTorch version of the LAC algorithm is about 46% slower than the TensorFlow version. The main cause of this difference appeared to lie in the forward and backward passes through the Gaussian actor network. In this issue, I will investigate what causes this difference.


rickstaa commented 3 years ago

When looking at the profiler reports, we can see there are several components that could cause the speed difference:

Let's therefore investigate the speed difference for each of these components individually.

Compare Forward/Backward and sample between PyTorch and TensorFlow

As a final step, let's compare the speed of the forward/backward pass and of sampling actions from the distributions. For this, we can create some small dummy timing scripts. These scripts can be found in the sandbox/speed_comparison folder.

Compare sampling execution speed

Let's compare the speed of sampling actions from the Gaussian distribution between PyTorch and TensorFlow. For this, we use the timeit_sample_speed_compare.py script. In this script, we take 5e5 samples/rsamples from the normal distribution.
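A minimal version of such a timing script for the PyTorch side might look as follows (a sketch, not the repo's exact timeit_sample_speed_compare.py; the distribution size and iteration count are assumptions):

```python
import timeit

import torch


def time_sampling(n_iter=1000, reparameterize=False):
    """Time drawing batches of actions from a diagonal Gaussian."""
    dist = torch.distributions.Normal(torch.zeros(256), torch.ones(256))
    draw = dist.rsample if reparameterize else dist.sample
    return timeit.timeit(draw, number=n_iter)


t_sample = time_sampling()
t_rsample = time_sampling(reparameterize=True)
print(f"sample: {t_sample:.4f}s, rsample: {t_rsample:.4f}s")
```

The equivalent TensorFlow script would time `tfp.distributions.Normal(...).sample()` over the same number of iterations.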

Sample results

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.269880842999555 | 15.277080029998615 |
| 2 | 46.73242047799795 | 15.733279996999045 |
| 3 | 47.41759105299934 | 15.471710162000818 |
| mean | 47.13996412466562 | 15.494023396332826 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU). In PyTorch, this is done using the following command:

```python
torch.set_default_tensor_type('torch.cuda.FloatTensor')  # Enable global GPU
```

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 43.95948519900048 | 31.157700912997825 |
| 2 | 45.31296913999904 | 30.35345944400251 |
| 3 | 44.46277614200153 | 29.964271681001264 |
| mean | 44.57841016033368 | 30.491810679333867 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag. In PyTorch, this is done using the following command:

```python
torch.backends.cudnn.benchmark = True
```

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 44.35301048200199 | 29.58910515900061 |
| 2 | 44.9747937630018 | 29.630526167995413 |
| 3 | 45.64101090400072 | 29.92008007500408 |
| mean | 44.98960504966817 | 29.713237134000034 |

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag. In PyTorch, this is done using the following command:

```python
torch.backends.cudnn.fastest = True
```

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 44.429974582002615 | 31.8192962969988 |
| 2 | 44.92464860399923 | 30.395992316996853 |
| 3 | 44.754384061001474 | 30.31036730899359 |
| mean | 44.70300241566777 | 30.84188530766308 |

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 43.431389666000086 | 15.98217932800003 |
| 2 | 43.28244902200004 | 15.648057652000034 |
| 3 | 41.674334185000134 | 15.587252373000183 |
| mean | 42.79605762433342 | 15.739163117666749 |

Rsample results

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 49.16316856100093 | 17.141713114000595 |
| 2 | 49.98035491799965 | 17.08107351599756 |
| 3 | 50.76262897200286 | 17.62022704699848 |
| mean | 49.96871748366781 | 17.281004558998878 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.19858287099851 | 32.428321216997574 |
| 2 | 47.84914048400242 | 32.55513772999984 |
| 3 | 48.16133704599997 | 32.596067703998415 |
| mean | 48.069686800333635 | 32.52650888366528 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.682932835006795 | 32.69532821299799 |
| 2 | 48.566335363997496 | 33.00475709999591 |
| 3 | 49.04837610800314 | 33.51344140100264 |
| mean | 48.43254810233581 | 33.07117557133218 |


GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.537661023998226 | 33.475586044995 |
| 2 | 48.195051800998044 | 32.84605344099691 |
| 3 | 47.35034869299852 | 32.90194606300065 |
| mean | 47.6943538393316 | 33.07452851633085 |

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.47512355899994 | 17.723089199000015 |
| 2 | 47.105311723999876 | 17.64742597999998 |
| 3 | 48.08153719200027 | 17.38713361100008 |
| mean | 47.55399082500003 | 17.585882930000025 |

Conclusion

From these results, we can see that PyTorch is significantly faster than TensorFlow at sampling actions. Further, we see that using the GPU doesn't affect the execution time of TensorFlow, while for PyTorch the execution time gets longer. The execution time, however, stays significantly lower for the PyTorch version even when the GPU is enabled. This execution time does not improve when enabling the cuDNN benchmark and fastest flags.

Compare stand-alone Log probability function
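
As a rough sketch of what is being timed here (hypothetical; it assumes the same diagonal Gaussian as in the sampling comparison):

```python
import torch

# Evaluate the Gaussian log-probability of a sampled action batch
# (stand-alone, i.e. without the tanh squashing correction).
dist = torch.distributions.Normal(torch.zeros(256), torch.ones(256))
action = dist.sample()
log_prob = dist.log_prob(action).sum()  # sum over action dimensions
```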

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 45.78058913299992 | 38.37326285200015 |
| 2 | 44.57832863599924 | 38.93101751900031 |
| 3 | 45.58136013500007 | 38.56264149399976 |
| mean | 45.31342596799974 | 38.62230728833341 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.07486991299993 | 88.60940667599971 |
| 2 | 47.214407829000265 | 86.43189353300022 |
| 3 | 47.78040611400047 | 85.64942697800052 |
| mean | 47.68989461866689 | 86.89690906233348 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 45.30441331600014 | 41.29998277499999 |
| 2 | 45.31698288100051 | 38.515963581000506 |
| 3 | 44.346509671000604 | 39.96837626400065 |
| mean | 44.98930195600042 | 39.92810754000038 |

Conclusion

From the above results, we can see that calculating the log-probability is slightly faster in PyTorch when using CPU mode. When GPU mode is used, PyTorch becomes slower. This is expected given the results above. It also shows that the contribution of the sampling operation to the execution time is smaller than that of the log-probability calculation.

Compare Log probability + Squashing execution speed

Now, in addition to calculating the log-probability of the sampled action, let's also squash the computed action while applying a correction for this squashing to the calculated log-probabilities.
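
The squashing correction being timed is the standard tanh change-of-variables adjustment (as used in SAC-style squashed Gaussians). A minimal pure-Python sketch of the idea, not the repo's exact code:

```python
import math


def stable_softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def squash_log_prob(u, log_prob_u):
    """Squash a pre-squash action u through tanh and correct its
    log-probability via the change-of-variables formula:
        log p(a) = log p(u) - sum_i log(1 - tanh(u_i)^2),
    using the numerically stable identity
        log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x)).
    """
    a = [math.tanh(x) for x in u]
    correction = sum(
        2.0 * (math.log(2.0) - x - stable_softplus(-2.0 * x)) for x in u
    )
    return a, log_prob_u - correction
```

This extra per-element transcendental work is what gets added on top of the plain log-probability calculation in this comparison.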

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 50.94963947499855 | 79.64743485200233 |
| 2 | 50.60395075799897 | 78.01757510300013 |
| 3 | 50.43615571299961 | 80.14881448499727 |
| mean | 50.66324864866571 | 79.27127481333324 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 50.378254114999436 | 151.5695584310015 |
| 2 | 48.73778436000066 | 153.44167755200033 |
| 3 | 48.58900336899751 | 150.91326868199758 |
| mean | 49.2350139479992 | 151.97483488833313 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 49.12833458800014 | 152.07146545599971 |
| 2 | 50.353105712005345 | 151.1550173670039 |
| 3 | 48.28671533699526 | 154.40869215199928 |
| mean | 49.256051879000246 | 152.54505832500095 |

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.219657355999516 | 156.77407888899324 |
| 2 | 47.961540133997914 | 152.9459018090056 |
| 3 | 47.866569441997854 | 153.34349197799747 |
| mean | 48.0159223106651 | 154.35449089199878 |

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.151476297000045 | 80.33919906399933 |
| 2 | 47.25795930499953 | 77.96675552500164 |
| 3 | 47.56391130200063 | 79.42569091699988 |
| mean | 47.657782301333405 | 79.24388183533362 |

Conclusion

The results above show that the PyTorch version becomes slower than the TensorFlow version when we add the squash operation. Further, it again shows that enabling the GPU increases the execution time for the PyTorch version while it doesn't change anything for the TensorFlow version.

Compare stand-alone squashing function

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 44.14887383600035 | 70.58557372399991 |
| 2 | 47.06251954599975 | 70.10898442500002 |
| 3 | 44.60933404100069 | 67.97248776999913 |
| mean | 45.27357580766693 | 69.5556819729997 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.64849690200026 | 138.98152366400063 |
| 2 | 49.2268215790009 | 137.38257430200065 |
| 3 | 47.322883758999524 | 139.86679378100052 |
| mean | 48.06606741333356 | 138.74363058233394 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.66489131900016 | 69.29213760200037 |
| 2 | 49.14248128899999 | 68.04381407999972 |
| 3 | 48.16409693700007 | 67.40893307699844 |
| mean | 48.32382318166674 | 68.24829491966618 |

Conclusion

From the results above, we see that the squashing operation is indeed the bottleneck that causes the PyTorch version to be slower than the TensorFlow version. We therefore have to investigate whether we can speed up this squashing operation in the PyTorch version.

Compare linear layer forward pass time

Now let's compare a forward pass through the linear layers of the networks.
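A sketch of the PyTorch side of this timing (the layer sizes, batch size, and iteration count are assumptions, not the repo's exact values):

```python
import timeit

import torch
import torch.nn as nn

# Time forward passes through a small stack of linear layers,
# similar to the hidden layers of the actor network.
net = nn.Sequential(
    nn.Linear(8, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
batch = torch.randn(64, 8)

with torch.no_grad():
    elapsed = timeit.timeit(lambda: net(batch), number=1000)
print(f"1000 forward passes took {elapsed:.4f}s")
```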

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 375.18870995100224 | 320.4068507410011 |
| 2 | 369.548168835001 | 311.17305828499957 |
| 3 | 370.5892158940005 | 313.39369041700047 |
| mean | 371.77536489333457 | 314.9911998143337 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 229.82201903599343 | 223.34556611099833 |
| 2 | 224.82652258500457 | 223.9562453960025 |
| 3 | 227.6470035270031 | 222.9704135100037 |
| mean | 227.43184838266703 | 223.4240750056682 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 238.45478896200075 | 224.40128824699786 |
| 2 | 235.52634897700045 | 224.19951804900484 |
| 3 | 230.21658990500146 | 225.0826409250003 |
| mean | 234.73257594800089 | 224.56114907366768 |

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 231.63010256599955 | 219.30785131300217 |
| 2 | 223.39881358000275 | 225.78243027799908 |
| 3 | 228.4898673600037 | 228.4898673600037 |
| mean | 227.839594502002 | 224.52671631700164 |

GPU customised (Without writing back to CPU)

Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 234.0862072489981 | 223.88553328599664 |
| 2 | 231.2203368430055 | 223.88553328599664 |
| 3 | 234.37278411599982 | 224.01825483499852 |
| mean | 233.22644273600113 | 223.9297738023306 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

GPU customised (With writing back to CPU)

Now let's use the GPU while also writing the results back from the GPU to the CPU at every forward pass.
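
The difference between the two GPU variants comes down to whether a device-to-host copy is issued after each pass. A sketch of the distinction (hypothetical layer; it falls back to the CPU when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Linear(8, 256).to(device)
batch = torch.randn(64, 8, device=device)

out_on_device = net(batch)      # result stays on the compute device
out_on_host = net(batch).cpu()  # extra device-to-host copy every pass
```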

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 225.9497257460025 | 276.2206137460016 |
| 2 | 231.38425711599848 | 273.4203483720048 |
| 3 | 230.72190967400093 | 274.2106625689994 |
| mean | 229.35196417866732 | 274.61720822900196 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 205.74492095799724 | 313.0814373590001 |
| 2 | 223.43292250000013 | 333.1711495269992 |
| 3 | 230.10195052199924 | 324.7172065039995 |
| mean | 219.75993132666554 | 323.6565977966663 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

Conclusion

From these results, we see that the forward pass is faster in the PyTorch version. The only case where the TensorFlow version is faster is when we write the data back to the CPU with every pass. This is expected, as this operation is computationally expensive and the TensorFlow version keeps the data on the GPU.

Compare full networks forward pass time

Now let's compare a forward pass through the whole network, i.e. including the sampling, log-probability and squashing operations.
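
Putting the pieces together, the full forward pass being timed roughly corresponds to the following squashed-Gaussian actor (a sketch; the layer sizes are assumptions, not the repo's exact architecture):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianActor(nn.Module):
    """Minimal squashed-Gaussian actor: linear layers, rsample,
    log-probability and tanh squashing with correction."""

    def __init__(self, obs_dim=8, act_dim=3, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.trunk(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std(h).exp())
        u = dist.rsample()                 # reparameterized sample
        log_pi = dist.log_prob(u).sum(-1)  # Gaussian log-probability
        # tanh squash + change-of-variables correction
        log_pi -= (2 * (math.log(2) - u - F.softplus(-2 * u))).sum(-1)
        return torch.tanh(u), log_pi


actor = GaussianActor()
action, log_pi = actor(torch.randn(4, 8))
```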

CPU GLOBAL

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 411.22191030799877 | 536.3658260090015 |
| 2 | 412.4232401260015 | 513.8089201249968 |
| 3 | 414.8655814080048 | 525.9684370820032 |
| mean | 412.8369106140017 | 525.3810610720005 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 336.8346674320055 | 432.64605384600145 |
| 2 | 337.65133418299956 | 432.2832743270046 |
| 3 | 335.77879138099524 | 428.30122067600314 |
| mean | 336.7549309986668 | 431.0768496163364 |

Enabling the cuDNN benchmark and fastest flags does not change the speed.

GPU customised (Without writing back to CPU)

Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 336.0141714460042 | 414.5644553680031 |
| 2 | 339.7367062159901 | 418.4468067179987 |
| 3 | 343.35170658299467 | 427.59492996100744 |
| mean | 339.7008614149963 | 420.20206401566975 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

GPU customised (With writing back to CPU)

Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 330.5518290850014 | 487.8034812370024 |
| 2 | 341.3577948439997 | 487.35418161000007 |
| 3 | 343.21883232100026 | 496.9929247519999 |
| mean | 338.3761520833338 | 490.7168625330008 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 335.62528117199963 | 512.721542063 |
| 2 | 337.9906162549996 | 515.8002393119996 |
| 3 | 341.41880026500075 | 526.5174871430008 |
| mean | 338.34489923066667 | 518.3464228393335 |

Conclusion

From the results above, we see that for the full forward pass, TensorFlow is significantly faster. As explained above, this is probably caused by the slower execution speed of the squash operation.

Overall conclusion

From the results above we can conclude the following:

rickstaa commented 3 years ago

Closed due to inactivity.