rickstaa / torch-tf2-lac-speed-compare

Repository used to investigate Pytorch TF2.x speed differences.

Investigate the observed speed difference in the LAC algorithm between Pytorch and Tensorflow 2.x #1

Closed rickstaa closed 3 years ago

rickstaa commented 4 years ago

As mentioned in this issue, the PyTorch version of the LAC algorithm is about 46% slower than the TensorFlow version. The main cause of this difference appeared to lie in the forward and backward passes through the Gaussian actor network. In this issue, I will investigate what causes this difference.


rickstaa commented 3 years ago

When looking at the profiler reports, we can see there are several components that could cause the speed difference:

Let's therefore investigate the speed difference for each of these components individually.

Compare Forward/Backward and sample between PyTorch and TensorFlow

As a final step, let's compare the speed of the forward/backward pass and of sampling actions from the distributions. For this, we can create some small dummy timing scripts. These scripts can be found in the sandbox/speed_comparison folder.

Compare sampling execution speed

Let's compare the speed of sampling actions from the Gaussian distribution between PyTorch and TensorFlow. For this, we use the timeit_sample_speed_compare.py script. In this script, we take 5e5 samples/rsamples from the normal distribution.
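A minimal version of such a timing script for the PyTorch side might look as follows (a sketch, not the repo's exact timeit_sample_speed_compare.py; the distribution size and iteration count are assumptions):

```python
import timeit

import torch


def time_sampling(n_iter=1000, reparameterize=False):
    """Time drawing batches of actions from a diagonal Gaussian."""
    dist = torch.distributions.Normal(torch.zeros(256), torch.ones(256))
    draw = dist.rsample if reparameterize else dist.sample
    return timeit.timeit(draw, number=n_iter)


t_sample = time_sampling()
t_rsample = time_sampling(reparameterize=True)
print(f"sample: {t_sample:.4f}s, rsample: {t_rsample:.4f}s")
```

The equivalent TensorFlow script would time `tfp.distributions.Normal(...).sample()` over the same number of iterations.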

Sample results

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.269880842999555 | 15.277080029998615 |
| 2 | 46.73242047799795 | 15.733279996999045 |
| 3 | 47.41759105299934 | 15.471710162000818 |
| mean | 47.13996412466562 | 15.494023396332826 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU). In PyTorch, this is done using the following command:

```python
torch.set_default_tensor_type('torch.cuda.FloatTensor')  # Enable global GPU
```

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 43.95948519900048 | 31.157700912997825 |
| 2 | 45.31296913999904 | 30.35345944400251 |
| 3 | 44.46277614200153 | 29.964271681001264 |
| mean | 44.57841016033368 | 30.491810679333867 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag. In PyTorch, this is done using the following command:

```python
torch.backends.cudnn.benchmark = True
```

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 44.35301048200199 | 29.58910515900061 |
| 2 | 44.9747937630018 | 29.630526167995413 |
| 3 | 45.64101090400072 | 29.92008007500408 |
| mean | 44.98960504966817 | 29.713237134000034 |

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag. In PyTorch, this is done using the following command:

```python
torch.backends.cudnn.fastest = True
```

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 44.429974582002615 | 31.8192962969988 |
| 2 | 44.92464860399923 | 30.395992316996853 |
| 3 | 44.754384061001474 | 30.31036730899359 |
| mean | 44.70300241566777 | 30.84188530766308 |

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 43.431389666000086 | 15.98217932800003 |
| 2 | 43.28244902200004 | 15.648057652000034 |
| 3 | 41.674334185000134 | 15.587252373000183 |
| mean | 42.79605762433342 | 15.739163117666749 |

Rsample results

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 49.16316856100093 | 17.141713114000595 |
| 2 | 49.98035491799965 | 17.08107351599756 |
| 3 | 50.76262897200286 | 17.62022704699848 |
| mean | 49.96871748366781 | 17.281004558998878 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.19858287099851 | 32.428321216997574 |
| 2 | 47.84914048400242 | 32.55513772999984 |
| 3 | 48.16133704599997 | 32.596067703998415 |
| mean | 48.069686800333635 | 32.52650888366528 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.682932835006795 | 32.69532821299799 |
| 2 | 48.566335363997496 | 33.00475709999591 |
| 3 | 49.04837610800314 | 33.51344140100264 |
| mean | 48.43254810233581 | 33.07117557133218 |


GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.537661023998226 | 33.475586044995 |
| 2 | 48.195051800998044 | 32.84605344099691 |
| 3 | 47.35034869299852 | 32.90194606300065 |
| mean | 47.6943538393316 | 33.07452851633085 |

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.47512355899994 | 17.723089199000015 |
| 2 | 47.105311723999876 | 17.64742597999998 |
| 3 | 48.08153719200027 | 17.38713361100008 |
| mean | 47.55399082500003 | 17.585882930000025 |

Conclusion

From these results, we can see that PyTorch is significantly faster than TensorFlow at sampling actions. Further, we see that using the GPU doesn't affect the execution time of TensorFlow, while for PyTorch the execution time gets longer. The execution time, however, stays significantly lower for the PyTorch version even when the GPU is enabled. This execution time does not improve when enabling the cuDNN benchmark and fastest flags.

Compare stand-alone Log probability function
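
As a rough sketch of what is being timed here (hypothetical; it assumes the same diagonal Gaussian as in the sampling comparison):

```python
import torch

# Evaluate the Gaussian log-probability of a sampled action batch
# (stand-alone, i.e. without the tanh squashing correction).
dist = torch.distributions.Normal(torch.zeros(256), torch.ones(256))
action = dist.sample()
log_prob = dist.log_prob(action).sum()  # sum over action dimensions
```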

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 45.78058913299992 | 38.37326285200015 |
| 2 | 44.57832863599924 | 38.93101751900031 |
| 3 | 45.58136013500007 | 38.56264149399976 |
| mean | 45.31342596799974 | 38.62230728833341 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.07486991299993 | 88.60940667599971 |
| 2 | 47.214407829000265 | 86.43189353300022 |
| 3 | 47.78040611400047 | 85.64942697800052 |
| mean | 47.68989461866689 | 86.89690906233348 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 45.30441331600014 | 41.29998277499999 |
| 2 | 45.31698288100051 | 38.515963581000506 |
| 3 | 44.346509671000604 | 39.96837626400065 |
| mean | 44.98930195600042 | 39.92810754000038 |

Conclusion

From the above results, we can see that calculating the log-probability is slightly faster in PyTorch when using CPU mode. When GPU mode is used, PyTorch becomes slower. This is expected given the results above. It also shows that the contribution of the sampling operation to the execution time is smaller than that of the log-probability calculation.

Compare Log probability + Squashing execution speed

Now, in addition to calculating the log-probability of the sampled action, let's also squash the computed action while applying a correction for this squashing to the calculated log-probabilities.
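
The squashing correction being timed is the standard tanh change-of-variables adjustment (as used in SAC-style squashed Gaussians). A minimal pure-Python sketch of the idea, not the repo's exact code:

```python
import math


def stable_softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def squash_log_prob(u, log_prob_u):
    """Squash a pre-squash action u through tanh and correct its
    log-probability via the change-of-variables formula:
        log p(a) = log p(u) - sum_i log(1 - tanh(u_i)^2),
    using the numerically stable identity
        log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x)).
    """
    a = [math.tanh(x) for x in u]
    correction = sum(
        2.0 * (math.log(2.0) - x - stable_softplus(-2.0 * x)) for x in u
    )
    return a, log_prob_u - correction
```

This extra per-element transcendental work is what gets added on top of the plain log-probability calculation in this comparison.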

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 50.94963947499855 | 79.64743485200233 |
| 2 | 50.60395075799897 | 78.01757510300013 |
| 3 | 50.43615571299961 | 80.14881448499727 |
| mean | 50.66324864866571 | 79.27127481333324 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 50.378254114999436 | 151.5695584310015 |
| 2 | 48.73778436000066 | 153.44167755200033 |
| 3 | 48.58900336899751 | 150.91326868199758 |
| mean | 49.2350139479992 | 151.97483488833313 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 49.12833458800014 | 152.07146545599971 |
| 2 | 50.353105712005345 | 151.1550173670039 |
| 3 | 48.28671533699526 | 154.40869215199928 |
| mean | 49.256051879000246 | 152.54505832500095 |

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.219657355999516 | 156.77407888899324 |
| 2 | 47.961540133997914 | 152.9459018090056 |
| 3 | 47.866569441997854 | 153.34349197799747 |
| mean | 48.0159223106651 | 154.35449089199878 |

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 48.151476297000045 | 80.33919906399933 |
| 2 | 47.25795930499953 | 77.96675552500164 |
| 3 | 47.56391130200063 | 79.42569091699988 |
| mean | 47.657782301333405 | 79.24388183533362 |

Conclusion

The results above show that the PyTorch version becomes slower than the TensorFlow version when we add the squash operation. Further, it again shows that enabling the GPU increases the execution time for the PyTorch version while it doesn't change anything for the TensorFlow version.

Compare stand-alone squashing function

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 44.14887383600035 | 70.58557372399991 |
| 2 | 47.06251954599975 | 70.10898442500002 |
| 3 | 44.60933404100069 | 67.97248776999913 |
| mean | 45.27357580766693 | 69.5556819729997 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.64849690200026 | 138.98152366400063 |
| 2 | 49.2268215790009 | 137.38257430200065 |
| 3 | 47.322883758999524 | 139.86679378100052 |
| mean | 48.06606741333356 | 138.74363058233394 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 47.66489131900016 | 69.29213760200037 |
| 2 | 49.14248128899999 | 68.04381407999972 |
| 3 | 48.16409693700007 | 67.40893307699844 |
| mean | 48.32382318166674 | 68.24829491966618 |

Conclusion

From the results above, we see that the squashing operation is indeed the bottleneck that causes the PyTorch version to be slower than the TensorFlow version. We therefore have to investigate whether we can speed up this squashing operation in the PyTorch version.

Compare linear layer forward pass time

Now let's compare a forward pass through the linear layers of the networks.
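A sketch of the PyTorch side of this timing (the layer sizes, batch size, and iteration count are assumptions, not the repo's exact values):

```python
import timeit

import torch
import torch.nn as nn

# Time forward passes through a small stack of linear layers,
# similar to the hidden layers of the actor network.
net = nn.Sequential(
    nn.Linear(8, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
)
batch = torch.randn(64, 8)

with torch.no_grad():
    elapsed = timeit.timeit(lambda: net(batch), number=1000)
print(f"1000 forward passes took {elapsed:.4f}s")
```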

CPU GLOBAL

Let's disable GPU for both TensorFlow and Pytorch.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 375.18870995100224 | 320.4068507410011 |
| 2 | 369.548168835001 | 311.17305828499957 |
| 3 | 370.5892158940005 | 313.39369041700047 |
| mean | 371.77536489333457 | 314.9911998143337 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 229.82201903599343 | 223.34556611099833 |
| 2 | 224.82652258500457 | 223.9562453960025 |
| 3 | 227.6470035270031 | 222.9704135100037 |
| mean | 227.43184838266703 | 223.4240750056682 |

GPU GLOBAL cudnn.benchmark

Let's enable the cuDNN benchmark flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 238.45478896200075 | 224.40128824699786 |
| 2 | 235.52634897700045 | 224.19951804900484 |
| 3 | 230.21658990500146 | 225.0826409250003 |
| mean | 234.73257594800089 | 224.56114907366768 |

GPU GLOBAL cudnn.fastest

Let's enable the cuDNN fastest flag.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 231.63010256599955 | 219.30785131300217 |
| 2 | 223.39881358000275 | 225.78243027799908 |
| 3 | 228.4898673600037 | 228.4898673600037 |
| mean | 227.839594502002 | 224.52671631700164 |

GPU customised (Without writing back to CPU)

Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 234.0862072489981 | 223.88553328599664 |
| 2 | 231.2203368430055 | 223.88553328599664 |
| 3 | 234.37278411599982 | 224.01825483499852 |
| mean | 233.22644273600113 | 223.9297738023306 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

GPU customised (With writing back to CPU)

Now let's use the GPU while also writing the results back from the GPU to the CPU at every forward pass.
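
The difference between the two GPU variants comes down to whether a device-to-host copy is issued after each pass. A sketch of the distinction (hypothetical layer; it falls back to the CPU when no GPU is present):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Linear(8, 256).to(device)
batch = torch.randn(64, 8, device=device)

out_on_device = net(batch)      # result stays on the compute device
out_on_host = net(batch).cpu()  # extra device-to-host copy every pass
```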

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 225.9497257460025 | 276.2206137460016 |
| 2 | 231.38425711599848 | 273.4203483720048 |
| 3 | 230.72190967400093 | 274.2106625689994 |
| mean | 229.35196417866732 | 274.61720822900196 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 205.74492095799724 | 313.0814373590001 |
| 2 | 223.43292250000013 | 333.1711495269992 |
| 3 | 230.10195052199924 | 324.7172065039995 |
| mean | 219.75993132666554 | 323.6565977966663 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

Conclusion

From these results, we see that the forward pass is faster in the PyTorch version. The only case where the TensorFlow version is faster is when we write the data back to the CPU with every pass. This is expected, as this operation is computationally expensive and the TensorFlow version keeps the data on the GPU.

Compare full networks forward pass time

Now let's compare a forward pass through the whole network, i.e. including the sampling, log-probability and squashing operations.
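
Putting the pieces together, the full forward pass being timed roughly corresponds to the following squashed-Gaussian actor (a sketch; the layer sizes are assumptions, not the repo's exact architecture):

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianActor(nn.Module):
    """Minimal squashed-Gaussian actor: linear layers, rsample,
    log-probability and tanh squashing with correction."""

    def __init__(self, obs_dim=8, act_dim=3, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.trunk(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std(h).exp())
        u = dist.rsample()                 # reparameterized sample
        log_pi = dist.log_prob(u).sum(-1)  # Gaussian log-probability
        # tanh squash + change-of-variables correction
        log_pi -= (2 * (math.log(2) - u - F.softplus(-2 * u))).sum(-1)
        return torch.tanh(u), log_pi


actor = GaussianActor()
action, log_pi = actor(torch.randn(4, 8))
```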

CPU GLOBAL

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 411.22191030799877 | 536.3658260090015 |
| 2 | 412.4232401260015 | 513.8089201249968 |
| 3 | 414.8655814080048 | 525.9684370820032 |
| mean | 412.8369106140017 | 525.3810610720005 |

GPU GLOBAL

Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 336.8346674320055 | 432.64605384600145 |
| 2 | 337.65133418299956 | 432.2832743270046 |
| 3 | 335.77879138099524 | 428.30122067600314 |
| mean | 336.7549309986668 | 431.0768496163364 |

Enabling the cuDNN benchmark and fastest flags does not change the speed.

GPU customised (Without writing back to CPU)

Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 336.0141714460042 | 414.5644553680031 |
| 2 | 339.7367062159901 | 418.4468067179987 |
| 3 | 343.35170658299467 | 427.59492996100744 |
| mean | 339.7008614149963 | 420.20206401566975 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

GPU customised (With writing back to CPU)

Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 330.5518290850014 | 487.8034812370024 |
| 2 | 341.3577948439997 | 487.35418161000007 |
| 3 | 343.21883232100026 | 496.9929247519999 |
| mean | 338.3761520833338 | 490.7168625330008 |

Enabling the cuDNN benchmark or fastest flags does not change the speed.

TF GPU, PyTorch CPU

Let's disable GPU for PyTorch and keep it enabled in TensorFlow.

| run | TF execution time | PyTorch execution time |
| --- | --- | --- |
| 1 | 335.62528117199963 | 512.721542063 |
| 2 | 337.9906162549996 | 515.8002393119996 |
| 3 | 341.41880026500075 | 526.5174871430008 |
| mean | 338.34489923066667 | 518.3464228393335 |

Conclusion

From the results above, we see that for the full forward pass, TensorFlow is significantly faster. As explained above, this is probably caused by the slower execution speed of the squash operation.

Overall conclusion

From the results above we can conclude the following:

rickstaa commented 3 years ago

Closed due to inactivity.