We can inspect the script using the Python cProfile profiler:
python -m cProfile -o myscript.cprof myscript.py
This resulted in the following report:
This report can be opened using the `pyprof2calltree -k -i torch_train.cprof` command. For this, you need to install the `kcachegrind` Debian package.
From this report, we can see that most of the time is spent running the Adam optimiser and Network forward functions.
Below are the results of running the torch.utils.bottleneck utility.
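For reference, the bottleneck utility is invoked on the training script as follows (the script name is illustrative):

```
python -m torch.utils.bottleneck train.py
```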
According to Takizawa et al. 2009, training on the GPU only improves speed if your hidden layers contain more than 128 neurons. However, that article was published in 2009 and its example was written in C++. Also, judging from this issue and the paper of Brito et al. 2016, it should not matter much for our small network. However, if we look at the torch.utils.bottleneck report, we can see that the script might be faster if we put both the network and the replay buffer on the GPU.
For the current network, training on the GPU is slower:
This might be because we need to copy the return values of the `learn()` function from the GPU to the CPU to compute the diagnostics. In the Spinning Up version, this is done during logging and thus not at every SGD step.
The results of the GPU bottleneck report (`nvprof --print-gpu-trace python train.py`) are:
The results of the system-call bottleneck report (`strace -fcT -e trace=open,close,read python training.py`) are:
Torch uses the `float32` dtype as the default. In TensorFlow, the dtype of the placeholders is also defined to be `float32`. Further, any `float64` values that are returned from the environment (because NumPy uses `float64`) are cast to `float32` when the transition dictionary is created in the replay buffer:
transition = {'s': s, 'a': a, 'd': d, 'raw_d': raw_d, 'r': np.array([r]), 'terminal': np.array([terminal]), 's_': s_}
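For illustration, a minimal sketch of how this cast could be made explicit when the transition is created (the helper name is hypothetical; the repository may perform the cast elsewhere):

```python
import numpy as np

def make_transition(s, a, r, terminal, s_, d, raw_d):
    """Hypothetical helper: store every array as float32 so the float64 values
    coming from the NumPy-based environment match the float32 default dtype
    of the networks."""
    return {
        's': np.asarray(s, dtype=np.float32),
        'a': np.asarray(a, dtype=np.float32),
        'd': d,
        'raw_d': raw_d,
        'r': np.array([r], dtype=np.float32),
        'terminal': np.array([terminal], dtype=np.float32),
        's_': np.asarray(s_, dtype=np.float32),
    }
```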
As seen in #17, the `LAC_TF2_CLEANED_EAGER_SPEEDUP` version is currently the fastest TF version. As a result, we use this version for the comparison with the PyTorch version. For the environment, we choose the oscillator as it is relatively light. During the comparison, we will do five rollouts of `1.1e4` steps.
As Torch and TF use different random number generators, it is not possible to make both experiments entirely identical and deterministic. The only thing we can make equal is the initial environment state. It is therefore important to remember that differences in the initial weights and policy noise cause some of the speed differences. I will try to use several random seeds to average out this difference.
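For illustration, a minimal sketch of how the seeds could be set in both frameworks (the variable names and the Gym-style `env.seed` call are assumptions):

```python
import random
import numpy as np
import torch
import tensorflow as tf

RANDOM_SEED = 0

# Seed every random number generator involved. Note that even with identical
# seeds, PyTorch and TensorFlow draw different random numbers, so the initial
# weights and exploration noise still differ between the two implementations.
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# env.seed(RANDOM_SEED)  # Gym-style environments: makes the initial state reproducible.
```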
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 45.05048489570618 |
2 | 43.75153851509094 |
3 | 43.895511865615845 |
4 | 45.38190507888794 |
5 | 43.76770758628845 |
mean | 44.36942958831787 |
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 84.154057264328 |
2 | 84.08330941200256 |
3 | 84.85945796966553 |
4 | 83.90588426589966 |
5 | 83.34130358695984 |
mean | 84.06880249977112 |
From this, it looks like Torch is about 2 times slower than TensorFlow, but we have to keep in mind that we used the `tf.function` wrapper in the TensorFlow version. This wrapper compiles the Python code into a static graph before executing it. When we disable this compilation, the TensorFlow version becomes significantly slower than the PyTorch version:
run | Execution time |
---|---|
1 | 1249.4554846286774 |
2 | 1248.4554846286774 |
3 | 1235.324506999488 |
mean | 1244.4118254189477 |
This leads to an unfair comparison. To have a fair comparison, we first have to optimize the Torch code using the equivalent torchscript wrapper. This feature is added in #22.
From the report above, we can see that although PyTorch is significantly faster than TF2.0 in pure eager mode, it is slower than TF2.0 when the `tf.function` wrapper is used. This is as expected, as this wrapper compiles the enclosed Python code into a static graph. In this comment, I will try to investigate whether it is possible to speed up the PyTorch version using the equivalent TorchScript wrapper. I will further try to see if there are other optimisations I can make to the PyTorch code.
Let's look at possible speed bottlenecks in the PyTorch code by inspecting the Spyder profiler reports. We do this by letting one agent perform `1.1e4` steps in the oscillator environment while disabling the GPU.
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 52.047799587249756 |
2 | 51.3368079662323 |
3 | 50.19379186630249 |
4 | 50.3440146446228 |
5 | 50.664284467697144 |
mean | 50.9173397064209 |
RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 83.62926721572876 |
2 | 82.74273562431335 |
3 | 81.42218828201294 |
4 | 82.77582883834839 |
5 | 82.26928067207336 |
mean | 82.56786012649536 |
Let's compare the TF2.0 and Pytorch reports.
Execution time: 51.663371562957764
From this, we can see that the PyTorch code is on average 1.62 times slower than the optimised TF2.0 code. While making this comparison, it is important to note that in the TF2.0 version the following methods were wrapped with the `tf.function` wrapper and will therefore show up under the `__call__` method:

- `choose_action`
- `learn`
- `target_init`
- `update_target`
- `LyapunovCritic.call` (forward and backward pass)
- `GaussianActor.call` (forward and backward pass)

First of all, we see that the `step` function is faster in the PyTorch code. This seems to be because the `amax` function takes less time in the PyTorch version. This is unexpected, as the same Python, NumPy and environment versions are used. It could be an error in the profiler.
Here the results between PyTorch and TensorFlow are similar. This is as expected as the code was not changed.
For these methods, TensorFlow is faster as the `tf.function` wrapper is used. We can further see that most of the speed in the PyTorch version is lost during the forward and backward passes. In these functions, the following methods take the most time:

- `log_prob`: 3.67 s
- `rsample`: 1.72 s
- `linear`: 10.28 s
- `relu`: 2.65 s

After that, it is the `update_target` method that takes the most time.
To speed up the Pytorch version we can first check if there are changes we can make to the python code of the bottlenecks. After that, we can try to use the torchscript wrapper to create a static graph.
I looked on GitHub, and there is currently no open bug regarding the speed of the `log_prob` function. I also did not find ways to improve the speed of this method.
There are currently two issues that discuss slowness in the `rsample` method:
They are not related to our use case but might lead to a speed improvement of the `rsample` method in the future. For now, however, I could not find a way to speed up this method in eager mode.
There are no open issues about speed problems with the `nn.Linear` and `nn.ReLU` modules. I also did not find a way to speed these up without using the TorchScript wrapper.
Let's try to speed up the bottleneck functions using the TorchScript wrapper. I will do this in the `LAC_CLEANED_TORCH_SPEEDUP` folder.
The TorchScript module in the stable PyTorch release does not yet support the `@unused` and `@ignore` decorators when working with normal Python classes. As a result, when using PyTorch 1.6 I cannot speed up the `choose_action`, `update_target` and `learn` methods without rewriting them as separate functions (see this issue). This feature will, however, be included in PyTorch 1.7. In PyTorch 1.7 we still have the problem that we cannot yet convert the `GaussianActor` to a TorchScript (see below) and thus cannot convert the `choose_action`, `learn` and `update_target` functions, as all functions inside a TorchScript should be TorchScript compatible.
The TorchScript module can be used to speed up the `LyapunovCritic.call` method. Doing this, however, does not change the execution time by a lot (it improves by 1-2 seconds).
The TorchScript module does not yet work with the sampling operation we use in our Gaussian actor (see https://github.com/pytorch/pytorch/issues/29843 and https://github.com/pytorch/pytorch/issues/18094). We can therefore not fully translate the forward method of the Gaussian actor to TorchScript. We can, however, apply the [TorchScript wrapper](https://pytorch.org/docs/stable/jit.html) to the `nn.Linear` layers.
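A minimal sketch of what this partial scripting could look like (the class layout, layer sizes and critic head are assumptions, not the repository code):

```python
import torch
import torch.nn as nn

class LyapunovCritic(nn.Module):
    """Simplified stand-in for the real critic; sizes and head are assumptions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        out = self.net(torch.cat([obs, act], dim=-1))
        return torch.square(out).sum(dim=-1, keepdim=True)

# The whole critic module can be compiled to TorchScript...
critic = torch.jit.script(LyapunovCritic(obs_dim=3, act_dim=1))

# ...while for the Gaussian actor only the individual linear layers can be
# scripted, because the distribution sampling is not TorchScript compatible yet.
scripted_layer = torch.jit.script(nn.Linear(3, 64))
```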
run | Execution time |
---|---|
1 | 83.89904952049255 |
2 | 85.36705589294434 |
3 | 86.45388126373291 |
4 | 85.51795291900635 |
5 | 85.36600470542908 |
mean | 85.32078886032104 |
It appears that changing the forward function of the `LyapunovCritic` and the linear layers of the `GaussianActor` into TorchScripts does not improve the performance. As explained in this forum post, this is not strange, as these modules already use the C++ backend and therefore do not benefit from the optimisations done in the TorchScript module, like fusing operations. As a result, only applying this wrapper to the full forward method of the `GaussianActor`, or even the full `learn` method, might yield improved performance.
Using the GPU instead of the CPU with our small networks does not improve or decrease the speed much.
Sometimes enabling the cuDNN auto-tune
algorithm speeds up the training (see this forum post). In our case, this did not help.
As a final step lets compare the speed of the Forward/backward pass and the distributions sample actions. For this, we can create some small dummy timing scripts. These scripts can be found in the sandbox/speed_comparison
folder.
Let's compare the speed of the sampling action in the Gaussian actor between PyTorch and TensorFlow. For this, we use the `timeit_sample_speed_compare.py` script. In this script, we take `5e5` samples/rsamples from the normal distribution.
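A rough sketch of what such a timing script could look like (the real `timeit_sample_speed_compare.py` may differ; TensorFlow Probability and the batch size are assumptions):

```python
import timeit

import torch
import tensorflow as tf
import tensorflow_probability as tfp

N = int(5e5)  # number of (r)samples

# PyTorch: reparameterised samples from a diagonal Gaussian.
torch_dist = torch.distributions.Normal(torch.zeros(3), torch.ones(3))
torch_time = timeit.timeit(lambda: torch_dist.rsample(), number=N)

# TensorFlow (eager): samples from the equivalent tfp distribution.
tf_dist = tfp.distributions.Normal(loc=tf.zeros(3), scale=tf.ones(3))
tf_time = timeit.timeit(lambda: tf_dist.sample(), number=N)

print(f"PyTorch rsample: {torch_time:.2f} s, TensorFlow sample: {tf_time:.2f} s")
```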
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.269880842999555 | 15.277080029998615 |
2 | 46.73242047799795 | 15.733279996999045 |
3 | 47.41759105299934 | 15.471710162000818 |
mean | 47.13996412466562 | 15.494023396332826 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU). In PyTorch, this is done using the following command:
torch.set_default_tensor_type('torch.cuda.FloatTensor') # Enable global GPU
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 43.95948519900048 | 31.157700912997825 |
2 | 45.31296913999904 | 30.35345944400251 |
3 | 44.46277614200153 | 29.964271681001264 |
mean | 44.57841016033368 | 30.491810679333867 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag. In PyTorch, this is done using the following command:
torch.backends.cudnn.benchmark = True
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.35301048200199 | 29.58910515900061 |
2 | 44.9747937630018 | 29.630526167995413 |
3 | 45.64101090400072 | 29.92008007500408 |
mean | 44.98960504966817 | 29.713237134000034 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag. In PyTorch, this is done using the following command:
torch.backends.cudnn.fastest = True
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.429974582002615 | 31.8192962969988 |
2 | 44.92464860399923 | 30.395992316996853 |
3 | 44.754384061001474 | 30.31036730899359 |
mean | 44.70300241566777 | 30.84188530766308 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 43.431389666000086 | 15.98217932800003 |
2 | 43.28244902200004 | 15.648057652000034 |
3 | 41.674334185000134 | 15.587252373000183 |
mean | 42.79605762433342 | 15.739163117666749 |
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 49.16316856100093 | 17.141713114000595 |
2 | 49.98035491799965 | 17.08107351599756 |
3 | 50.76262897200286 | 17.62022704699848 |
mean | 49.96871748366781 | 17.281004558998878 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.19858287099851 | 32.428321216997574 |
2 | 47.84914048400242 | 32.55513772999984 |
3 | 48.16133704599997 | 32.596067703998415 |
mean | 48.069686800333635 | 32.52650888366528 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.682932835006795 | 32.69532821299799 |
2 | 48.566335363997496 | 33.00475709999591 |
3 | 49.04837610800314 | 33.51344140100264 |
mean | 48.43254810233581 | 33.07117557133218 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.537661023998226 | 33.475586044995 |
2 | 48.195051800998044 | 32.84605344099691 |
3 | 47.35034869299852 | 32.90194606300065 |
mean | 47.6943538393316 | 33.07452851633085 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.47512355899994 | 17.723089199000015 |
2 | 47.105311723999876 | 17.64742597999998 |
3 | 48.08153719200027 | 17.38713361100008 |
mean | 47.55399082500003 | 17.585882930000025 |
From these results, we can see that PyTorch is significantly faster than TensorFlow at sampling actions. Further, we see that using the GPU doesn't affect the execution time of TensorFlow, while for PyTorch the execution time gets longer. The execution time, however, stays significantly lower for the PyTorch version even when the GPU is enabled. This execution time does not improve when enabling the cuDNN benchmark and fastest flags.
Now, in addition to this sampling operation, let's also calculate the log_probabilities of the sampled action.
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.36903019700185 | 47.872389495998505 |
2 | 51.30757960200208 | 46.29049172399755 |
3 | 49.60291533399868 | 47.005423079997854 |
mean | 50.42650837766754 | 47.0561014333313 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.91270494000128 | 100.27575907999926 |
2 | 47.13373410799977 | 98.49456620199999 |
3 | 46.99059675100216 | 99.80839579399981 |
mean | 47.67901193300107 | 99.52624035866636 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.4585376219984 | 101.22743981500389 |
2 | 45.78058913299992 | 97.0079226279995 |
3 | 47.77440226700128 | 99.20072890699521 |
mean | 47.3378430073332 | 99.14536378333287 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.1281833979956 | 101.55883601200185 |
2 | 47.37521640000341 | 100.41289296699688 |
3 | 47.830623365000065 | 99.11197711800196 |
mean | 47.77800772099969 | 100.3612353656669 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.37839867199909 | 48.91452547500012 |
2 | 47.48065972999757 | 46.51999894999972 |
3 | 46.97186975300065 | 46.97186975300065 |
mean | 47.27697605166577 | 47.468798059333494 |
From this, we can see that calculating the log_probability from a sampled action is slightly faster in Pytorch when it is done on the CPU. The execution time increases in the Pytorch when the calculations are done on the GPU. For TensorFlow, the execution time decreases when the calculations are performed on the GPU. Additionally, this result also suggests that the stand-alone log_probability calculation is slower in Pytorch than in TensorFlow as the sampling operation is faster in Pytorch, but the total time is the same.
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 45.78058913299992 | 38.37326285200015 |
2 | 44.57832863599924 | 38.93101751900031 |
3 | 45.58136013500007 | 38.56264149399976 |
mean | 45.31342596799974 | 38.62230728833341 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.07486991299993 | 88.60940667599971 |
2 | 47.214407829000265 | 86.43189353300022 |
3 | 47.78040611400047 | 85.64942697800052 |
mean | 47.68989461866689 | 86.89690906233348 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 45.30441331600014 | 41.29998277499999 |
2 | 45.31698288100051 | 38.515963581000506 |
3 | 44.346509671000604 | 39.96837626400065 |
mean | 44.98930195600042 | 39.92810754000038 |
From the above results, we can see that calculating the log probability is slightly faster in PyTorch when using CPU mode. When GPU mode is used, PyTorch becomes slower. This result is expected given the results above. It also shows that the contribution of the sampling action to the execution time is smaller than that of the log_probability calculation.
Now, in addition to calculating the log_probability of the sampled action, let's also squash the computed action while applying a correction for this squashing to the calculated log_probabilities.
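For reference, a stand-alone sketch of the squashing step that is being timed here, following the common SAC-style tanh correction (a simplified illustration, not the exact repository code):

```python
import torch

def squashed_sample(dist: torch.distributions.Normal, epsilon: float = 1e-6):
    """Sample an action, squash it with tanh and correct the log probability."""
    pre_tanh = dist.rsample()                      # reparameterised sample
    action = torch.tanh(pre_tanh)                  # squash into (-1, 1)
    log_prob = dist.log_prob(pre_tanh).sum(dim=-1)
    # Change-of-variables correction: log pi(a) = log mu(u) - sum_i log(1 - tanh(u_i)^2)
    log_prob -= torch.log(1.0 - action.pow(2) + epsilon).sum(dim=-1)
    return action, log_prob

dist = torch.distributions.Normal(torch.zeros(2, 3), torch.ones(2, 3))
action, log_prob = squashed_sample(dist)
```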
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.94963947499855 | 79.64743485200233 |
2 | 50.60395075799897 | 78.01757510300013 |
3 | 50.43615571299961 | 80.14881448499727 |
mean | 50.66324864866571 | 79.27127481333324 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 50.378254114999436 | 151.5695584310015 |
2 | 48.73778436000066 | 153.44167755200033 |
3 | 48.58900336899751 | 150.91326868199758 |
mean | 49.2350139479992 | 151.97483488833313 |
GPU GLOBAL cudnn.benchmark
Let's enable the benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 49.12833458800014 | 152.07146545599971 |
2 | 50.353105712005345 | 151.1550173670039 |
3 | 48.28671533699526 | 154.40869215199928 |
mean | 49.256051879000246 | 152.54505832500095 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.219657355999516 | 156.77407888899324 |
2 | 47.961540133997914 | 152.9459018090056 |
3 | 47.866569441997854 | 153.34349197799747 |
mean | 48.0159223106651 | 154.35449089199878 |
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 48.151476297000045 | 80.33919906399933 |
2 | 47.25795930499953 | 77.96675552500164 |
3 | 47.56391130200063 | 79.42569091699988 |
mean | 47.657782301333405 | 79.24388183533362 |
The results above show that the Pytorch version becomes slower than the TensorFlow version when we add the Squash operation. Further, it again shows that enabling the GPU increases the execution time for the Pytorch version while it doesn't change anything for the TensorFlow version.
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 44.14887383600035 | 70.58557372399991 |
2 | 47.06251954599975 | 70.10898442500002 |
3 | 44.60933404100069 | 67.97248776999913 |
mean | 45.27357580766693 | 69.5556819729997 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.64849690200026 | 138.98152366400063 |
2 | 49.2268215790009 | 137.38257430200065 |
3 | 47.322883758999524 | 139.86679378100052 |
mean | 48.06606741333356 | 138.74363058233394 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 47.66489131900016 | 69.29213760200037 |
2 | 49.14248128899999 | 68.04381407999972 |
3 | 48.16409693700007 | 67.40893307699844 |
mean | 48.32382318166674 | 68.24829491966618 |
From the results above, we see that the squashing operation is indeed the bottleneck that causes the PyTorch version to be slower than the TensorFlow version. We therefore have to look at whether we can speed up this squashing in the PyTorch version.
Now let's compare a forward pass through the linear layers of the networks.
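A rough sketch of how this forward-pass timing could look (layer sizes, batch size and iteration count are assumptions):

```python
import timeit

import torch
import torch.nn as nn
import tensorflow as tf

N = int(5e5)
obs_torch = torch.zeros(1, 3)
obs_tf = tf.zeros((1, 3))

# Two small, comparable MLPs (sizes are assumptions).
torch_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
tf_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(64, activation="relu"),
])

torch_time = timeit.timeit(lambda: torch_net(obs_torch), number=N)
tf_time = timeit.timeit(lambda: tf_net(obs_tf), number=N)
print(f"PyTorch forward: {torch_time:.2f} s, TensorFlow forward: {tf_time:.2f} s")
```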
CPU GLOBAL
Let's disable GPU for both TensorFlow and Pytorch.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 375.18870995100224 | 320.4068507410011 |
2 | 369.548168835001 | 311.17305828499957 |
3 | 370.5892158940005 | 313.39369041700047 |
mean | 371.77536489333458 | 314.9911998143337 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 229.82201903599343 | 223.34556611099833 |
2 | 224.82652258500457 | 223.9562453960025 |
3 | 227.6470035270031 | 222.9704135100037 |
mean | 227.43184838266703 | 223.4240750056682 |
GPU GLOBAL cudnn.benchmark
Let's enable the cuDNN benchmark flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 238.45478896200075 | 224.40128824699786 |
2 | 235.52634897700045 | 224.19951804900484 |
3 | 230.21658990500146 | 225.0826409250003 |
mean | 234.73257594800089 | 224.56114907366768 |
GPU GLOBAL cudnn.fastest
Let's enable the cuDNN fastest flag.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 231.63010256599955 | 219.30785131300217 |
2 | 223.39881358000275 | 225.78243027799908 |
3 | 228.4898673600037 | 228.4898673600037 |
mean | 227.839594502002 | 224.52671631700164 |
GPU customised (Without writing back to CPU)
Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 234.0862072489981 | 223.88553328599664 |
2 | 231.2203368430055 | 223.88553328599664 |
3 | 234.37278411599982 | 224.01825483499852 |
mean | 233.22644273600113 | 224.01825483499852 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (With writing back to CPU)
Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.
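Roughly, the difference between the two GPU variants comes down to whether the output tensor is copied back to host memory after every pass (a simplified illustration; the real timing script may differ):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU()).to(device)
obs = torch.zeros(1, 3, device=device)

# Variant 1: keep the result on the GPU (no device-to-host copy).
out_gpu = net(obs)

# Variant 2: copy the result back to the CPU after every forward pass. The
# .cpu() call forces a synchronisation and a device-to-host transfer, which is
# what makes this variant slower.
out_cpu = net(obs).cpu()
```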
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 225.9497257460025 | 276.2206137460016 |
2 | 231.38425711599848 | 273.4203483720048 |
3 | 230.72190967400093 | 274.2106625689994 |
mean | 229.35196417866732 | 274.61720822900196 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 205.74492095799724 | 313.0814373590001 |
2 | 223.43292250000013 | 333.1711495269992 |
3 | 230.10195052199924 | 324.7172065039995 |
mean | 219.75993132666554 | 323.6565977966663 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
From these results, we see that the forward pass is faster in the Pytorch version. The only case where the Tensorflow version is faster is when we write the data back to the CPU with every pass. This is as expected as this operation is computationally expensive, and the Tensorflow version keeps the data on the GPU.
Now let's compare a forward pass through the whole network, i.e. including the sampling action, the log probability calculation and the squashing operation.
CPU GLOBAL
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 411.22191030799877 | 536.3658260090015 |
2 | 412.4232401260015 | 513.8089201249968 |
3 | 414.8655814080048 | 525.9684370820032 |
mean | 412.8369106140017 | 525.3810610720005 |
GPU GLOBAL
Let's force the GPU to be used in Pytorch and use the default TensorFlow behaviour (Use GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 336.8346674320055 | 432.64605384600145 |
2 | 337.65133418299956 | 432.2832743270046 |
3 | 335.77879138099524 | 428.30122067600314 |
mean | 336.7549309986668 | 431.0768496163364 |
Enabling benchmark and fastest does not change the speed.
GPU customised (Without writing back to CPU)
Now let's use GPU while not writing the results of the forward pass back to the CPU at every forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 336.0141714460042 | 414.5644553680031 |
2 | 339.7367062159901 | 418.4468067179987 |
3 | 343.35170658299467 | 427.59492996100744 |
mean | 339.7008614149963 | 420.20206401566975 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
GPU customised (With writing back to CPU)
Now let's use GPU while also writing the results back from the GPU to the CPU with a forward pass.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 330.5518290850014 | 487.8034812370024 |
2 | 341.3577948439997 | 487.35418161000007 |
3 | 343.21883232100026 | 496.9929247519999 |
mean | 338.3761520833338 | 490.7168625330008 |
Enabling the cuDNN benchmark or fastest flags does not change the speed.
TF GPU, PyTorch CPU
Let's disable GPU for PyTorch and keep it enabled in TensorFlow.
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 335.62528117199963 | 512.721542063 |
2 | 337.9906162549996 | 515.8002393119996 |
3 | 341.41880026500075 | 526.5174871430008 |
mean | 338.34489923066667 | 518.3464228393335 |
From the results above, we see that for the full forward pass Tensorflow is significantly faster. As explained above, this is probably caused by the slower execution speed of the squash operation.
From the results above we can conclude the following:

- The squashing operation is the main reason the full PyTorch forward pass is slower than the TensorFlow one.
- A possible speed-up might come from applying the TorchScript wrapper to the full `learn` method.

Let's examine what could cause the squashing operation to be slower in the Pytorch version.
I double-checked, and the PyTorch and TensorFlow solutions are equivalent. The only difference can be found in the way the two frameworks calculate the log_probability of the sampled actions that come from the squashed Gaussian.
In PyTorch, these log probabilities are calculated by computing them from the non-squashed distribution and then applying a correction for the tanh squashing.
Tensorflow uses the bijectors and TransformedDistribution modules to achieve the same.
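For illustration, a minimal sketch of the TensorFlow approach using a tanh bijector with `TransformedDistribution` (the repository defines its own `SquashBijector`, so this only approximates the actual code):

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Base (non-squashed) diagonal Gaussian.
base_dist = tfp.distributions.MultivariateNormalDiag(
    loc=tf.zeros(3), scale_diag=tf.ones(3)
)

# Squashing the distribution with a tanh bijector; TransformedDistribution
# applies the log-determinant correction to the log probability automatically.
squashed_dist = tfp.distributions.TransformedDistribution(
    distribution=base_dist, bijector=tfp.bijectors.Tanh()
)

action = squashed_dist.sample()
log_prob = squashed_dist.log_prob(action)
```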
As seen above, the two frameworks use the same formulas for calculating the squashed log probabilities.
There are two possible candidates that could cause the extra execution time in the PyTorch version:
Let's first get a baseline by measuring the log_prob execution time. In these tests, PyTorch uses the CPU while TensorFlow uses the default device (GPU).
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 130.1151521205902 | 96.50015997886658 |
2 | 133.6686074733734 | 96.55137491226196 |
3 | 129.1776819229126 | 95.73350787162781 |
mean | 130.98714717229207 | 96.26168092091878 |
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 130.4717721939087 | 107.03060293197632 |
2 | 130.64691376686096 | 105.87263202667236 |
3 | 134.26840901374817 | 105.66786861419678 |
mean | 131.79569832483926 | 106.19036785761516 |
run | TF execution time | Pytorch execution time |
---|---|---|
1 | 133.60435914993286 | 162.03667068481445 |
2 | 130.88181281089783 | 158.7844672203064 |
3 | 129.7918701171875 | 160.77405977249146 |
mean | 131.42601402600607 | 160.5317325592041 |
From these results, we see that the biggest increase in execution time occurs when we squash the distribution. To speed up the PyTorch version, we therefore have to see whether we can make this operation more efficient. This speed difference is strange, as the formula that is used for this is the same in the PyTorch and TensorFlow versions:
Tensorflow version https://github.com/rickstaa/LAC_TF2_TORCH_REWRITE/blob/73ad7d206a7d34438f9cf2883e2dee2c3edcded7/LAC_TF2_CLEANED_EAGER_SPEEDUP/squash_bijector.py#L26
As a result, the speed difference can be caused by:

- The `-=` operation or the `.sum()` method.
- The `tf.function` wrapper.
- Using `x += 1` instead of `x = x + 1`.
Let's perform one last investigation on how fast each version is (with the final code) before transferring to the MLC repository. To do this, I perform `0.8e4` steps in the oscillator environment.
CPU:
1: 37.805219650268555 s (8000 steps - seed 0)
2: 36.3125 s (8000 steps - seed 65453)
3: 36.72363305091858 s (8000 steps - seed 3453)
mean: 37 s
GPU:
The old version is using TensorFlow 1.13.0, which only supports Cuda 10.0. Cuda 10.0 is, however not supported on the os I'm using. I, therefore, had to test this on my old pc.
1: 44.91222882270813 s (8000 steps - seed 0)
2: 45.24729943275452 s (8000 steps - seed 65453)
3: 44.684839487075806 s (8000 steps - seed 3453)
mean: 45 s
CPU:
Seeds:
1: 65.01357102394104 s (8000 steps - seed 0)
2: 68.49877715110779 s (8000 steps - seed 65453)
3: 67.3913779258728 s (8000 steps - seed 3453)
mean: 67s
GPU:
Seeds:
1: 65.51777911186218 s (8000 steps - seed 0)
2: 65.25916171073914 s (8000 steps - seed 65453)
3: 65.9152410030365 s (8000 steps - seed 3453)
mean: 65 s
CPU:
1: 40.394707918167114 s (8000 steps - seed 0)
2: 52.60241150856018 s (8000 steps - seed 65453)
3: 71.55074095726013 s (8000 steps - seed 3453)
mean: 55 s
GPU:
1: 65.01357102394104 s (8000 steps - seed 0)
2: 66.7179274559021 s (8000 steps - seed 65453)
3: 74.95780944824219 s (8000 steps - seed 3453)
mean: 69 s
When we do a longer run of 1e5
steps, the difference in speed becomes more clear:
- LAC_TORCH (GPU): 845.13680768013 s
- LAC_TORCH (CPU): 924.6164543628693 s
- LAC_TF2: 573.4957985877991 s
We see that training takes nearly 1.47 times as long when using PyTorch as when using TF2.
When we look at the Spyder reports, we see that the difference is mainly caused by the `step` and `learn`/`call` functions. This is, however, very strange, as the `step` function is the same in both versions. I therefore think the profiler was unable to correctly measure the time of the individual components in the TensorFlow case (which compiles the code that is wrapped by the `tf.function` decorator). Let's try one more time using the function trace.
From this, we can see that the CPU speed really depends on the random seed we use. The GPU speed is more stable. Further, we see that overall the old version is a little bit faster than the new version. It also looks like the TF2 version is a little bit faster than the Torch version. This is not strange, as TensorFlow is using compiled code (via the `tf.function` decorator), whereas PyTorch is using Python code of which the low-level components are written in C++. If we disable the `tf.function` decorator, the TensorFlow code becomes 10 times slower than PyTorch. We can, however, not be 100% certain, as we did not use more seeds. The Torch version on GPU is, however, faster than the TensorFlow version when it is using the GPU. This is as expected, as for the PyTorch version I only put the big networks on the GPU, whereas the TensorFlow version tries to do more things on the GPU. From the reports above, we can also see that the biggest speed difference is found in the forward and backward passes.
A way to speed this up would be to use the TorchScript wrapper. This is currently not yet possible, as TorchScript does not support the distributions that are used in the Gaussian actor (see https://github.com/pytorch/pytorch/issues/18094 and https://github.com/pytorch/pytorch/issues/29843).
Describe the bug
When we run the same experiment in Torch and TF1, the Torch version takes significantly longer.
To Reproduce
Steps to reproduce the behaviour:

1. Run the `train.py` file in the `LAC_CLEANED_TORCH` folder.
2. Run the `train.py` file in the `LAC_TF1_ORIGINAL` folder.

Expected behaviour
Similar speeds.
Screenshots
For training 4 rollouts, we get the following results. Left is Torch, right is TensorFlow:
Desktop (please complete the following information):
Possible debug steps