rickstaa / LAC-TF2-TORCH-translation

Temporary repository to debug what goes wrong during the translation of the LAC algorithm from TF1 to Torch.

TF2 version slower than TF1 version #17

Closed rickstaa closed 4 years ago

rickstaa commented 4 years ago

User story

Since we want to make the PyTorch version as fast as the TF1 version, it might be a good idea to start by investigating why the TF2 version is slower than the TF1 version.

Resources

Main takeaways

Difference between the versions

TF115

In TF115 tfp.distributions.ConditionalTransformedDistribution is replaced by tfp.distributions.TransformedDistribution.

TF2

In TF2, tfp.bijectors.Affine is replaced by a combination of tfp.bijectors.Shift and tfp.bijectors.Scale.
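For reference, a minimal sketch of what this swap looks like (assuming a diagonal affine transform; the shift/scale values are illustrative, not taken from the LAC code):

```python
import tensorflow_probability as tfp

tfb = tfp.bijectors
mu, sigma = [0.0, 1.0], [1.0, 2.0]  # illustrative shift/scale values

# TF1-era TFP:
# affine = tfb.Affine(shift=mu, scale_diag=sigma)

# TF2-era TFP: Chain applies bijectors right-to-left, so this computes
# mu + sigma * x, matching Affine(shift=mu, scale_diag=sigma).
affine = tfb.Chain([tfb.Shift(mu), tfb.Scale(sigma)])
```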

Possible causes

Tools

rickstaa commented 4 years ago

Compare speed between versions

I tried to do this while keeping the following the same during training:

Because of a change in the Random Number Generator of the GlorotUniform initialiser, this is not possible when comparing versions below tf1.15 with tf>=1.15 without significantly rewriting one of the versions. As a result, the weights of the networks differ, and we therefore do not have a 100% fair comparison. I do not, however, expect this difference in network weight initialisation to cause a significant difference in running time.
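As a small illustration of the initialiser determinism (a sketch, not the LAC code; that the underlying RNG algorithm differs between TF releases is the point made above):

```python
import tensorflow as tf

# A seeded GlorotUniform initialiser is deterministic within one TF version,
# but because the random number generator behind it changed between releases,
# the same seed yields different weights in tf1.13 and tf>=1.15.
init = tf.keras.initializers.GlorotUniform(seed=0)
print(init(shape=(2, 3)))
```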

LAC_TF1_ORIGINAL

Tensorflow 1.13

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 76.22983193397522 |
| 2 | 79.77554988861084 |
| 3 | 81.70465159416199 |
| 4 | 69.69277858734131 |

The cProfile log file and results are:

tf1.zip

GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 110.71875476837158 |
| 2 | 103.70752668380737 |
| 3 | 101.92001938819885 |
| 4 | 105.97694063186646 |

The cProfile log file and results are:

tf1_original_gpu.zip

LAC_TF1_ORIGINAL_CLEANED

Tensorflow 1.13.0

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 72.85364580154419 |
| 2 | 79.3347327709198 |
| 3 | 69.82149505615234 |
| 4 | 71.70584893226624 |

The cProfile log file and results are:

tf1_cleaned.zip

Tensorflow 1.15.0

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 75.8878424167633 |
| 2 | 80.95311832427979 |
| 3 | 71.45233821868896 |
| 4 | 73.24926376342773 |

The cProfile log file and results are:

tf115_short_run_2.zip

LAC_TF2_GRAPH

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 90.4486939907074 |
| 2 | 80.69330954551697 |
| 3 | 96.18579578399658 |
| 4 | 91.07221174240112 |

The cProfile log file is:

tf2_cleaned_graph.zip

Tensorflow 2.3

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 82.71706700325012 |
| 2 | 86.21673941612244 |
| 3 | 88.57738018035889 |
| 4 | 85.81253170967102 |

The cProfile log file is:

tf2_cleaned_graph.zip

LAC_TF2_EAGER

Tensorflow 2.2

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 97.44767260551453 |
| 2 | 128.99831318855286 |
| 3 | 105.46797251701355 |
| 4 | 110.24135160446167 |

The cProfile log file is:

tf2_cleaned_eager_2_2.zip

GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 163.80216550827026 |
| 2 | 215.6815378665924 |
| 3 | 204.68237948417664 |
| 4 | 132.58426022529602 |

The cProfile log file is:

tf2_cleaned_eager_2_2_gpu.zip

Tensorflow 2.3

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 145.21399974822998 |
| 2 | 150.8920443058014 |
| 3 | 143.57212162017822 |
| 4 | 142.8047535419464 |

The cProfile log file is:

tf2_cleaned_eager_2_3.zip

GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 249.57833003997803 |
| 2 | 255.32116079330444 |
| 3 | 233.76185131072998 |
| 4 | 241.26305603981018 |

The cProfile log file is:

tf2_cleaned_eager_2_3_gpu.zip

Conclusion

From this, we can see that there is high variance in CPU load, speed and execution time. In all versions, the GPU variant is much slower because of the small size of our network. Further, speed seems to decrease drastically when we move from the TF2_GRAPH version to the TF2_EAGER version. Strangely, the speed decreases even further when we upgrade from tf2.2 to tf2.3.

The difference between the different Graph-based versions is partly caused by the way the script is stopped when max_global_steps has been reached. Currently, the script doesn't stop right away but performs more steps until the episode is over. This has to be fixed in our further comparison.

Some of the difference between the GRAPH and EAGER versions is caused by the fact that the noise inside the Gaussian actor is seeded differently. Due to the difference between the new Eager (object-oriented) and old Graph versions, we can not easily sample this noise with the same random seed. Currently, in the Graph version, when we create the graph of the next_state through the Gaussian actor, the noise sampler is seeded again, whereas in the Eager version the same sampler is reused:

lya_a_, _, _ = self._build_a(self.S_, reuse=True, seeds=self.ga_seeds)

I won't fix this since it is a lot of work, and I don't suspect it to cause the big difference in execution time.
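The distinction can be illustrated with TF2's stateful random generators (an assumed analogy, not the actual actor code): re-creating a seeded sampler restarts the noise sequence, while reusing one sampler continues it.

```python
import tensorflow as tf

# "Graph-style": each (re)build of the actor seeds its own sampler,
# so both samplers produce the same draws.
g1 = tf.random.Generator.from_seed(0)
g2 = tf.random.Generator.from_seed(0)
print(g1.normal([2]).numpy(), g2.normal([2]).numpy())  # identical

# "Eager-style": one sampler object is reused, so consecutive draws differ.
g = tf.random.Generator.from_seed(0)
print(g.normal([2]).numpy(), g.normal([2]).numpy())  # different
```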

rickstaa commented 4 years ago

Inspect the effect of using a different random seed

Let's check how much changing the random seed affects the runtime.

LAC_CLEAN_TF115 (TF1.15)

RANDOM_SEED=0

| run | Execution time (s) |
| --- | --- |
| 1 | 61.93721580505371 |
| 2 | 64.42950749397278 |
| 3 | 61.094213247299194 |
| mean | 62.48697884877523 |

RANDOM_SEED=6

| run | Execution time (s) |
| --- | --- |
| 1 | 61.93721580505371 |
| 2 | 64.42950749397278 |
| 3 | 61.094213247299194 |
| mean | 62.48697884877523 |

RANDOM_SEED=56

| run | Execution time (s) |
| --- | --- |
| 1 | 70.52894282341003 |
| 2 | 73.83393502235413 |
| 3 | 67.12420463562012 |
| mean | 70.49569416046143 |

RANDOM_SEED=1180

| run | Execution time (s) |
| --- | --- |
| 1 | 71.90123772621155 |
| 2 | 64.69407534599304 |
| 3 | 70.29362034797668 |
| mean | 68.9629778067271 |

RANDOM_SEED=0

| run | Execution time (s) |
| --- | --- |
| 1 | 67.44287109375 |
| 2 | 64.12875771522522 |
| 3 | 66.90567302703857 |
| mean | 66.33816663424174 |

RANDOM_SEED=4

| run | Execution time (s) |
| --- | --- |
| 1 | 64.22623944282532 |
| 2 | 70.38289451599121 |
| 3 | 68.56257486343384 |
| mean | 67.72390294075012 |

LAC_CLEAN_TF115 (TF2.3)

| run | Execution time (s) |
| --- | --- |
| 1 | 65.69019556045532 |
| 2 | 69.83095479011536 |
| 3 | 73.19888234138489 |
| mean | 69.57334423065186 |

LAC_TF2_CLEANED_GRAPH (TF2.3)

| run | Execution time (s) |
| --- | --- |
| 1 | 75.48882007598877 |
| 2 | 78.04177689552307 |
| 3 | 78.71057200431824 |
| mean | 77.41372299194336 |

Conclusion

From this, it seems that changing the seed, which changes the initial environment state, the initial weights and biases, and the policy noise sampling sequence, does affect the total run time. We, however, have to keep in mind that we calculated the mean over only 3 runs and can therefore not be 100% certain that this difference in execution time is not caused by internal variance. To be sure, we will use the same random seed in our further comparisons.

rickstaa commented 4 years ago

Inspect possible speed differences between the Graph- and Eager-based TensorFlow versions

Let's first compare the runtimes of all the different GRAPH-based TF versions:

While doing so, we have to keep in mind that the final runtime depends on:

As a result, during this comparison, I made sure that:

Make scripts more deterministic

Fix max_global_step breakout

First, to be able to compare the scripts, we have to make them more deterministic. Currently, the code that breaks the training loop once the set max_global_steps is reached is placed before the episode loop:

if global_step > ENV_PARAMS["max_global_steps"]:
    print(f"Training stopped after {global_step} steps.")
    break

As a result, the number of steps performed differs between runs. We therefore move this code inside the episode loop (a sketch of this change is shown below). The same applies to the number of evaluation paths that are used.
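A minimal sketch of this change (hypothetical loop structure and parameter values, not the exact LAC script):

```python
ENV_PARAMS = {"max_global_steps": 100, "max_episodes": 50, "max_ep_steps": 10}

global_step = 0
stop_training = False
for episode in range(ENV_PARAMS["max_episodes"]):
    for step in range(ENV_PARAMS["max_ep_steps"]):
        # Check the budget inside the episode loop so training stops at
        # max_global_steps instead of finishing the current episode first.
        if global_step > ENV_PARAMS["max_global_steps"]:
            print(f"Training stopped after {global_step} steps.")
            stop_training = True
            break
        global_step += 1  # environment step + learning update happen here
    if stop_training:
        break
```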

Set the num_of_evaluation_paths to zero

To be able to make a fair comparison, for now, it is better to disable the test performance evaluation. We can enable it again later when we compare the TensorFlow and PyTorch versions. To disable the test performance evaluation, we have to set num_of_evaluation_paths to zero (see the snippet below). By doing this, we get rid of execution-time differences caused by the fact that the policy has not yet converged within the small number of epochs we use. If we keep it enabled, a suboptimal policy causes the agent to reach the terminal state in the test environment faster --> fewer steps are performed --> a lower execution time.
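In code this is just a config change (assuming the same ENV_PARAMS dictionary used elsewhere in these scripts):

```python
# Disable evaluation rollouts so the measured run time only reflects training.
ENV_PARAMS["num_of_evaluation_paths"] = 0
```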

Make sure that the GPU is disabled in both tf1 and tf2

Where in tf1.15 you had to install the separate TensorFlow GPU package to be able to run your computations on the GPU, in tf2.x GPU computations are automatically enabled when a GPU is available. For a fair comparison in tf2.x, we therefore have to disable the GPU:

# Disable GPU if requested
if not USE_GPU:
    tf.config.set_visible_devices([], "GPU")
    print("Tensorflow is using CPU")
else:
    print("Tensorflow is using GPU")

Compare LAC_TF1_ORIGINAL with LAC_TF1_CLEANED

Non-seeded version

| | Equal between versions |
| --- | --- |
| Env state | :x: |
| Weights/biases | :x: |
| Policy noise | :x: |

LAC_TF1_ORIGINAL (TF 1.13)

| run | Execution time (s) |
| --- | --- |
| 1 | 60.41492462158203 |
| 2 | 56.50747847557068 |
| 3 | 57.71708631515503 |
| 4 | 57.85236382484436 |
| 5 | 56.82906222343445 |
| mean | 57.86418309211731 |

LAC_TF1_ORIGINAL_CLEANED (TF 1.13)

| run | Execution time (s) |
| --- | --- |
| 1 | 56.68896174430847 |
| 2 | 60.61775994300842 |
| 3 | 58.21556615829468 |
| 4 | 57.84604740142822 |
| 5 | 57.93125033378601 |
| mean | 58.2428765296936 |

Seeded versions

| | Equal between versions |
| --- | --- |
| Env state | :heavy_check_mark: |
| Weights/biases | :heavy_check_mark: |
| Policy noise | :heavy_check_mark: |

LAC_TF1_ORIGINAL_SEEDED (TF 1.13)

ENV_SEED = 0 RANDOM_SEED = 0

| run | Execution time (s) |
| --- | --- |
| 1 | 58.13782048225403 |
| 2 | 57.97405934333801 |
| 3 | 57.232763051986694 |
| 4 | 56.710532665252686 |
| 5 | 57.24989628791809 |
| mean | 57.461014366149904 |

LAC_TF1_ORIGINAL_CLEANED_SEEDED (TF 1.13)

ENV_SEED = 0 RANDOM_SEED = 0

| run | Execution time (s) |
| --- | --- |
| 1 | 58.51278495788574 |
| 2 | 56.43364667892456 |
| 3 | 57.122565269470215 |
| 4 | 57.078187704086304 |
| 5 | 57.13237237930298 |
| mean | 57.25591139793396 |

Conclusion

From the above results, we can conclude that the cleaned, seeded version has the same speed as the original non-seeded version. We can therefore use the cleaned, seeded version in the rest of our comparison.

Check the speed of the LAC_TF115_CLEANED_SEEDED version in multiple TensorFlow versions

Let's now compare the previous results with the results of the version that was translated to be compatible with tf1.15. I will run this version in tf1.15 as well as in tf2.2 and tf2.3 to see if upgrading to tf2.x changes the execution time. The script used in this comparison is LAC_TF115_CLEAN_SEEDED. To have a fair comparison, I made sure this script is compatible with both tf1 and tf2. We, however, have to keep in mind that although we use the same random seed, the tf1 and tf2 versions use a different random number generator to initialise the weights/biases and the policy noise. As a result, the initial weights/biases and the policy noise differ between the two versions. Further, while comparing tf1 with tf2, it is also important to note that the CPU/GPU behaviour has changed.

| | Equal between versions |
| --- | --- |
| Env state | :heavy_check_mark: |
| Weights/biases | :x: |
| Policy noise | :x: |

TF 1.15

| run | Execution time (s) |
| --- | --- |
| 1 | 54.607890367507935 |
| 2 | 53.35913848876953 |
| 3 | 54.10729217529297 |
| 4 | 53.443671464920044 |
| 5 | 53.87433481216431 |
| mean | 53.878465461730954 |

TF 2.2

| run | Execution time (s) |
| --- | --- |
| 1 | 67.4517731666565 |
| 2 | 68.7173764705658 |
| 3 | 67.91726160049438 |
| 4 | 67.26602745056152 |
| 5 | 69.36337637901306 |
| mean | 68.14316301345825 |

TF 2.3

| run | Execution time (s) |
| --- | --- |
| 1 | 62.94339990615845 |
| 2 | 63.5884804725647 |
| 3 | 63.10942029953003 |
| 4 | 63.25592613220215 |
| 5 | 63.239721059799194 |
| mean | 63.2273895740509 |

Conclusion

From these reports, it looks like the tf2.3 version is somewhat faster than the tf2.2 version, and the tf1.15 version is the fastest. When we check the full profiler reports of tf1.15 and tf2.3, we see that the main difference is found in the execution time of TensorFlow's flatten_dict_items method. Although this method is called nearly the same number of times (±1), the execution time is longer in the tf2.3 version.

[TF1.15 profiler screenshot]

[TF2.3 profiler screenshot]

However, to see whether this difference is significant or the time difference is explained by the different random number generator, we need to perform longer runs. Let's therefore now perform five rollouts of 1e5 steps each.

TF 1.15 Long run

| rollout | Execution time (s) |
| --- | --- |
| 1 | 530.8941695690155 |
| 2 | 533.3793187141418 |
| 3 | 534.8895123004913 |
| 4 | 538.3707988262177 |
| 5 | 534.9078004360199 |
| mean | 534.4883199691773 |

TF 2.3 Long run

| rollout | Execution time (s) |
| --- | --- |
| 1 | 626.8156836032867 |
| 2 | 630.5112099647522 |
| 3 | 628.7407052516937 |
| 4 | 626.2916738986969 |
| 5 | 622.048121213913 |
| mean | 625.92796630859383 |

Conclusion

From these results, we can see that running the script with tf2.3 takes on average 17% longer than running it with tf1.15. When comparing the reports, we see that this is caused by increases in the execution times of the following methods:

flatten_dict_items

As stated before, the biggest difference is found in the flatten_dict_items method. Although this method is called nearly the same number of times (±1), the execution time is longer in the tf2.3 version.

`__init__`

Although less prevalent, there is also a slowdown in the `_run.__init__` method. This slowdown is mainly caused by the longer execution time of the `__enter__` and `__exit__` functions.

I opened an issue on the TensorFlow GitHub, as this performance drop is caused by the TensorFlow framework itself. For now, it is best to use the slower TensorFlow 2.3 when comparing the LAC_TF115_CLEANED_SEEDED and LAC_TF2_GRAPH versions.

Compare LAC_TF115_CLEANED_SEEDED (TF2.3) with LAC_TF2_CLEANED_GRAPH (TF2.3)

Now let's compare the LAC_CLEAN_TF115 and LAC_TF2_CLEANED_GRAPH versions while keeping the random seed the same. This comparison uses the LAC_CLEAN_TF115_COMP_SEEDED and LAC_TF2_CLEANED_GRAPH folders and performs 1e4 steps in the environment. There should not be a difference between the two versions, as the only thing that changed is that tfp.bijectors.Affine is replaced by a combination of tfp.bijectors.Shift and tfp.bijectors.Scale.

LAC_TF2_CLEANED_GRAPH (TF2.3)

| run | Execution time (s) |
| --- | --- |
| 1 | 59.6379292011261 |
| 2 | 61.701316595077515 |
| 3 | 60.217915773391724 |
| 4 | 61.281046867370605 |
| 5 | 59.2477707862854 |
| mean | 60.417195844650266 |

Conclusion

From the results above, we can see that the LAC_TF2_CLEANED_GRAPH version indeed has the same performance as the LAC_TF115_CLEANED_SEEDED version. We can therefore use this version to compare against the LAC_TF2_CLEANED_EAGER version, in which eager mode is enabled.

Compare LAC_TF2_CLEANED_GRAPH (TF2.3) with LAC_TF2_CLEANED_EAGER (TF2.3)

LAC_TF2_CLEANED_EAGER

| run | Execution time (s) |
| --- | --- |
| 1 | 59.69400715827942 |
| 2 | 59.618980884552 |
| 3 | 59.51754140853882 |
| 4 | 59.57386136054993 |
| 5 | 58.85715985298157 |
| mean | 59.452310132980344 |

Conclusion

From this, we can see that the new tf2.3 Eager version is as fast as the Graph version. As noted above, it is, however, still about 20% slower than the tf1.15 version (a TensorFlow issue was opened for this).

Try to improve the speed of the TF2_EAGER version

Let's quickly try to improve the speed of the TensorFlow parts of the TF2_EAGER script. While doing this, we do not yet improve the non-TensorFlow Python code, as we still want a fair comparison with TensorFlow. As explained in the documentation, to speed up the computation we have to wrap the TensorFlow operations with the tf.function wrapper and cast the Python arguments of these functions to tensors to reduce retracing.
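A minimal illustration of this pattern (toy model and variable names, not the actual LAC update code):

```python
import tensorflow as tf

w = tf.Variable(1.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function  # traces the eager code into a graph on the first call
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(w * x - y))
    grads = tape.gradient(loss, [w])
    optimizer.apply_gradients(zip(grads, [w]))
    return loss

# Passing tensors instead of Python floats avoids retracing the function
# for every new scalar value.
loss = train_step(tf.constant(2.0), tf.constant(4.0))
```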

After we did this, we achieved the following result:

| run | Execution time (s) |
| --- | --- |
| 1 | 51.152666330337524 |
| 2 | 50.565937757492065 |
| 3 | 50.199445962905884 |
| 4 | 50.93405842781067 |
| 5 | 47.83208179473877 |
| mean | 50.136838054656984 |

Conclusion

From these reports we can conclude the following things:

rickstaa commented 4 years ago

This was fixed in ccf15db79ae801f4a3e756ac9521028fc6150086.