rickstaa / LAC-TF2-TORCH-translation

Temporary repository to debug what goes wrong during the translation of the LAC algorithm from TF1 to Torch.

TF2 version slower than TF1 version #17

Closed rickstaa closed 4 years ago

rickstaa commented 4 years ago

User story

Since we want to make the PyTorch version as fast as the TF1 version, it might be a good idea to start by investigating why the TF2 version is slower than the TF1 version.

Resources

Main takeaways

Difference between the versions

TF115

In TF115 tfp.distributions.ConditionalTransformedDistribution is replaced by tfp.distributions.TransformedDistribution.

TF2

In TF2, tfp.bijectors.Affine is replaced by a combination of tfp.bijectors.Shift and tfp.bijectors.Scale.
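For reference, a minimal sketch of what this swap looks like (assuming a diagonal affine transform; the shift/scale values are illustrative, not taken from the LAC code):

```python
import tensorflow_probability as tfp

tfb = tfp.bijectors
mu, sigma = [0.0, 1.0], [1.0, 2.0]  # illustrative shift/scale values

# TF1-era TFP:
# affine = tfb.Affine(shift=mu, scale_diag=sigma)

# TF2-era TFP: Chain applies bijectors right-to-left, so this computes
# mu + sigma * x, matching Affine(shift=mu, scale_diag=sigma).
affine = tfb.Chain([tfb.Shift(mu), tfb.Scale(sigma)])
```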

Possible causes

Tools

rickstaa commented 4 years ago

Compare speed between versions

I tried to do this while keeping the following the same during training:

Because of a change in the Random Number Generator of the GlorotUniform initialiser, this is not possible when comparing versions below tf1.15 with tf>=1.15 without significantly rewriting one of the versions. As a result, the weights of the networks differ, and we therefore do not have a 100% fair comparison. I do not, however, expect this difference in network weight initialisation to cause a significant difference in running time.
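As a small illustration of the initialiser determinism (a sketch, not the LAC code; that the underlying RNG algorithm differs between TF releases is the point made above):

```python
import tensorflow as tf

# A seeded GlorotUniform initialiser is deterministic within one TF version,
# but because the random number generator behind it changed between releases,
# the same seed yields different weights in tf1.13 and tf>=1.15.
init = tf.keras.initializers.GlorotUniform(seed=0)
print(init(shape=(2, 3)))
```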

LAC_TF1_ORIGINAL

Tensorflow 1.13

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 76.22983193397522 |
| 2 | 79.77554988861084 |
| 3 | 81.70465159416199 |
| 4 | 69.69277858734131 |

The cProfile log file and results are:

tf1.zip

GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 110.71875476837158 |
| 2 | 103.70752668380737 |
| 3 | 101.92001938819885 |
| 4 | 105.97694063186646 |

The cProfile log file and results are:

tf1_original_gpu.zip

LAC_TF1_ORIGINAL_CLEANED

Tensorflow 1.13.0

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 72.85364580154419 |
| 2 | 79.3347327709198 |
| 3 | 69.82149505615234 |
| 4 | 71.70584893226624 |

The cProfile log file and results are:

tf1_cleaned.zip

Tensorflow 1.15.0

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 75.8878424167633 |
| 2 | 80.95311832427979 |
| 3 | 71.45233821868896 |
| 4 | 73.24926376342773 |

The cProfile log file and results are:

tf115_short_run_2.zip

LAC_TF2_GRAPH

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 90.4486939907074 |
| 2 | 80.69330954551697 |
| 3 | 96.18579578399658 |
| 4 | 91.07221174240112 |

The cProfile log file is:

tf2_cleaned_graph.zip

Tensorflow 2.3

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 82.71706700325012 |
| 2 | 86.21673941612244 |
| 3 | 88.57738018035889 |
| 4 | 85.81253170967102 |

The cProfile log file is:

tf2_cleaned_graph.zip

LAC_TF2_EAGER

Tensorflow 2.2

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 97.44767260551453 |
| 2 | 128.99831318855286 |
| 3 | 105.46797251701355 |
| 4 | 110.24135160446167 |

The cProfile log file is:

tf2_cleaned_eager_2_2.zip

GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 163.80216550827026 |
| 2 | 215.6815378665924 |
| 3 | 204.68237948417664 |
| 4 | 132.58426022529602 |

The cProfile log file is:

tf2_cleaned_eager_2_2_gpu.zip

Tensorflow 2.3

No GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 145.21399974822998 |
| 2 | 150.8920443058014 |
| 3 | 143.57212162017822 |
| 4 | 142.8047535419464 |

The cProfile log file is:

tf2_cleaned_eager_2_3.zip

GPU

When performing 4 rollouts of 1e4 steps each, the stdout times are:

| Rollout | Time (s) |
| --- | --- |
| 1 | 249.57833003997803 |
| 2 | 255.32116079330444 |
| 3 | 233.76185131072998 |
| 4 | 241.26305603981018 |

The cProfile log file is:

tf2_cleaned_eager_2_3_gpu.zip

Conclusion

From this, we can see that there is high variance in CPU load, speed and execution time. In all versions, the GPU variant is much slower because of the small size of our network. Further, speed seems to decrease drastically when we move from the TF2_GRAPH version to the TF2_EAGER version. Strangely, the speed decreases even further when we upgrade from tf2.2 to tf2.3.

The difference between the different Graph-based versions is partly caused by the way the script is stopped when max_global_steps has been reached. Currently, the script doesn't stop right away but performs more steps until the episode is over. This has to be fixed in our further comparison.

Some of the difference between the GRAPH and EAGER versions is caused by the fact that the noise inside the Gaussian actor is seeded differently. Due to the difference between the new Eager (object-oriented) and old Graph versions, we can not easily sample this noise with the same random seed. Currently, in the Graph version, when we create the graph of the next_state through the Gaussian actor, the noise sampler is seeded again, whereas in the Eager version the same sampler is reused:

lya_a_, _, _ = self._build_a(self.S_, reuse=True, seeds=self.ga_seeds)

I won't fix this since it is a lot of work, and I don't suspect it to cause the big difference in execution time.
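The distinction can be illustrated with TF2's stateful random generators (an assumed analogy, not the actual actor code): re-creating a seeded sampler restarts the noise sequence, while reusing one sampler continues it.

```python
import tensorflow as tf

# "Graph-style": each (re)build of the actor seeds its own sampler,
# so both samplers produce the same draws.
g1 = tf.random.Generator.from_seed(0)
g2 = tf.random.Generator.from_seed(0)
print(g1.normal([2]).numpy(), g2.normal([2]).numpy())  # identical

# "Eager-style": one sampler object is reused, so consecutive draws differ.
g = tf.random.Generator.from_seed(0)
print(g.normal([2]).numpy(), g.normal([2]).numpy())  # different
```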

rickstaa commented 4 years ago

Inspect the effect of using a different random seed

Let's check how much changing the random seed affects the runtime.

LAC_CLEAN_TF115 (TF1.15)

RANDOM_SEED=0

| run | Execution time (s) |
| --- | --- |
| 1 | 61.93721580505371 |
| 2 | 64.42950749397278 |
| 3 | 61.094213247299194 |
| mean | 62.48697884877523 |

RANDOM_SEED=6

| run | Execution time (s) |
| --- | --- |
| 1 | 61.93721580505371 |
| 2 | 64.42950749397278 |
| 3 | 61.094213247299194 |
| mean | 62.48697884877523 |

RANDOM_SEED=56

| run | Execution time (s) |
| --- | --- |
| 1 | 70.52894282341003 |
| 2 | 73.83393502235413 |
| 3 | 67.12420463562012 |
| mean | 70.49569416046143 |

RANDOM_SEED=1180

| run | Execution time (s) |
| --- | --- |
| 1 | 71.90123772621155 |
| 2 | 64.69407534599304 |
| 3 | 70.29362034797668 |
| mean | 68.9629778067271 |

RANDOM_SEED=0

| run | Execution time (s) |
| --- | --- |
| 1 | 67.44287109375 |
| 2 | 64.12875771522522 |
| 3 | 66.90567302703857 |
| mean | 66.33816663424174 |

RANDOM_SEED=4

| run | Execution time (s) |
| --- | --- |
| 1 | 64.22623944282532 |
| 2 | 70.38289451599121 |
| 3 | 68.56257486343384 |
| mean | 67.72390294075012 |

LAC_CLEAN_TF115 (TF2.3)

| run | Execution time (s) |
| --- | --- |
| 1 | 65.69019556045532 |
| 2 | 69.83095479011536 |
| 3 | 73.19888234138489 |
| mean | 69.57334423065186 |

LAC_TF2_CLEANED_GRAPH (TF2.3)

| run | Execution time (s) |
| --- | --- |
| 1 | 75.48882007598877 |
| 2 | 78.04177689552307 |
| 3 | 78.71057200431824 |
| mean | 77.41372299194336 |

Conclusion

From this, it seems that changing the seed, which changes the initial environment state, the initial weights and biases, and the policy noise sampling sequence, does affect the total run time. We, however, have to keep in mind that we calculated the mean over only 3 runs and can therefore not be 100% certain that this difference in execution time is not caused by internal variance. To be sure, we will use the same random seed in our further comparisons.

rickstaa commented 4 years ago

Inspect possible speed differences between the Graph- and Eager-based TensorFlow versions

Let's first compare the runtimes of all the different GRAPH-based TF versions:

While doing so, we have to keep in mind that the final runtime depends on:

As a result, during this comparison, I made sure that:

Make scripts more deterministic

Fix max_global_step breakout

First, to be able to compare the scripts, we have to make them more deterministic. Currently, the code that breaks the training loop once the set max_global_steps is reached is placed before the episode loop:

if global_step > ENV_PARAMS["max_global_steps"]:
    print(f"Training stopped after {global_step} steps.")
    break

As a result, the number of steps performed differs between runs. We therefore move this code inside the episode loop (a sketch of this change is shown below). The same applies to the number of evaluation paths that are used.
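A minimal sketch of this change (hypothetical loop structure and parameter values, not the exact LAC script):

```python
ENV_PARAMS = {"max_global_steps": 100, "max_episodes": 50, "max_ep_steps": 10}

global_step = 0
stop_training = False
for episode in range(ENV_PARAMS["max_episodes"]):
    for step in range(ENV_PARAMS["max_ep_steps"]):
        # Check the budget inside the episode loop so training stops at
        # max_global_steps instead of finishing the current episode first.
        if global_step > ENV_PARAMS["max_global_steps"]:
            print(f"Training stopped after {global_step} steps.")
            stop_training = True
            break
        global_step += 1  # environment step + learning update happen here
    if stop_training:
        break
```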

Set the num_of_evaluation_paths to zero

To be able to make a fair comparison, for now, it is better to disable the test performance evaluation. We can enable it again later when we compare the TensorFlow and PyTorch versions. To disable the test performance evaluation, we have to set num_of_evaluation_paths to zero (see the snippet below). By doing this, we get rid of execution-time differences caused by the fact that the policy has not yet converged within the small number of epochs we use. If we keep it enabled, a suboptimal policy causes the agent to reach the terminal state in the test environment faster --> fewer steps are performed --> a lower execution time.
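In code this is just a config change (assuming the same ENV_PARAMS dictionary used elsewhere in these scripts):

```python
# Disable evaluation rollouts so the measured run time only reflects training.
ENV_PARAMS["num_of_evaluation_paths"] = 0
```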

Make sure that the GPU is disabled in both tf1 and tf2

Where in tf1.15 you had to install the separate TensorFlow GPU package to be able to run your computations on the GPU, in tf2.x GPU computations are automatically enabled when a GPU is available. For a fair comparison in tf2.x, we therefore have to disable the GPU:

# Disable GPU if requested
if not USE_GPU:
    tf.config.set_visible_devices([], "GPU")
    print("Tensorflow is using CPU")
else:
    print("Tensorflow is using GPU")

Compare LAC_TF1_ORIGINAL with LAC_TF1_CLEANED

Non-seeded version

| | Equal between versions |
| --- | --- |
| Env state | :x: |
| Weights/biases | :x: |
| Policy noise | :x: |

LAC_TF1_ORIGINAL (TF 1.13)

| run | Execution time (s) |
| --- | --- |
| 1 | 60.41492462158203 |
| 2 | 56.50747847557068 |
| 3 | 57.71708631515503 |
| 4 | 57.85236382484436 |
| 5 | 56.82906222343445 |
| mean | 57.86418309211731 |

LAC_TF1_ORIGINAL_CLEANED (TF 1.13)

| run | Execution time (s) |
| --- | --- |
| 1 | 56.68896174430847 |
| 2 | 60.61775994300842 |
| 3 | 58.21556615829468 |
| 4 | 57.84604740142822 |
| 5 | 57.93125033378601 |
| mean | 58.2428765296936 |

Seeded versions

| | Equal between versions |
| --- | --- |
| Env state | :heavy_check_mark: |
| Weights/biases | :heavy_check_mark: |
| Policy noise | :heavy_check_mark: |

LAC_TF1_ORIGINAL_SEEDED (TF 1.13)

ENV_SEED = 0 RANDOM_SEED = 0

| run | Execution time (s) |
| --- | --- |
| 1 | 58.13782048225403 |
| 2 | 57.97405934333801 |
| 3 | 57.232763051986694 |
| 4 | 56.710532665252686 |
| 5 | 57.24989628791809 |
| mean | 57.461014366149904 |

LAC_TF1_ORIGINAL_CLEANED_SEEDED (TF 1.13)

ENV_SEED = 0 RANDOM_SEED = 0

| run | Execution time (s) |
| --- | --- |
| 1 | 58.51278495788574 |
| 2 | 56.43364667892456 |
| 3 | 57.122565269470215 |
| 4 | 57.078187704086304 |
| 5 | 57.13237237930298 |
| mean | 57.25591139793396 |

Conclusion

From the above results, we can conclude that the cleaned, seeded version has the same speed as the original non-seeded version. We can therefore use the cleaned, seeded version in the rest of our comparison.

Check the speed of the LAC_TF115_CLEANED_SEEDED version in multiple TensorFlow versions

Let's now compare the previous results with the results of the version that was translated to be compatible with tf1.15. I will run this version in tf1.15 as well as in tf2.2 and tf2.3 to see if upgrading to tf2.x changes the execution time. The script used in this comparison is LAC_TF115_CLEAN_SEEDED. To have a fair comparison, I made sure this script is compatible with both tf1 and tf2. We, however, have to keep in mind that although we use the same random seed, the tf1 and tf2 versions use a different random number generator to initialise the weights/biases and the policy noise. As a result, the initial weights/biases and the policy noise differ between the two versions. Further, while comparing tf1 with tf2, it is also important to note that the CPU/GPU behaviour has changed.

| | Equal between versions |
| --- | --- |
| Env state | :heavy_check_mark: |
| Weights/biases | :x: |
| Policy noise | :x: |

TF 1.15

| run | Execution time (s) |
| --- | --- |
| 1 | 54.607890367507935 |
| 2 | 53.35913848876953 |
| 3 | 54.10729217529297 |
| 4 | 53.443671464920044 |
| 5 | 53.87433481216431 |
| mean | 53.878465461730954 |

TF 2.2

| run | Execution time (s) |
| --- | --- |
| 1 | 67.4517731666565 |
| 2 | 68.7173764705658 |
| 3 | 67.91726160049438 |
| 4 | 67.26602745056152 |
| 5 | 69.36337637901306 |
| mean | 68.14316301345825 |

TF 2.3

| run | Execution time (s) |
| --- | --- |
| 1 | 62.94339990615845 |
| 2 | 63.5884804725647 |
| 3 | 63.10942029953003 |
| 4 | 63.25592613220215 |
| 5 | 63.239721059799194 |
| mean | 63.2273895740509 |

Conclusion

From these reports, it looks like the tf2.3 version is somewhat faster than the tf2.2 version, and the tf1.15 version is the fastest. When we check the full profiler reports of tf1.15 and tf2.3, we see that the main difference is found in the execution time of TensorFlow's flatten_dict_items method. Although this method is called nearly the same number of times (±1), the execution time is longer in the tf2.3 version.

[TF1.15 profiler screenshot]

[TF2.3 profiler screenshot]

However, to see whether this difference is significant or the time difference is explained by the different random number generator, we need to perform longer runs. Let's therefore now perform five rollouts of 1e5 steps each.

TF 1.15 Long run

| rollout | Execution time (s) |
| --- | --- |
| 1 | 530.8941695690155 |
| 2 | 533.3793187141418 |
| 3 | 534.8895123004913 |
| 4 | 538.3707988262177 |
| 5 | 534.9078004360199 |
| mean | 534.4883199691773 |

TF 2.3 Long run

| rollout | Execution time (s) |
| --- | --- |
| 1 | 626.8156836032867 |
| 2 | 630.5112099647522 |
| 3 | 628.7407052516937 |
| 4 | 626.2916738986969 |
| 5 | 622.048121213913 |
| mean | 625.92796630859383 |

Conclusion

From these results, we can see that running the script with tf2.3 takes on average 17% longer than running it with tf1.15. When comparing the reports, we see that this is caused by increases in the execution times of the following methods:

flatten_dict_items

As stated before, the biggest difference is found in the flatten_dict_items method. Although this method is called nearly the same number of times (±1), the execution time is longer in the tf2.3 version.

`__init__`

Although less prevalent, there is also a slowdown in the `_run.__init__` method. This slowdown is mainly caused by the longer execution time of the `__enter__` and `__exit__` functions.

I opened an issue on the TensorFlow GitHub, as this performance drop is caused by the TensorFlow framework itself. For now, it is best to use the slower TensorFlow 2.3 when comparing the LAC_TF115_CLEANED_SEEDED and LAC_TF2_GRAPH versions.

Compare LAC_TF115_CLEANED_SEEDED (TF2.3) with LAC_TF2_CLEANED_GRAPH (TF2.3)

Now let's compare the LAC_CLEAN_TF115 and LAC_TF2_CLEANED_GRAPH versions while keeping the random seed the same. This comparison uses the LAC_CLEAN_TF115_COMP_SEEDED and LAC_TF2_CLEANED_GRAPH folders and performs 1e4 steps in the environment. There should not be a difference between the two versions, as the only thing that changed is that tfp.bijectors.Affine is replaced by a combination of tfp.bijectors.Shift and tfp.bijectors.Scale.

LAC_TF2_CLEANED_GRAPH (TF2.3)

| run | Execution time (s) |
| --- | --- |
| 1 | 59.6379292011261 |
| 2 | 61.701316595077515 |
| 3 | 60.217915773391724 |
| 4 | 61.281046867370605 |
| 5 | 59.2477707862854 |
| mean | 60.417195844650266 |

Conclusion

From the results above, we can see that the LAC_TF2_CLEANED_GRAPH version indeed has the same performance as the LAC_TF115_CLEANED_SEEDED version. We can therefore use this version to compare against the LAC_TF2_CLEANED_EAGER version, in which eager mode is enabled.

Compare LAC_TF2_CLEANED_GRAPH (TF2.3) with LAC_TF2_CLEANED_EAGER (TF2.3)

LAC_TF2_CLEANED_EAGER

| run | Execution time (s) |
| --- | --- |
| 1 | 59.69400715827942 |
| 2 | 59.618980884552 |
| 3 | 59.51754140853882 |
| 4 | 59.57386136054993 |
| 5 | 58.85715985298157 |
| mean | 59.452310132980344 |

Conclusion

From this, we can see that the new tf2.3 Eager version is as fast as the Graph version. As noted above, it is, however, still about 20% slower than the tf1.15 version (a TensorFlow issue was opened for this).

Try to improve the speed of the TF2_EAGER version

Let's quickly try to improve the speed of the TensorFlow parts of the TF2_EAGER script. While doing this, we do not yet improve the non-TensorFlow Python code, as we still want a fair comparison with TensorFlow. As explained in the documentation, to speed up the computation we have to wrap the TensorFlow operations with the tf.function wrapper and cast the Python arguments of these functions to tensors to reduce retracing.
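A minimal illustration of this pattern (toy model and variable names, not the actual LAC update code):

```python
import tensorflow as tf

w = tf.Variable(1.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function  # traces the eager code into a graph on the first call
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(w * x - y))
    grads = tape.gradient(loss, [w])
    optimizer.apply_gradients(zip(grads, [w]))
    return loss

# Passing tensors instead of Python floats avoids retracing the function
# for every new scalar value.
loss = train_step(tf.constant(2.0), tf.constant(4.0))
```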

After we did this, we achieved the following result:

| run | Execution time (s) |
| --- | --- |
| 1 | 51.152666330337524 |
| 2 | 50.565937757492065 |
| 3 | 50.199445962905884 |
| 4 | 50.93405842781067 |
| 5 | 47.83208179473877 |
| mean | 50.136838054656984 |

Conclusion

From these reports we can conclude the following things:

rickstaa commented 4 years ago

This was fixed in ccf15db79ae801f4a3e756ac9521028fc6150086.