Closed: rickstaa closed this issue 4 years ago
I tried to do this while training to keep:
Because of a change in the random number generator of the GlorotUniform initialiser, this is however not possible when comparing versions with tf<=1.15 against later versions without significantly rewriting one of them. As a result, the weights of the networks are different, and we therefore do not have a 100% fair comparison. I, however, do not expect this difference in network weight initialisation to cause a significant difference in running time.
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 76.22983193397522 |
2 | 79.77554988861084 |
3 | 81.70465159416199 |
4 | 69.69277858734131 |
The cProfile log file and results are:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 110.71875476837158 |
2 | 103.70752668380737 |
3 | 101.92001938819885 |
4 | 105.97694063186646 |
The cProfile log file and results are:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 72.85364580154419 |
2 | 79.3347327709198 |
3 | 69.82149505615234 |
4 | 71.70584893226624 |
The cProfile log file and results are:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 75.8878424167633 |
2 | 80.95311832427979 |
3 | 71.45233821868896 |
4 | 73.24926376342773 |
The cProfile log file and results are:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 90.4486939907074 |
2 | 80.69330954551697 |
3 | 96.18579578399658 |
4 | 91.07221174240112 |
The cProfile log file is:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 82.71706700325012 |
2 | 86.21673941612244 |
3 | 88.57738018035889 |
4 | 85.81253170967102 |
The cProfile log file is:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 97.44767260551453 |
2 | 128.99831318855286 |
3 | 105.46797251701355 |
4 | 110.24135160446167 |
The cProfile log file is:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 163.80216550827026 |
2 | 215.6815378665924 |
3 | 204.68237948417664 |
4 | 132.58426022529602 |
The cProfile log file is:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 145.21399974822998 |
2 | 150.8920443058014 |
3 | 143.57212162017822 |
4 | 142.8047535419464 |
The cProfile log file is:
When performing 4 rollouts of 1e4 steps each, the stdout times are:
Rollout | Time |
---|---|
1 | 249.57833003997803 |
2 | 255.32116079330444 |
3 | 233.76185131072998 |
4 | 241.26305603981018 |
The cProfile log file is:
From this, we can see that there is a high variance in the CPU load, speed and execution time. In all versions, because of the small size of our network, the GPU version is much slower. Further, the speed seems to decrease drastically when we move from the TF2_GRAPH version to the TF2_EAGER version. Strangely, the speed decreases further if we update from tf2.2 to tf2.3.
The difference between the different Graph-based versions is partly caused by the way the script is stopped when max_global_step has been reached. Currently, the script doesn't stop right away but performs more steps until the episode is over. This has to be fixed in our further comparison.
Some of the difference between the GRAPH and EAGER versions is caused by the fact that the noise inside the Gaussian actor is seeded differently. Due to the difference between the new Eager (object-oriented) and old Graph versions, we cannot easily sample this noise with the same random seed. Currently, in the Graph version, when we create the graph of the next_state through the Gaussian actor, the noise sampler is seeded again, whereas in the Eager version the same sampler is reused:
lya_a_, _, _ = self._build_a(self.S_, reuse=True, seeds=self.ga_seeds)
I won't fix this since it is a lot of work, and I don't suspect it to cause the big difference in execution time.
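A hedged sketch of the difference (using `tf.random.stateless_normal` as a stand-in for the actor's noise sampler; the actual sampler in the script may differ):

```python
import tensorflow as tf

# Sketch: a stateless sampler returns identical noise whenever it is
# called with the same seed, which mimics the Graph version where the
# sampler is re-seeded each time the actor graph is rebuilt.
noise_a = tf.random.stateless_normal(shape=(3,), seed=(0, 42))
noise_b = tf.random.stateless_normal(shape=(3,), seed=(0, 42))

# A stateful sampler (as reused in the Eager version) advances its
# internal state instead, so consecutive draws differ.
tf.random.set_seed(0)
noise_c = tf.random.normal(shape=(3,))
noise_d = tf.random.normal(shape=(3,))
```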
Let's check how much changing the random seed affects the runtime.
RANDOM_SEED=0
run | Execution time |
---|---|
1 | 61.93721580505371 |
2 | 64.42950749397278 |
3 | 61.094213247299194 |
mean | 62.48697884877523 |
RANDOM_SEED=6
run | Execution time |
---|---|
1 | 61.93721580505371 |
2 | 64.42950749397278 |
3 | 61.094213247299194 |
mean | 62.48697884877523 |
RANDOM_SEED=56
run | Execution time |
---|---|
1 | 70.52894282341003 |
2 | 73.83393502235413 |
3 | 67.12420463562012 |
mean | 70.49569416046143 |
RANDOM_SEED=1180
run | Execution time |
---|---|
1 | 71.90123772621155 |
2 | 64.69407534599304 |
3 | 70.29362034797668 |
mean | 68.9629778067271 |
RANDOM_SEED=0
run | Execution time |
---|---|
1 | 67.44287109375 |
2 | 64.12875771522522 |
3 | 66.90567302703857 |
mean | 66.33816663424174 |
RANDOM_SEED=4
run | Execution time |
---|---|
1 | 64.22623944282532 |
2 | 70.38289451599121 |
3 | 68.56257486343384 |
mean | 67.72390294075012 |
run | Execution time |
---|---|
1 | 65.69019556045532 |
2 | 69.83095479011536 |
3 | 73.19888234138489 |
mean | 69.57334423065186 |
run | Execution time |
---|---|
1 | 75.48882007598877 |
2 | 78.04177689552307 |
3 | 78.71057200431824 |
mean | 77.41372299194336 |
From this, it seems that changing the seed, which changes the initial environment state, the initial weights and biases, and the policy noise sampling sequence, does affect the total run time. We, however, have to keep in mind that we calculated the mean over only 3 runs and can therefore not be 100% certain that this difference in execution time is not caused by the internal variance. To be sure, we will use the same random seed in our further comparison.
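For reference, a minimal sketch of fixing every random source the script uses (the RANDOM_SEED name is taken from the tables above; the commented-out environment-seeding call is an assumed gym-style API, not code from the script):

```python
import random

import numpy as np
import tensorflow as tf

# Sketch: seed all random sources so that the initial environment
# state, initial weights/biases and policy noise are reproducible.
RANDOM_SEED = 0
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
# env.seed(RANDOM_SEED)  # hypothetical environment seeding

print(np.random.randint(0, 10))  # first draw is now reproducible
```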
Let's first compare the runtimes of all the different GRAPH-based TF versions:

- The original `tf1.13` version. In this version, only a random seed is set for the environment and the script random number generators.
- The cleaned `tf1.13` version: the same as `LAC_TF1_ORIGINAL`, but now the initial weights/biases and the policy noise have also been seeded.
- The same as `LAC_TF1_ORIGINAL_CLEANED`, but now the initial weights/biases and the policy noise have also been seeded.
- The `tf1.15` translated version of the original `LAC_TF1_ORIGINAL` version.
- The `tf1.15` translated version in which the random seeds for initialising the env, weights and noise are also set.
- The same as the `LAC_TF115_CLEANED` script, but with some small changes so that it can also be run with `tf2.2` and `tf2.3`.

While doing so, we have to keep in mind that the final runtime depends on:
As a result, during this comparison, I made sure that:
First, to be able to compare the scripts, we have to make them more deterministic. Currently, the code that breaks the training loop based on the set max_global_step is placed before the episode loop:
if global_step > ENV_PARAMS["max_global_steps"]:
print(f"Training stopped after {global_step} steps.")
break
As a result, the number of steps performed differs between runs. We therefore move this code inside the episode loop. The number of evaluation paths that are used also matters.
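A minimal sketch of the fix (the loop structure and episode length here are hypothetical, not the script's actual code):

```python
# Sketch: breaking inside the step loop stops training at exactly
# max_global_steps instead of first finishing the current episode.
max_global_steps = 100
global_step = 0
for episode in range(1000):
    for step in range(10):  # hypothetical maximum episode length
        global_step += 1
        if global_step >= max_global_steps:
            break
    if global_step >= max_global_steps:
        print(f"Training stopped after {global_step} steps.")
        break
```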
To be able to make a fair comparison, for now, it is better to disable the test-performance evaluation. We can enable it again later when we compare the TensorFlow and PyTorch versions. To disable the test-performance evaluation, we have to set num_of_evaluation_paths to zero. By doing this, we get rid of execution-time differences that are caused by the fact that the policy has not yet converged with the small number of epochs we use. If we keep it enabled, a suboptimal policy causes the agent to reach the terminal state in the test environment faster --> fewer steps will be performed --> meaning a lower execution time.
Where in tf1.15 you had to download the TensorFlow-GPU package to be able to run your computations on the GPU, in tf2.x GPU computations are enabled whenever a GPU is available. For a fair comparison, in tf2.x we therefore have to disable the GPU:
# Disable GPU if requested
if not USE_GPU:
    tf.config.set_visible_devices([], "GPU")
    print("Tensorflow is using CPU")
else:
    print("Tensorflow is using GPU")
Equal between versions | |
---|---|
Env state | :x: |
Weights/biases | :x: |
Policy noise | :x: |
run | Execution time |
---|---|
1 | 60.41492462158203 |
2 | 56.50747847557068 |
3 | 57.71708631515503 |
4 | 57.85236382484436 |
5 | 56.82906222343445 |
mean | 57.86418309211731 |
run | Execution time |
---|---|
1 | 56.68896174430847 |
2 | 60.61775994300842 |
3 | 58.21556615829468 |
4 | 57.84604740142822 |
5 | 57.93125033378601 |
mean | 58.2428765296936 |
Equal between versions | |
---|---|
Env state | :heavy_check_mark: |
Weights/biases | :heavy_check_mark: |
Policy noise | :heavy_check_mark: |
ENV_SEED = 0 RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 58.13782048225403 |
2 | 57.97405934333801 |
3 | 57.232763051986694 |
4 | 56.710532665252686 |
5 | 57.24989628791809 |
mean | 57.461014366149904 |
ENV_SEED = 0 RANDOM_SEED = 0
run | Execution time |
---|---|
1 | 58.51278495788574 |
2 | 56.43364667892456 |
3 | 57.122565269470215 |
4 | 57.078187704086304 |
5 | 57.13237237930298 |
mean | 57.25591139793396 |
From the above results, we can conclude that the cleaned, seeded version has the same speed as the original non-seeded version. We can therefore use the cleaned, seeded version in the rest of our comparison.
Let's now compare the previous results with the results of the version that was translated to be compatible with tf1.15. I will run this version in tf1.15 as well as tf2.2 and tf2.3 to see if upgrading to tf2.x changes the execution time. The script used in this comparison is LAC_TF115_CLEAN_SEEDED. To have a fair comparison, I made sure this script is compatible with both tf1 and tf2. We, however, have to keep in mind that although we use the same random seed, the tf1 and tf2 versions use different random number generators to initialise the weights/biases and the policy noise. As a result, the initial weights/biases and policy noise differ between the two versions. Further, while comparing tf1 with tf2, it is also important to note that the CPU/GPU behaviour has changed.
Equal between versions | |
---|---|
Env state | :heavy_check_mark: |
Weights/biases | :x: |
Policy noise | :x: |
run | Execution time |
---|---|
1 | 54.607890367507935 |
2 | 53.35913848876953 |
3 | 54.10729217529297 |
4 | 53.443671464920044 |
5 | 53.87433481216431 |
mean | 53.878465461730954 |
run | Execution time |
---|---|
1 | 67.4517731666565 |
2 | 68.7173764705658 |
3 | 67.91726160049438 |
4 | 67.26602745056152 |
5 | 69.36337637901306 |
mean | 68.14316301345825 |
run | Execution time |
---|---|
1 | 62.94339990615845 |
2 | 63.5884804725647 |
3 | 63.10942029953003 |
4 | 63.25592613220215 |
5 | 63.239721059799194 |
mean | 63.2273895740509 |
From these reports, it looks like the tf2.3 version is somewhat faster than the tf2.2 version, and the tf1.15 version is the fastest. When we check the full profiler reports of tf1.15 and tf2.3, we see that the main difference is found in the execution time of TensorFlow's flatten_dict_items method. Although this method is called nearly the same number of times (+-1), its execution time is longer in the tf2.3 version.
TF1.15
TF2.3
However, to see whether this difference is significant or whether the time difference is explained by the different random number generators, we need to perform longer runs. Let's therefore now perform five rollouts of 1e5 steps each.
rollout | Execution time |
---|---|
1 | 530.8941695690155 |
2 | 533.3793187141418 |
3 | 534.8895123004913 |
4 | 538.3707988262177 |
5 | 534.9078004360199 |
mean | 534.4883199691773 |
rollout | Execution time |
---|---|
1 | 626.8156836032867 |
2 | 630.5112099647522 |
3 | 628.7407052516937 |
4 | 626.2916738986969 |
5 | 622.048121213913 |
mean | 625.92796630859383 |
From these results, we can see that running the script using tf2.3 takes on average 17% longer than running with tf1.15. When comparing the reports, we see that this is caused by increases in the execution times of the following methods:
flatten_dict_items
As stated before, the biggest difference is found in the flatten_dict_items method. Although this method is called nearly the same number of times (+-1), its execution time is longer in the tf2.3 version.
init
Although less prevalent, there is also a slowdown in the _run.init method. This slowdown is mainly caused by the longer execution times of the __enter__ and __exit__ functions.
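For reference, a minimal sketch of how the cProfile logs referenced above can be produced and inspected (the `rollout` function here is a stand-in workload, not the script's actual rollout):

```python
import cProfile
import pstats

def rollout():
    # Stand-in workload representing one training rollout.
    return sum(i * i for i in range(10000))

# Profile the workload and dump the stats to a log file.
profiler = cProfile.Profile()
profiler.enable()
result = rollout()
profiler.disable()
profiler.dump_stats("rollout.prof")

# Load the log and sort by cumulative time, as done for the reports.
stats = pstats.Stats("rollout.prof").sort_stats("cumulative")
# stats.print_stats(10)  # show the ten most expensive calls
```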
I opened an issue on the TensorFlow GitHub, as this performance drop is caused by the TensorFlow framework itself. For now, it is best to use the slower TensorFlow 2.3 when comparing the LAC_TF115_CLEANED_SEEDED and LAC_TF2_GRAPH versions.
Now let's compare the LAC_CLEAN_TF115 and LAC_TF2_CLEANED_GRAPH versions while keeping the random seed the same. This comparison uses the LAC_CLEAN_TF115_COMP_SEEDED and LAC_TF2_CLEANED_GRAPH folders and performs 1e4 steps in the environment. There should not be a difference between the two versions, as the only change is that tfp.bijectors.Affine is replaced by a combination of tfp.bijector.Shift and tfp.bijector.Scale.
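Conceptually, the replacement is equivalence-preserving; a plain NumPy sketch (not the actual tfp code) of why:

```python
import numpy as np

# Sketch: the old Affine bijector computes y = shift + scale * x.
# Composing a Scale bijector followed by a Shift bijector yields the
# same forward transform.
def scale_fwd(x, scale):
    return scale * x

def shift_fwd(x, shift):
    return shift + x

x = np.array([0.0, 1.0, 2.0])
y_affine = 1.0 + 2.0 * x                     # Affine(shift=1, scale=2)
y_chain = shift_fwd(scale_fwd(x, 2.0), 1.0)  # Shift after Scale
print(y_chain)
```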
run | Execution time |
---|---|
1 | 59.6379292011261 |
2 | 61.701316595077515 |
3 | 60.217915773391724 |
4 | 61.281046867370605 |
5 | 59.2477707862854 |
mean | 60.417195844650266 |
From the results above, we can see that the LAC_TF2_CLEANED_GRAPH version indeed has the same performance as the LAC_TF115_CLEANED_SEEDED version. We can therefore use this version to compare it to the LAC_TF2_CLEANED_EAGER version, in which eager mode is enabled.
run | Execution time |
---|---|
1 | 59.69400715827942 |
2 | 59.618980884552 |
3 | 59.51754140853882 |
4 | 59.57386136054993 |
5 | 58.85715985298157 |
mean | 59.452310132980344 |
From this, we can see that the new tf2.3 Eager version is as fast as the graph version. As noted above, it is however about 20% slower than the tf1.15 version (a TensorFlow issue was opened for this).
Let's quickly try to improve the speed of the TensorFlow parts of the TF2_EAGER version script. While doing this, we do not yet improve the non-TensorFlow Python code, as we still want a fair comparison with TensorFlow. As explained in the documentation, to speed up the computation we have to wrap the TensorFlow operations with the tf.function wrapper and cast the Python arguments of these functions to tensors to reduce retracing.
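A hedged sketch of the kind of change made (the function and names here are illustrative, not the script's actual code):

```python
import tensorflow as tf

# Sketch: wrapping a TensorFlow computation in tf.function compiles it
# into a graph; passing tensors instead of Python numbers prevents a
# retrace for every new argument value.
@tf.function
def squared_sum(x):
    return tf.reduce_sum(tf.square(x))

a = squared_sum(tf.constant([1.0, 2.0]))  # traced once
b = squared_sum(tf.constant([3.0, 4.0]))  # reuses the traced graph
print(float(a), float(b))
```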
After we did this, we achieved the following result:
run | Execution time |
---|---|
1 | 51.152666330337524 |
2 | 50.565937757492065 |
3 | 50.199445962905884 |
4 | 50.93405842781067 |
5 | 47.83208179473877 |
mean | 50.136838054656984 |
From these reports we can conclude the following things:

- The script is slower in tf2.3 than in tf1.15 when eager mode is disabled. This is due to a performance issue in TensorFlow; an issue has been opened for this on the TensorFlow repository.
- In tf2.3 with eager mode enabled, the script is slightly faster than the old original LAC script.

This was fixed in ccf15db79ae801f4a3e756ac9521028fc6150086.
User story
Since we want the PyTorch version to be as fast as the TF1 version, it might be a good idea to start by investigating why the TF2 version is slower than the TF1 version.
Resources
Main takeaways
Difference between the versions
TF115
In TF115, tfp.distributions.ConditionalTransformedDistribution is replaced by tfp.distributions.TransformedDistribution.
TF2
In TF2, tfp.bijectors.Affine is replaced by a combination of tfp.bijector.Shift and tfp.bijector.Scale.
Possible causes
Tools