rlbayes / rllabplusplus


rllab++ on tensorflow-gpu #5

Closed andreafranceschetti closed 6 years ago

andreafranceschetti commented 6 years ago

Hi @shaneshixiang, I am trying to run rllab++ on the GPU version of TensorFlow (tensorflow-gpu 1.2.1) to speed up the agent's training. Unfortunately, I have not seen any improvement in performance, even though running the training session with

config = tf.ConfigProto(log_device_placement=True)

shows that all variables and operations are correctly placed on the GPU. Even so, GPU load stays close to zero. Running on the CPU is slightly faster, but it only uses at most 2 cores at any time.
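A standalone check along these lines (illustrative TF 1.x snippet, not the rllab++ launcher code) can be used to tell op placement apart from actual GPU load:

```python
# Illustrative TF 1.x sanity check (not the rllab++ launcher code): confirm that
# an op explicitly pinned to the GPU actually runs there and keeps the device busy.
import time
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
config.gpu_options.allow_growth = True  # do not reserve all GPU memory up front

with tf.device('/gpu:0'):
    a = tf.random_normal([2048, 2048])
    b = tf.matmul(a, a)  # large enough to show up as real GPU load

with tf.Session(config=config) as sess:
    start = time.time()
    for _ in range(100):
        sess.run(b)
    print('100 matmuls took %.2fs' % (time.time() - start))
```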

Do you have any useful tips for speeding up the training process on the GPU? The DDPG-based algorithms seem to suffer from the longest training times. Thank you for your help; your work on rllab++ is great!

rlbayes commented 6 years ago

Hi,

I assume you are talking about the slow off-policy fitting of Q for Q-Prop, IPG, etc.

Sorry, unfortunately I have not tested with GPUs. Indeed, the off-policy fitting of Q is generally the speed bottleneck when the simulation itself is fast. Since the neural networks are very small, placing them on GPUs does not help much, probably even after refactoring so that most of the compute runs as a single TensorFlow op.

One useful trick to speed up off-policy learning is to prefetch the batches sampled from the replay buffer; however, this is not part of the code at the moment. A rough sketch is below.
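A sketch of what such a prefetcher could look like (illustrative only; `replay_buffer.sample` here stands in for whatever buffer API is actually used):

```python
# Hypothetical sketch (not in the repo): sample replay-buffer batches on a
# background thread so the training loop never blocks on sampling.
import threading
import queue

def start_prefetcher(replay_buffer, batch_size, capacity=4):
    """Keep a small queue filled with sampled batches; consumer calls q.get()."""
    q = queue.Queue(maxsize=capacity)

    def worker():
        while True:
            # put() blocks when the queue is full, so the worker stays
            # only `capacity` batches ahead of the training loop
            q.put(replay_buffer.sample(batch_size))

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return q

# Pseudo-usage inside the off-policy fitting loop:
# batch_queue = start_prefetcher(pool, batch_size=64)
# for _ in range(n_updates):
#     batch = batch_queue.get()  # already sampled by the background thread
#     ...run the Q-fitting update on `batch`...
```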

An alternative approach could be a recent work (https://arxiv.org/abs/1710.11198v1) showing that on-policy fitting of Q can also be used as an effective control variate. I have not tried their specific version myself, but I plan to include it as a comparison at some point.

Thank you for your interest and feedback!


andreafranceschetti commented 6 years ago

Thanks for the quick response @shaneshixiang. I switched TF back to CPU-only computation. Yes, fitting the critic in Q-Prop also takes some time, but the main problem I am encountering is in the DDPG train method: it takes almost 30 seconds per epoch on the CPU. Furthermore, I very quickly get large values for AverageAction, AverageQLoss, etc.

To fully describe my experiment (Hopper-v1, MuJoCo Gym environment), I attach the log file of the run; I left every parameter in launcher_utils.py at its default value.

debug.log