hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[Question] Using PPO1 on a cluster #957

AlessandroZavoli commented 4 years ago

I'm having a hard time trying to use PPO1 (or any other algorithm) on a SLURM-managed cluster.

In particular, I considered the example reported in the documentation:

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO1

env = gym.make('CartPole-v1')

model = PPO1(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("ppo1_cartpole")

del model # remove to demonstrate saving and loading

model = PPO1.load("ppo1_cartpole")

obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

and launched it with mpirun -n 2 python fileName.py. Yet it results in the same code being run twice.

Did anyone ever experience this kind of issue?
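
For what it's worth, a minimal check of what each process sees under mpirun (a sketch, assuming mpi4py is installed, which the MPI-based algorithms in stable-baselines rely on) would be something like:

import platform

from mpi4py import MPI

comm = MPI.COMM_WORLD
# With `mpirun -n 2`, this should print two lines with ranks 0 and 1
# and a world size of 2, confirming both processes are in the same
# MPI communicator rather than two independent runs.
print("host=%s rank=%d size=%d" % (platform.node(), comm.Get_rank(), comm.Get_size()))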

araffin commented 4 years ago

Yet it results in the same code run twice

What do you mean? That sounds expected: with MPI, each process runs the whole script, and the synchronization is done when computing the gradients.
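
If the duplicated logging and saving is what bothers you, a minimal sketch (untested here, assuming mpi4py, which the MPI algorithms already require) is to keep verbose output and the checkpoint on rank 0 only:

import gym
from mpi4py import MPI

from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO1

rank = MPI.COMM_WORLD.Get_rank()

env = gym.make('CartPole-v1')

# Every rank builds the model and calls learn(); the gradients are
# averaged across the MPI workers, so training is still shared.
model = PPO1(MlpPolicy, env, verbose=1 if rank == 0 else 0)
model.learn(total_timesteps=25000)

# Only rank 0 writes the checkpoint, to avoid concurrent writes.
if rank == 0:
    model.save("ppo1_cartpole")

Each worker still collects its own rollouts, which is why the iteration logs appear once per process.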

AlessandroZavoli commented 4 years ago

Here is an excerpt of the output. It seems to me that the same code is just executed twice, with no synchronization:

********* Iteration 39 ***********
-0.00011 | -0.00607 | 127.74117 | 2.85e-06 | 0.60663
------------------------------------
| EpLenMean | 97 |
| EpRewMean | 97 |
| EpThisIter | 1 |
| EpisodesSoFar | 102 |
| TimeElapsed | 20.1 |
| TimestepsSoFar | 9984 |
| ev_tdlam_before | -0.00719 |
| loss_ent | 0.60663414 |
| loss_kl | 2.846489e-06 |
| loss_pol_entpen | -0.0060663414 |
| loss_pol_surr | -0.00011206418 |
| loss_vf_loss | 127.74117 |
------------------------------------
********* Iteration 39 ***********
Optimizing...
pol_surr | pol_entpen | vf_loss | kl | ent
-3.17e-08 | -0.00590 | 108.54033 | 7.48e-10 | 0.58984
-9.48e-06 | -0.00590 | 108.53023 | 3.53e-09 | 0.58985
Optimizing...
pol_surr | pol_entpen | vf_loss | kl | ent
-1.22e-05 | -0.00590 | 108.51881 | 7.80e-09 | 0.58985
-3.13e-06 | -0.00588 | 114.89399 | 4.77e-09 | 0.58785
-1.34e-05 | -0.00590 | 108.50693 | 7.38e-09 | 0.58986
Evaluating losses...
-2.34e-05 | -0.00588 | 114.88756 | 1.50e-08 | 0.58784
-1.84e-05 | -0.00590 | 108.49850 | 8.54e-09 | 0.58986
------------------------------------
| EpLenMean | 99.8 |
| EpRewMean | 99.8 |
| EpThisIter | 1 |
| EpisodesSoFar | 102 |
| TimeElapsed | 20.4 |
| TimestepsSoFar | 10240 |
| ev_tdlam_before | 0.763 |
| loss_ent | 0.5898608 |
| loss_kl | 8.537363e-09 |
| loss_pol_entpen | -0.005898608 |
| loss_pol_surr | -1.8380582e-05 |
| loss_vf_loss | 108.498505 |
------------------------------------
-1.61e-05 | -0.00588 | 114.88094 | 2.55e-08 | 0.58784
-2.33e-05 | -0.00588 | 114.87449 | 2.28e-08 | 0.58785
Evaluating losses...
-2.83e-05 | -0.00588 | 114.87064 | 1.36e-08 | 0.58786
------------------------------------