hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

SAC results with large variance #1150

Closed dibbla closed 2 years ago

dibbla commented 2 years ago

Describe the bug
First, thank you for implementing such a great project. I'm using Stable Baselines to train SAC models that serve as experts for GAIL. However, I found that the performance of SAC agents trained with different seeds differs widely, which seems unlikely according to the original SAC paper.

Code example

import gym
import argparse
import sys
sys.path.append("..") 

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

from stable_baselines.sac.policies import MlpPolicy
from stable_baselines import SAC
from stable_baselines.common import set_global_seeds

parser = argparse.ArgumentParser(description='Para')
parser.add_argument('--step', metavar="step",type=int,
                    help='how many steps to train expert',default=1000000)
parser.add_argument('--seed', metavar="seed",type=int,
                    help='seed used to be set globally',default=0)
parser.add_argument('--model_dir', type=str, default= "experts/")

args = parser.parse_args()

# Setup seeds
save_path = args.model_dir + "seed-" + str(args.seed) + "-timestep-" + str(args.step)
env = gym.make("Hopper-v2")
set_global_seeds(args.seed)
env.seed(args.seed)

# Train
model = SAC(MlpPolicy, env, verbose=1, seed=args.seed)
set_global_seeds(args.seed)
env.seed(args.seed)
model.learn(total_timesteps=args.step, log_interval=200)
model.save(save_path)
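
I launch the script once per seed, roughly like this (just a sketch; "train_expert.py" is a placeholder name for the script above):

import subprocess

# Train one expert per seed by invoking the training script above
for seed in [0, 1, 2, 3, 42]:
    subprocess.run(
        ["python", "train_expert.py", "--seed", str(seed), "--step", "1000000"],
        check=True,
    )
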
My results (in Hopper-v2):

seed | performance
0    | 1539
1    | 3069
2    | 466
3    | 3460
42   | 655
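
Just to quantify the spread, a quick summary of the per-seed returns from the table above (numbers copied verbatim):

import numpy as np

# Final returns per seed, copied from the table above
returns_by_seed = {0: 1539, 1: 3069, 2: 466, 3: 3460, 42: 655}

returns = np.array(list(returns_by_seed.values()), dtype=float)
print("mean return: {:.1f}".format(returns.mean()))
print("std return : {:.1f}".format(returns.std()))
print("min / max  : {:.0f} / {:.0f}".format(returns.min(), returns.max()))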

And I use this for evaluation:

# Evaluate
model = SAC.load(save_path, env, seed=args.seed)
set_global_seeds(args.seed)
env.seed(args.seed)
avg_r = 0
for i in range(100):
    if i % 20 == 0:  # progress marker every 20 episodes
        print("|", end="|")
    r = 0
    obs = env.reset()
    env.seed(args.seed)
    print()
    print(obs)
    while True:
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        r += reward
        if done:
            break
    print("the {}th round result".format(i))
    print("the final reward is: ", r)
    avg_r += r
print()
print("The avg: {}".format(avg_r / 100))

System Info
Describe the characteristics of your environment:

Miffyli commented 2 years ago

Large variation between runs is not uncommon (I cannot say specifically for SAC, but I believe it has the same problem). As an "automatic reply" type of answer, I would recommend two things: