Describe the bug
First, thank you for implementing such a great project. I'm using Stable Baselines to train SAC models as experts for GAIL. However, I found that the performance of SAC agents trained with different seeds differs widely, which seems unlikely given the results reported in the original SAC paper.
Code example
import gym
import argparse
import sys
sys.path.append("..")
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
from stable_baselines.sac.policies import MlpPolicy
from stable_baselines import SAC
from stable_baselines.common import set_global_seeds
parser = argparse.ArgumentParser(description='Para')
parser.add_argument('--step', metavar="step", type=int,
                    help='how many steps to train expert', default=1000000)
parser.add_argument('--seed', metavar="seed", type=int,
                    help='seed used to be set globally', default=0)
parser.add_argument('--model_dir', type=str, default="experts/")
args = parser.parse_args()
# Setup seeds
save_path = args.model_dir + "seed-" + str(args.seed) + "-timestep-" + str(args.step)
env = gym.make("Hopper-v2")
set_global_seeds(args.seed)
env.seed(args.seed)
# Train
model = SAC(MlpPolicy, env, verbose=1, seed=args.seed)
set_global_seeds(args.seed)
env.seed(args.seed)
model.learn(total_timesteps=args.step, log_interval=200)
model.save(save_path)
My results (on Hopper-v2):
seed  performance
0     1539
1     3069
2     466
3     3460
42    655
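For context, a quick way to quantify this spread (a minimal sketch; the numbers are just the mean and standard deviation of the five returns in the table above):

import numpy as np

# per-seed returns from the table above (seeds 0, 1, 2, 3, 42)
returns = np.array([1539, 3069, 466, 3460, 655])
print("mean: {:.0f}".format(returns.mean()))  # ~1838
print("std:  {:.0f}".format(returns.std()))   # ~1226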
And I use the following code for evaluation:
# Evaluate
model = SAC.load(save_path, env, seed=args.seed)
set_global_seeds(args.seed)
env.seed(args.seed)
avg_r = 0
for i in range(100):
    if i % 20 == 0:  # progress marker every 20 episodes (the original `i/20 == 0` only fired at i == 0)
        print("|", end="|")
    r = 0
    obs = env.reset()
    env.seed(args.seed)  # note: reseeding here makes every subsequent reset start from the same state
    print()
    print(obs)
    while True:
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
        r += reward
        if done:
            break
    print("the {}th round result".format(i))
    print("the final reward is:", r)
    avg_r += r
print()
print("The avg: {}".format(avg_r / 100))
System Info
Describe the characteristic of your environment:
Large variation between runs is not uncommon (I cannot say specifically for SAC, but I believe it suffers from the same problem). As an "automatic reply" type of answer, I would recommend two things:
Change to stable-baselines3, if possible (more up-to-date code), and
Check the SB3 Zoo for reference results and a verified training and evaluation regimen (see the sketch below).
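For instance, a rough SB3 equivalent of the script above might look like the following (a sketch only: it uses SB3 defaults rather than the tuned Zoo hyperparameters, and evaluate_policy replaces the hand-written evaluation loop):

import gym
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

seed = 0
env = gym.make("Hopper-v2")

# passing seed to the model also seeds the env and action space in SB3
model = SAC("MlpPolicy", env, verbose=1, seed=seed)
model.learn(total_timesteps=1000000, log_interval=200)
model.save("experts/seed-{}".format(seed))

# evaluate_policy runs the episode loop; deterministic=True uses the mean action
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100,
                                          deterministic=True)
print("mean reward: {:.0f} +/- {:.0f}".format(mean_reward, std_reward))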