quantumiracle / Popular-RL-Algorithms

PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet, etc.

Why does every PPO training run result in the same reward chart? This puzzles me very much. #48

Closed Alexzzdfjcn closed 3 years ago

Alexzzdfjcn commented 3 years ago

question

quantumiracle commented 3 years ago

Hi, please provide more details about the code you used. Did you run multiple rounds of training within the same script run? If so, plotting into exactly the same plt.figure will result in multiple curves on the same plot.
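
For illustration, a minimal standalone sketch (not from this repo) of how plotting several runs into the same figure number, without clearing it, stacks the curves on one plot:

```python
# Illustrative only: reusing plt.figure(num=1) across runs accumulates
# one curve per run in the same figure, assuming nothing clears the axes.
import matplotlib.pyplot as plt
import numpy as np

for run in range(3):
    rewards = np.cumsum(np.random.randn(100))  # dummy reward curve for this run
    plt.figure(num=1)            # same figure id every run -> curves pile up
    plt.plot(rewards, label=f'run {run}')

plt.legend()
plt.savefig('all_runs_overlaid.png')  # one plot containing all three curves
```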

Alexzzdfjcn commented 3 years ago

```python
def train():
    env = gym.make(ENV_NAME).unwrapped
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    drawer = Drawer()
    # reproducible
    # env.seed(RANDOMSEED)
    # np.random.seed(RANDOMSEED)
    # torch.manual_seed(RANDOMSEED)

    ppo = PPO(state_dim, action_dim, method=METHOD)
    global all_ep_r, update_plot, stop_plot, Angle, OPT_ANGLE
    all_ep_r = []
    Angle = []
    OPT_ANGLE = []
    for ep in range(EP_MAX):
        s = env.reset()
        ep_r = 0
        t0 = time.time()
        for t in range(EP_LEN):
            if RENDER:
                env.render()
            a = ppo.choose_action(s)
            ti = time.time()
            s_, S_temp, r, done, _ = env.step(s, a, ti)  # px
            ppo.store_transition(s, a, (r + 8) / 8)  # useful for pendulum since the nets are very small,
            # normalization makes it easier to learn
            s = s_
            ep_r += r
            angle, speed, height = s  # px
            # update ppo
            if len(ppo.state_buffer) == BATCH_SIZE:
                ppo.finish_path(s_, done)
                ppo.update()
            if done:
                break
        ppo.finish_path(s_, done)
        print(
            'Episode: {}/{}  | Episode Reward: {:.4f}  | Running Time: {:.4f}'.format(
                ep + 1, EP_MAX, ep_r,
                time.time() - t0
            )
        )
        if ep == 0:
            all_ep_r.append(ep_r)
        else:
            all_ep_r.append(all_ep_r[-1] * 0.9 + ep_r * 0.1)
            OPT_ANGLE.append(S_temp)  # px
            Angle.append(angle)  # px
        if PLOT_RESULT:
            update_plot.set()

    ppo.save_model()
    if PLOT_RESULT:
        stop_plot.set()
    env.close()
```

After I commented out lines 7 to 9, I found that the curve from each training run is no longer the same. Is it correct to modify it like this?

quantumiracle commented 3 years ago

I'm afraid the problem is not caused by the code you adopted from this repo. Also, I'm not clear which plotting function you used or which environment you are working on.

I guess you mean s = env.reset() for line 7; the environment reset is standard in RL and should not be removed in general. Maybe you are using a very deterministic environment without any noise; in that case the learning curve could come out the same if the model used exactly the same samples for its updates throughout learning. But that seems unlikely to me, because there is some randomness in the sampling inside choose_action. So I would check the plotting code you used more carefully.
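
For context, a hedged sketch (not the repo's exact code) of the kind of stochastic sampling meant here: a Gaussian policy outputs a mean and standard deviation, and the action is sampled from that distribution, so repeated calls differ unless the torch seed is fixed. The tiny network below is only a stand-in.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean = nn.Linear(state_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def choose_action(self, state):
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        dist = Normal(self.mean(state), self.log_std.exp())
        return dist.sample().squeeze(0).numpy()  # a new random sample each call

policy = GaussianPolicy(state_dim=3, action_dim=1)
print(policy.choose_action([0.1, 0.0, -0.2]))
print(policy.choose_action([0.1, 0.0, -0.2]))  # differs, unless torch.manual_seed was fixed
```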

Alexzzdfjcn commented 3 years ago

```python
class Drawer:
    def __init__(self, comments=''):
        global update_plot, stop_plot
        update_plot = threading.Event()
        update_plot.set()
        stop_plot = threading.Event()
        stop_plot.clear()
        self.title = ARGNAME
        if comments:
            self.title += '' + comments

    def plot(self):
        plt.ion()
        clear_output(True)  # px1013
        global all_ep_r, update_plot, stop_plot, Angle, OPT_ANGLE
        all_ep_r = []
        Angle = []
        OPT_ANGLE = []
        while not stop_plot.is_set():
            if update_plot.is_set():
                plt.figure(num=1, figsize=(20, 5))
                plt.cla()
                plt.title('Reward')  # px
                plt.plot(all_ep_r)
                # plt.ylim(-2000, 0)
                plt.xlabel('Episode')
                plt.ylabel('Moving averaged episode reward')
                plt.savefig(os.path.join('fig', 'Morphing reward_' + time_str))
                plt.figure(num=2, figsize=(20, 5))
                plt.cla()
                plt.title('Angle')
                x = list(range(0, len(Angle)))
                plot1 = plt.plot(Angle, 'r-', label='angle')
                plot2 = plt.plot(OPT_ANGLE, 'b--', label='opt_angle')
                # plt.ylim(-2000, 0)
                plt.xlabel('Episode')
                plt.ylabel('Morphing Angle')
                plt.savefig(os.path.join('fig', 'Morphing Angle_' + time_str))
                plt.legend()
                update_plot.clear()
                # px
            plt.draw()
            plt.pause(0.1)
        plt.ioff()
        plt.close()
```

This is the drawing code I used. I don't think there's anything strange about it. Could you please take a look?

I think the problem lies in these lines: `env.seed(RANDOMSEED)`, `np.random.seed(RANDOMSEED)`, `torch.manual_seed(RANDOMSEED)`. Because a fixed random seed is used, the random numbers generated are the same every run, so the action selected at each step is the same. I don't know if I am right?
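
For illustration, a small standalone check (not the training code itself) that a fixed torch seed makes the sampled numbers repeat exactly, which is the mechanism described above:

```python
# Illustrative only: with the same seed, the sampled "actions" repeat
# exactly across calls, mirroring why identical seeds give identical runs.
import torch
from torch.distributions import Normal

def sample_actions(seed, n=5):
    torch.manual_seed(seed)  # fixed seed -> same sequence every call
    dist = Normal(torch.zeros(1), torch.ones(1))
    return [dist.sample().item() for _ in range(n)]

print(sample_actions(2))   # these two lines print exactly the same numbers
print(sample_actions(2))
print(sample_actions(3))   # a different seed gives a different sequence
```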

quantumiracle commented 3 years ago

If it's like you said, you can simply verify it by using different random seeds.
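
For example, something along these lines (a hedged sketch; `train_one_seed` is a hypothetical wrapper around the `train()` posted above, not code from this repo):

```python
# Sketch of the suggested check: parameterize the seed, run training a few
# times with different values, and compare the saved reward curves.
import numpy as np
import torch

def train_one_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    # env.seed(seed)  # if the environment supports the old gym seeding API
    # ... run the existing training loop here and save/return all_ep_r ...

for seed in (1, 42, 2021):
    train_one_seed(seed)  # identical curves across these runs would mean the seed is not the cause
```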

Alexzzdfjcn commented 3 years ago

OK, I will try it. Thanks for your help!