hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

PPO - Meaning of update_fac and timestep variables #1135

Open · huvar opened this issue 3 years ago

huvar commented 3 years ago

I am trying to understand the meaning of these variables. Looking at the non-recurrent version, update_fac seems to be the number of steps per update. But in the recurrent version it does not seem to be (either by design or due to a bug).

For convenience I copied the code below. Can anyone help?

if states is None:  # nonrecurrent version
    update_fac = max(self.n_batch // self.nminibatches // self.noptepochs, 1)
    inds = np.arange(self.n_batch)
    for epoch_num in range(self.noptepochs):
        np.random.shuffle(inds)
        for start in range(0, self.n_batch, batch_size):
            timestep = self.num_timesteps // update_fac + ((epoch_num *
                                                            self.n_batch + start) // batch_size)
            end = start + batch_size
            mbinds = inds[start:end]
            slices = (arr[mbinds] for arr in (obs, returns, masks, actions, values, neglogpacs))
            mb_loss_vals.append(self._train_step(lr_now, cliprange_now, *slices, writer=writer,
                                                 update=timestep, cliprange_vf=cliprange_vf_now))
else:  # recurrent version
    update_fac = max(self.n_batch // self.nminibatches // self.noptepochs // self.n_steps, 1)
    assert self.n_envs % self.nminibatches == 0
    env_indices = np.arange(self.n_envs)
    flat_indices = np.arange(self.n_envs * self.n_steps).reshape(self.n_envs, self.n_steps)
    envs_per_batch = batch_size // self.n_steps
    for epoch_num in range(self.noptepochs):
        np.random.shuffle(env_indices)
        for start in range(0, self.n_envs, envs_per_batch):
            timestep = self.num_timesteps // update_fac + ((epoch_num *
                                                            self.n_envs + start) // envs_per_batch)
            end = start + envs_per_batch
            mb_env_inds = env_indices[start:end]
            mb_flat_inds = flat_indices[mb_env_inds].ravel()
            slices = (arr[mb_flat_inds] for arr in (obs, returns, masks, actions, values, neglogpacs))
            mb_states = states[mb_env_inds]
            mb_loss_vals.append(self._train_step(lr_now, cliprange_now, *slices, update=timestep,
                                                 writer=writer, states=mb_states,
                                                 cliprange_vf=cliprange_vf_now))
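
To make the difference concrete, here is a standalone sketch (with hypothetical hyperparameters, not values taken from an actual run) that plugs both formulas in; the two branches end up counting on very different scales:

# Standalone sketch, not library code: compare the two update_fac / timestep
# formulas quoted above under assumed hyperparameters.
n_envs, n_steps = 8, 128              # assumed rollout settings
nminibatches, noptepochs = 4, 4       # assumed PPO settings
n_batch = n_envs * n_steps            # 1024 samples per rollout
batch_size = n_batch // nminibatches  # 256 samples per minibatch

# Non-recurrent branch: update_fac is the number of samples per gradient step,
# so num_timesteps // update_fac behaves like a running minibatch-update index.
update_fac_ff = max(n_batch // nminibatches // noptepochs, 1)              # 64
# Recurrent branch: the extra // n_steps floors to 0 here and is clamped to 1,
# so num_timesteps // update_fac jumps by n_batch per rollout instead of by
# nminibatches * noptepochs.
update_fac_rec = max(n_batch // nminibatches // noptepochs // n_steps, 1)  # 1

for num_timesteps in (n_batch, 2 * n_batch):  # after the 1st and 2nd rollout
    print("non-recurrent:", num_timesteps // update_fac_ff)   # 16, then 32
    print("recurrent:    ", num_timesteps // update_fac_rec)  # 1024, then 2048
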
Miffyli commented 3 years ago

Related #134

The original implementation comes from the original baselines, I believe. Thankfully it only seems to be used for logging purposes (tracking when updates happen), and not for the actual training logic, so I think you can safely ignore it.
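
For illustration, here is a minimal TF1-style sketch (not the actual _train_step internals, and the tag name is made up) of how such an "update" value is typically consumed: it only sets the global_step, i.e. the x-coordinate of the point written to TensorBoard, so a different scale merely stretches the x-axis.

import tensorflow as tf  # TF1-style API, as used by stable-baselines

# Standalone sketch: the "update"/timestep value only determines where the
# logged scalar lands on the TensorBoard x-axis.
writer = tf.summary.FileWriter("/tmp/ppo_step_demo")
timestep = 1024  # e.g. a value produced by either formula above
summary = tf.Summary(value=[tf.Summary.Value(tag="loss/policy_loss", simple_value=0.5)])
writer.add_summary(summary, global_step=timestep)
writer.flush()
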