hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[potential bug] HER Replay buffer observations #745

Open johannes-dornheim opened 4 years ago

johannes-dornheim commented 4 years ago

Hi,

It seems like the state features that are added to the replay buffer in the HindsightExperienceReplayWrapper are built from the full observation dictionary (observation, achieved_goal, desired_goal).

https://github.com/hill-a/stable-baselines/blob/cfcdb2fc79260b25aa0c32faf4cf5d551f166f29/stable_baselines/her/replay_buffer.py#L153
https://github.com/hill-a/stable-baselines/blob/cfcdb2fc79260b25aa0c32faf4cf5d551f166f29/stable_baselines/her/replay_buffer.py#L184

The feature vectors 'obs_t' and 'obs_tp1' are generated by:

    def convert_dict_to_obs(self, obs_dict):
        """
        :param obs_dict: (dict<np.ndarray>)
        :return: (np.ndarray)
        """
        # Note: achieved goal is not removed from the observation
        # this is helpful to have a revertible transformation
        if isinstance(self.observation_space, spaces.MultiDiscrete):
            # Special case for multidiscrete
            return np.concatenate([[int(obs_dict[key])] for key in KEY_ORDER])
        return np.concatenate([obs_dict[key] for key in KEY_ORDER])

from the full (observation, achieved_goal, desired_goal) observation dictionaries.
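
(For reference, KEY_ORDER is defined in her/utils.py as ['observation', 'achieved_goal', 'desired_goal'], so the flattened vector always contains all three parts.)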

Imho, the 'achieved_goal' part is wrong. As you can see in the original publication of HER ([1], listing on p. 5), the goal-conditioned state is represented by (observation, desired_goal) features only. In most cases these extra features should affect the convergence of the underlying model. In my current application (goals are described by ~80 variables) it may lead to convergence problems (experiments are pending).

[1] https://arxiv.org/abs/1707.01495
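
To make the size difference concrete, here is a minimal sketch of what the wrapper stores versus what the paper's listing implies (the shapes are hypothetical, loosely matching the ~80 goal variables mentioned above):

    import numpy as np

    obs_dict = {
        'observation': np.zeros(10),    # hypothetical environment features
        'achieved_goal': np.zeros(80),  # ~80 goal variables, as above
        'desired_goal': np.zeros(80),
    }

    # what convert_dict_to_obs currently stores: all three parts concatenated
    stored = np.concatenate([obs_dict[key] for key in ('observation', 'achieved_goal', 'desired_goal')])
    print(stored.shape)       # (170,)

    # what the HER paper feeds the policy: observation and desired goal only
    paper_input = np.concatenate([obs_dict[key] for key in ('observation', 'desired_goal')])
    print(paper_input.shape)  # (90,)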

araffin commented 4 years ago

Hello,

> are features for the full observation-dictionary

Yes, this was a choice (to make things simpler). Please tell us if it makes a difference.

johannes-dornheim commented 4 years ago

After first experiments (with DQN and my personal env) it seems like it does not make much difference regarding the convergence speed. I will come back when I have time to produce quantitative results.

For the experiments I added an extra dict (hash(obs, desired) -> achieved) to enable the inversion of 'convert_dict_to_obs'. her/utils.py:

    def __init__(self, env):
        # side table: hash of the flattened (obs, desired) vector -> achieved goal
        self.achieved_goals = {}
        ...

    ...

    def convert_dict_to_obs(self, obs_dict):
        """
        :param obs_dict: (dict<np.ndarray>)
        :return: (np.ndarray)
        """
        # achieved goals are stored in the extra dict as hash(obs, desired) -> achieved
        # assumes the mapping (obs, desired) -> achieved to be unique!

        if isinstance(self.observation_space, spaces.MultiDiscrete):
            # Special case for multidiscrete
            obs = np.concatenate([[int(obs_dict[key])] for key in ['observation', 'desired_goal']])
        else:
            obs = np.concatenate([obs_dict[key] for key in ['observation', 'desired_goal']])

        self.achieved_goals[hash(obs.data.tobytes())] = obs_dict['achieved_goal']
        return obs

    def convert_obs_to_dict(self, observations):
        """
        Inverse operation of convert_dict_to_obs

        :param observations: (np.ndarray)
        :return: (OrderedDict<np.ndarray>)
        """
        return OrderedDict([
            ('observation', observations[:self.obs_dim]),
            ('achieved_goal', self.achieved_goals[hash(observations.data.tobytes())]),
            ('desired_goal', observations[self.obs_dim:]),
        ])
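
A quick round-trip sanity check for this patch (a sketch only; 'env' stands for any dict-observation goal env wrapped by the modified HERGoalEnvWrapper):

    import numpy as np

    wrapper = HERGoalEnvWrapper(env)
    obs_dict = env.reset()

    flat = wrapper.convert_dict_to_obs(obs_dict)  # (observation, desired_goal) only
    restored = wrapper.convert_obs_to_dict(flat)  # achieved_goal recovered from the side dict

    assert np.array_equal(restored['achieved_goal'], obs_dict['achieved_goal'])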