hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] HER does not sample very last state of episode as achieved goal #666


nicoguertler commented 4 years ago

It seems to me that when HER samples an achieved goal from the replay buffer, it never samples the very last state of the episode. Is this intended?

As a consequence, the goal selection strategy "final" appears to always sample the second-to-last state. My understanding is that the paper suggests using the very last state, and in some environments this could actually make a difference.

https://github.com/nicoguertler/stable-baselines/blob/4fada47f1b71b7548c935b1f01c6fb04199b3d54/stable_baselines/her/replay_buffer.py#L99-L125
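
To make this concrete, here is a minimal sketch (not the library code) of the behavior in question. It assumes each stored transition is the 5-tuple (obs_t, action, reward, obs_tp1, done) and that observations are dicts with an 'achieved_goal' entry; the function name and toy data are hypothetical:

```python
def sample_achieved_goal_final(episode_transitions):
    # "final" strategy: pick the last *transition* of the episode...
    selected_transition = episode_transitions[-1]
    # ...but read its *old* observation (index 0), i.e. the
    # second-to-last observation of the episode.
    return selected_transition[0]['achieved_goal']

# Toy episode of 3 transitions over observations s0 -> s1 -> s2 -> s3:
episode = [
    ({'achieved_goal': 's0'}, 'a0', 0.0, {'achieved_goal': 's1'}, False),
    ({'achieved_goal': 's1'}, 'a1', 0.0, {'achieved_goal': 's2'}, False),
    ({'achieved_goal': 's2'}, 'a2', 0.0, {'achieved_goal': 's3'}, True),
]
print(sample_achieved_goal_final(episode))  # prints 's2', not the final 's3'
```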

araffin commented 4 years ago

Hello,

It seems to me that when HER samples an achieved goal from the replay buffer it never samples the very last state of the episode.

https://github.com/hill-a/stable-baselines/blob/4fada47f1b71b7548c935b1f01c6fb04199b3d54/stable_baselines/her/replay_buffer.py#L113

The index [-1] in Python refers to the last element of a list, no? (And indices start at zero.)

EDIT: this may be related to #400

nicoguertler commented 4 years ago

Hello and thank you for your answer!

The episode transitions are tuples containing the old observation, action, reward, new observation, and the done boolean.

https://github.com/hill-a/stable-baselines/blob/4fada47f1b71b7548c935b1f01c6fb04199b3d54/stable_baselines/her/replay_buffer.py#L75

The goal selection strategy "final" indeed chooses the last transition, but then uses the old observation (index 0), which corresponds to the second-to-last observation of the episode.

https://github.com/hill-a/stable-baselines/blob/4fada47f1b71b7548c935b1f01c6fb04199b3d54/stable_baselines/her/replay_buffer.py#L125

In my opinion, it would be closer to what the paper suggests if the goal selection strategy "final" used the new observation of the transition (index 3), which is the very last observation of the episode.
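
Concretely, a sketch of this suggested change (same hypothetical transition layout as above) would read the new observation instead:

```python
def sample_achieved_goal_final_suggested(episode_transitions):
    # obs_tp1 (index 3) of the last transition is the final
    # observation of the episode.
    return episode_transitions[-1][3]['achieved_goal']

# Toy episode of 2 transitions over observations s0 -> s1 -> s2:
episode = [
    ({'achieved_goal': 's0'}, 'a0', 0.0, {'achieved_goal': 's1'}, False),
    ({'achieved_goal': 's1'}, 'a1', 0.0, {'achieved_goal': 's2'}, True),
]
print(sample_achieved_goal_final_suggested(episode))  # prints 's2', the true final state
```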

This could be relevant when the last observation of the episode is important, for example when the episode ends with a contact between objects that appears only in the very last observation.

The other goal selection strategies do not use the very last observation either, but this is probably not a big problem as long as the episodes are long enough. In issue #578, however, it is reported that the implementation of HER does not work properly if the episode length is 1. I think this is precisely because of the behavior described above.
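
A short sketch of the length-1 case (same hypothetical layout as above) illustrates this:

```python
# Episode of length 1: a single transition from s0 to s1.
episode = [
    ({'achieved_goal': 's0'}, 'a0', 0.0, {'achieved_goal': 's1'}, True),
]
# Current "final" behavior reads index 0 of the last transition,
# i.e. the initial observation:
print(episode[-1][0]['achieved_goal'])  # prints 's0'
# The goal actually achieved after the only step, 's1', is never
# sampled, so hindsight goals add no information for length-1 episodes.
```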

Thank you for your help!

araffin commented 4 years ago

Thanks for the clarification. For #578, this seems normal for the "future" strategy (cf. the answer: https://github.com/hill-a/stable-baselines/issues/578#issuecomment-581178005).

For the rest, I need to think more about it.