pfnet / pfrl

PFRL: a PyTorch-based deep reinforcement learning library

Hindsight Experience Replay #6

Open prabhatnagarajan opened 4 years ago

prabhatnagarajan commented 4 years ago

Hindsight Experience Replay with bit-flipping example: https://arxiv.org/abs/1707.01495
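
For anyone unfamiliar: the core of HER is relabeling stored transitions with goals that were actually achieved, so that failed episodes still produce useful reward signal. A minimal sketch of the "final" relabeling strategy (the function and the transition keys are illustrative, not PFRL's API):

```python
def relabel_final(episode, compute_reward):
    """Relabel an episode with its final achieved goal ('final' strategy).

    `episode` is a list of transition dicts and `compute_reward` maps
    (achieved_goal, desired_goal) to a reward; both are illustrative,
    not PFRL's actual interface.
    """
    final_goal = episode[-1]["achieved_goal"]
    relabeled = []
    for t in episode:
        new_t = dict(t)
        new_t["desired_goal"] = final_goal
        # Recompute the reward as if final_goal had been the goal all along.
        new_t["reward"] = compute_reward(t["achieved_goal"], final_goal)
        relabeled.append(new_t)
    return relabeled
```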

peasant98 commented 4 years ago

Hi,

What's the current status of this?

prabhatnagarajan commented 4 years ago

I'm currently working on it (on and off) on the following branch of my personal fork: https://github.com/prabhatnagarajan/pfrl/tree/her. I'm planning to apply HER to the bit-flip environment from the original paper that introduced HER. I'm fairly confident the Hindsight Experience Replay implementation itself is good, as we've successfully used a variant of it in other projects. However, performance on the bit-flip environment is currently poor and requires investigation.
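
For reference, the bit-flip environment is easy to state: the state and goal are n-bit vectors, each action flips one bit, and the reward is -1 until the state matches the goal. A minimal sketch of the environment as I understand it from the paper (for illustration; not the code on my branch):

```python
import numpy as np

class BitFlipEnv:
    """The n-bit flipping environment from the HER paper (sketch).

    State and goal are binary vectors; each action flips one bit;
    the reward is 0 on reaching the goal and -1 otherwise.
    """

    def __init__(self, n=10, seed=None):
        self.n = n
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = self.rng.integers(0, 2, size=self.n)
        self.goal = self.rng.integers(0, 2, size=self.n)
        self.steps = 0
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        # Flip the selected bit.
        self.state[action] ^= 1
        self.steps += 1
        success = np.array_equal(self.state, self.goal)
        reward = 0.0 if success else -1.0
        done = success or self.steps >= self.n
        return self.state.copy(), reward, done
```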

peasant98 commented 4 years ago

Ah cool, thanks for the update.

abagaria commented 3 years ago

HER requires that we make updates to the agent's policy and Q-function at the end of the episode. But PFRL assumes that agent.act(s) is followed by agent.observe(s', r) (as evidenced by its use of batch_last_action to keep track of actions). How are you going to deal with that?
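
For concreteness, the interaction pattern I mean is the standard PFRL loop, where transitions are handed to the agent one step at a time (a sketch; `env` and `agent` are assumed given):

```python
# Standard PFRL interaction loop: every act() is followed by an
# observe(), so the agent receives transitions one step at a time.
obs = env.reset()
done = False
while not done:
    action = agent.act(obs)
    obs, reward, done, info = env.step(action)
    # reset=True would signal a time-limit truncation rather than a
    # true terminal state; assume no truncation here for simplicity.
    agent.observe(obs, reward, done, reset=False)
```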

prabhatnagarajan commented 3 years ago

Note that the HindsightReplayBuffer extends the EpisodicReplayBuffer. If you look at the data structures within the EpisodicReplayBuffer, you can see that it maintains a current_episode, which is appended to the larger replay buffer only when the episode is stopped. This ensures that when we perform updates, we're not using incomplete episodes.
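
Schematically, the bookkeeping looks like this (a simplified sketch, not the actual class; the real buffer also handles capacity limits and multiple parallel envs):

```python
class EpisodicBufferSketch:
    """Simplified sketch of EpisodicReplayBuffer's episode bookkeeping."""

    def __init__(self):
        self.current_episode = []  # in-progress episode, not yet sampleable
        self.episodic_memory = []  # completed episodes only

    def append(self, transition):
        # Transitions accumulate here until the episode ends.
        self.current_episode.append(transition)

    def stop_current_episode(self):
        # Only now does the episode become available for sampling,
        # so updates never see an incomplete episode.
        if self.current_episode:
            self.episodic_memory.append(self.current_episode)
            self.current_episode = []
```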

About the use of batch_last_action, I'm not entirely sure what you're asking. If you look at this function, yes, we use batch_last_action, but it's added to the replay buffer, not used for updates. At the end of the function we call self.replay_updater.update_if_necessary(self.t), which performs a gradient update, but that update does not use batch_last_action.
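
Roughly, the flow in that function looks like this (a heavily simplified sketch of what I'm describing, not the actual code):

```python
def observe_batch(replay_buffer, replay_updater, t,
                  batch_last_obs, batch_last_action,
                  batch_obs, batch_reward, batch_done, batch_reset):
    """Sketch: batch_last_obs/batch_last_action were recorded by the
    preceding batch_act() call; they only feed the replay buffer."""
    for i in range(len(batch_obs)):
        # The stored last action becomes part of a stored transition...
        replay_buffer.append(
            state=batch_last_obs[i],
            action=batch_last_action[i],
            reward=batch_reward[i],
            next_state=batch_obs[i],
            is_state_terminal=batch_done[i],
            env_id=i,
        )
        if batch_done[i] or batch_reset[i]:
            replay_buffer.stop_current_episode(env_id=i)
    # ...while the gradient update samples from the buffer and never
    # reads batch_last_action directly.
    replay_updater.update_if_necessary(t)
```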

Does this answer your question? If not, feel free to clarify and I'll do my best to answer.