rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

Incomplete relabelling of trajectories in HER #96

Closed rstrudel closed 4 years ago

rstrudel commented 4 years ago

Hi,

First, thanks a lot for releasing such a nice repository. I have been using it for a few months now and I appreciate that it is quite well written and that most of the code is self-explanatory. I learned a lot just by using it.

I am using SAC-HER and ran into a lot of divergence issues, which I fixed in the end. One of the main problems came from the relabelling of samples in the buffer: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/data_management/obs_dict_replay_buffer.py#L228-L239 Given a batch, a set of new rewards is computed according to the updated set of goals. However, the terminals are not updated, even though some states might be terminal given the new goal.
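To restate the problem with a toy example (my own simplified sketch, not the actual buffer internals):

```python
import numpy as np

# Toy illustration of the issue (not the rlkit source): goals are resampled and
# rewards recomputed against them, but the stored terminal flags are reused as-is.
def relabel(batch, resampled_goals, epsilon=0.05):
    achieved = batch["next_achieved_goals"]
    success = np.linalg.norm(achieved - resampled_goals, axis=1) < epsilon
    return {
        "rewards": success.astype(np.float32)[:, None],  # recomputed for the new goals
        "terminals": batch["terminals"],                  # stale: still tied to the old goals
    }

batch = {
    "next_achieved_goals": np.array([[0.0, 0.0]]),
    "terminals": np.array([[0.0]]),
}
print(relabel(batch, resampled_goals=np.array([[0.0, 0.0]])))
# rewards becomes 1 (goal reached under the new goal) but terminals stays 0
```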

This matters in the Bellman update, where the terminals variable appears: https://github.com/vitchyr/rlkit/blob/90195b24604f513403e4d0fe94db372d16700523/rlkit/torch/sac/sac.py#L128
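Concretely, the terminal flag masks the bootstrap term in the target; a minimal sketch of that computation (paraphrased, not the verbatim trainer code):

```python
import torch

# Sketch of how the terminal mask enters the Bellman target (paraphrased from the
# SAC trainer, reward scaling omitted). terminals = 1 cuts the bootstrap term,
# so the target collapses to the reward alone.
def q_target(rewards, terminals, target_q_values, discount=0.99):
    return rewards + (1.0 - terminals) * discount * target_q_values

rewards = torch.ones(4, 1)          # sparse reward: 1 at the goal
target_q_values = torch.ones(4, 1)  # suppose the target network already predicts 1
print(q_target(rewards, torch.zeros(4, 1), target_q_values))  # ~1.99: keeps growing
print(q_target(rewards, torch.ones(4, 1), target_q_values))   # 1.0: stays put
```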

If the reward is of the form -distance(state, goal) and an episode only ends because the maximum path length is reached, then not updating the terminals has little impact. This may be why the bug passed silently. However, I am working with a sparse reward which is 1 if distance(state, goal) < epsilon, and in this case, if terminals are not updated, the Q-function blows up. Indeed, assume target_q_values = 1 at the goal: if terminals = 0 then q_target = 2, at the next iteration q_target = 3, and so on. If terminals = 1, i.e. if the state is terminal according to the resampled goal, then q_target = 1.
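With a discount factor gamma < 1 the target does not grow forever but approaches 1 / (1 - gamma), which is still far larger than the correct value of 1; a quick numerical illustration (my own sketch, not rlkit code):

```python
# Repeatedly bootstrapping the target with reward 1 and no terminal mask.
gamma = 0.99
q = 0.0
for _ in range(500):
    q = 1.0 + (1.0 - 0.0) * gamma * q  # terminals = 0: q drifts towards 1/(1-gamma) = 100
print(round(q, 2))

q_terminal = 1.0 + (1.0 - 1.0) * gamma * q  # terminals = 1: target stays at the reward, 1.0
print(q_terminal)
```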

So in my fork of your repository, I replaced:

new_rewards = self.env.compute_rewards(
                new_actions,
                new_next_obs_dict,
            )

by

new_rewards, new_terminals = self.env.compute_rewards(
                new_actions,
                new_next_obs_dict,
            )

where new_terminals is 1 if distance(state, goal) < epsilon in my case. This fixed the Q-function blow-up issue.
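On the environment side, the modified compute_rewards looks roughly like this (a simplified sketch of my fork with illustrative names, not the exact code):

```python
import numpy as np

# Simplified sketch of the environment-side change: compute_rewards returns both
# rewards and terminals, evaluated against the (possibly relabelled) goals.
class SparseGoalEnvMixin:
    epsilon = 0.05  # success threshold on the goal distance

    def compute_rewards(self, actions, obs_dict):
        achieved = obs_dict["achieved_goal"]
        desired = obs_dict["desired_goal"]
        success = np.linalg.norm(achieved - desired, axis=1) < self.epsilon
        rewards = success.astype(np.float32)    # sparse reward: 1 within epsilon of the goal
        terminals = success.astype(np.float32)  # terminal once the goal is reached
        return rewards, terminals
```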

rstrudel commented 4 years ago

I hope this makes sense. I did not open a PR because it would break compatibility with existing environments.

vitchyr commented 4 years ago

That's a good point! That also makes a lot of sense as to why updating the termination condition would help.

One thing that's ambiguous is whether or not the episode should end when the state is reached. Some authors assume that it should, while others assume that it shouldn't. For example, the HER code released by the authors doesn't seem to update the termination condition.

At this point, I'm not sure I'd feel comfortable making this change for everyone, since it may not be the intended behavior and it requires changing the compute_rewards interface. Perhaps if you make it an option inside the HER code in a PR, that could be a nice way to make it optional without requiring the interface to change too drastically.
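For example, something roughly like this (just a sketch with placeholder names, not a final API), where the default path keeps the current behaviour:

```python
# Sketch of an opt-in relabelling option (placeholder names, not an actual rlkit API).
def apply_relabeling(batch, new_rewards, new_terminals=None,
                     recompute_terminals=False):
    relabeled = dict(batch, rewards=new_rewards)
    if recompute_terminals and new_terminals is not None:
        # opt-in: terminals follow the resampled goals
        relabeled["terminals"] = new_terminals
    return relabeled  # default: terminals are left as stored, as today
```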

For now I'll pin this issue so that it has more visibility, and feel free to make a PR as suggested above.

rstrudel commented 4 years ago

Thanks @vitchyr for this insightful answer.

One thing that's ambiguous is whether or not the episode should end when the state is reached.

Agreed. Indeed, the HER authors tested the algorithm on environments like the gym Fetch tasks, where an episode terminates only once the horizon is reached. Also, whether sparse or dense, they use rewards that are negative and converge to 0 as the state gets closer to the goal, which avoids the kind of blow-up I pointed out, since that only arises when the reward is positive near the goal.
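To make the contrast explicit, here is a rough fixed-point sketch (my own notation, not from the papers):

```latex
\[
  Q_{k+1} = r + \gamma\, Q_k
  \;\Longrightarrow\;
  Q_\infty = \frac{r}{1-\gamma}.
\]
% With the HER-paper convention $r \in [-1, 0]$: $-\frac{1}{1-\gamma} \le Q_\infty \le 0$,
% so a stale terminal only biases $Q$ downward and it stays bounded.
% With a positive sparse reward $r = 1$ near the goal: $Q_\infty = \frac{1}{1-\gamma} \gg 1$
% for $\gamma \approx 1$, which is exactly the blow-up described above.
```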

bycn commented 4 years ago

@rstrudel Did you find that terminating at the goal (rather than only at the horizon) made a difference in your experiments? (Only that part, not the -1 vs. 0 reward part.) Also, if anyone knows whether this is a big factor, or can point me towards some papers that address it, please let me know.

rstrudel commented 4 years ago

@bycn For my application, which is neural motion planning, it did make a difference. In my experiments I got the best results by setting the termination when the goal is reached and providing a positive reward in that case. It might be quite experiment-dependent, however.