@jhlegarreta Good point, I included a mention of the usage of behavior/skill space. Also, I initially did not include the part on hierarchical RL because I'm not too knowledgeable about HRL, but you're right, I should have mentioned it. I also realized that the figure shows performance against model-free RL as well, so that's a plus.
Two questions:
I don't understand how maximizing mutual information between s and s' is maximizing the diversity of the behaviors. Can you explain, please? Or point me to something to read?
I understand that the method plans by constructing a sequence of states using q(s' | s, z), and that the reward can be computed from s. Is s a latent vector describing the state? If so, I'm not sure how you can get the reward from that. With the value function?
> I don't understand how maximizing mutual information between s and s' is maximizing the diversity of the behaviors. Can you explain, please? Or point me to something to read?
MI is actually maximized between s' and z, conditioned on s, so that z explains s' given s. I can point you to line 35 of my review 😬 As to why it's promoting the diversity of behaviors, again I can point you to equations (1) and (2) of the review, where the relation between the MI and the entropy H(s'|s) is shown. Maximizing H(s'|s) means promoting diverse transitions, but equation (2) constrains the diverse transitions to be explainable by z, otherwise H(s'|s,z) would also be high and the subtraction would give a low score. I agree this can be cryptic from just the equations, I'll clarify.
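To spell that out, the identity I'm leaning on is the standard decomposition of conditional mutual information, which is essentially what equations (1) and (2) of the review express:

$$I(s'; z \mid s) = H(s' \mid s) - H(s' \mid s, z)$$

Maximizing the first term pushes for diverse transitions out of a given state; keeping the second term low forces those transitions to be predictable once you know z. So the diversity has to be organized into distinguishable skills rather than being plain noise.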
> I understand that the method plans by constructing a sequence of states using q(s' | s, z), and that the reward can be computed from s. Is s a latent vector describing the state? If so, I'm not sure how you can get the reward from that. With the value function?
No, s follows standard RL notation and thus represents the state of the environment. $$q(s'|s,z)$$ are the learned dynamics of the environment, conditioned on the latent variable. So the model does not actually learn the whole environment dynamics, only the subset that pertains to the skills learned. I'll clarify. Because s does represent the state, you can feed it to the reward function, along with a candidate action coming from your behaviors, to get the actual environment reward. I'll clarify that as well.
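In case a sketch helps, here is roughly what I mean by planning with the learned skill dynamics. This is purely illustrative: the names `skill_dynamics`, `skill_policy`, `env_reward` and the random-search planner are my own placeholders, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical handles -- illustrative only, not the paper's API:
#   skill_dynamics(s, z) ~ q(s'|s, z) : predicts the next state under skill z
#   skill_policy(s, z)                : the learned behavior, returns an action
#   env_reward(s, a)                  : the environment's reward function

def plan_skill(skill_dynamics, skill_policy, env_reward, s,
               num_candidates=64, horizon=10):
    """Pick the latent z whose imagined rollout collects the most reward."""
    best_z, best_return = None, -np.inf
    for _ in range(num_candidates):
        z = np.random.uniform(-1.0, 1.0, size=2)   # sample a candidate skill
        s_pred, ret = s, 0.0
        for _ in range(horizon):
            a = skill_policy(s_pred, z)            # candidate action from the behavior
            ret += env_reward(s_pred, a)           # s is the raw env state, so the
                                                   # reward can be evaluated directly
            s_pred = skill_dynamics(s_pred, z)     # q(s'|s, z): learned skill dynamics
        if ret > best_return:
            best_z, best_return = z, ret
    return best_z  # execute skill_policy(., best_z) in the real environment
```

The point relevant to your question: because s is the raw environment state rather than a latent embedding, the predicted states can be scored with the environment reward directly, no value function needed.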
@AntoineTheb Thanks! I understand now.
~~Not ready for review yet~~ Donesies