Yes, you are right. By including multiple timesteps in the disc observation, the reward is no longer Markovian. But this doesn't seem to be that bad. There doesn't seem to be a negative impact on performance (at least with relatively short histories), and the motion quality tends to be better.
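To make that concrete, here is a minimal sketch (names and sizes like `K`, `amp_obs_dim`, `push_step`, and `disc_obs` are made up for illustration, not the repo's identifiers) of how a multi-step discriminator observation is assembled from a rolling history, so the style reward at step t depends on the last `numAMPObsSteps` states rather than on a single transition:

```python
import torch

# Illustrative sketch: a rolling buffer of the last K per-step AMP features.
K = 10              # e.g. numAMPObsSteps
amp_obs_dim = 105   # per-step feature size (arbitrary here)

hist = torch.zeros(K, amp_obs_dim)  # hist[0] holds the newest step

def push_step(hist, step_obs):
    # Shift older steps back by one slot and insert the newest observation.
    hist = torch.roll(hist, shifts=1, dims=0)
    hist[0] = step_obs
    return hist

def disc_obs(hist):
    # The discriminator sees the flattened K-step window, so the style
    # reward at step t is a function of the whole window, not just the
    # current transition.
    return hist.flatten()
```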
Regarding the latents for the encoder: yes, the latent can change over the course of the rollout. This just means that it might be impossible for the encoder to correctly predict z for those timesteps. But it can still do a good job on the other timesteps where z is fixed. Since the latents are fixed for most timesteps, on average that might not be so bad.
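A rough sketch of that argument (all numbers and names here are illustrative; only the uniform(1, 150) resampling interval comes from your description below) that counts how many K-step encoder windows in an episode actually straddle a latent switch:

```python
import torch

episode_len = 300
K = 10                       # history length fed to the encoder
latent_dim = 64

def sample_z():
    z = torch.randn(latent_dim)
    return z / z.norm()      # keep the latent on the unit sphere

# Assign a latent to every step, resampling after a uniform(1, 150) duration.
z_per_step = []
t = 0
while t < episode_len:
    z = sample_z()
    hold = int(torch.randint(1, 151, (1,)))
    z_per_step.extend([z] * min(hold, episode_len - t))
    t += hold

# Count K-step windows whose steps were not all generated under the same z.
mixed, total = 0, 0
for t in range(K - 1, episode_len):
    window = z_per_step[t - K + 1 : t + 1]
    total += 1
    if any(not torch.equal(z, window[-1]) for z in window):
        mixed += 1
print(f"{mixed}/{total} windows span a latent switch")
```

With latent durations that are long relative to K, only a small fraction of windows end up mixed, which is the "on average not so bad" point above.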
Hello Jason,
I have some questions regarding the encoding of transitions in the latent space. The paper describes encoding transitions between states at t and t+1. In practice, however, you use multiple steps for both AMP and the encoding. I understand that this helps with learning complex behaviors over long horizons (10 is the default here); for example, the humanoid in AMP cannot learn the backflip using only a transition of 2 steps. I think there might be two issues here, though:
The framework becomes non-Markovian with numAMPObsSteps > 2, as the reward depends on the past 9 steps as well as the current one, while the policy only takes the state at the current t.
The encoder also uses a sequence of numAMPObsSteps observations to encode into a latent z. This assumes that the policy was following the same z when producing them, but during training the latent z can be updated at resets or after some random latent_steps (sampled uniformly between 1 and 150), so some parts of the amp_observation could have been generated with a different latent than the one used at the current timestep.
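To illustrate this second concern, here is a toy sketch (the names `encoder` and `encoder_similarity`, and all sizes, are hypothetical, not the repo's code): the encoder maps the flattened numAMPObsSteps window to a latent prediction and is scored against the z active at the current step, so a window whose earlier steps were produced under a previous latent gives the encoder a partially mismatched input:

```python
import torch
import torch.nn as nn

# Hypothetical encoder sketch; sizes are arbitrary.
K, amp_obs_dim, latent_dim = 10, 105, 64

encoder = nn.Sequential(
    nn.Linear(K * amp_obs_dim, 512), nn.ReLU(),
    nn.Linear(512, latent_dim),
)

def encoder_similarity(window, z_t):
    # window: (K * amp_obs_dim,) flattened history ending at step t
    # z_t:    unit latent the policy is conditioned on at step t
    pred = encoder(window)
    pred = pred / pred.norm()   # keep the prediction on the unit sphere
    # Dot-product score between the prediction and the current latent;
    # steps in the window generated under an earlier latent give the
    # encoder conflicting evidence about z_t.
    return torch.dot(pred, z_t)
```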
Thank you