Yes, you are right. By including multiple timesteps in the disc observation, the reward is no longer Markovian. But this doesn't seem to be that bad. There doesn't seem to be a negative impact on performance (at least with relatively short histories), and the motion quality tends to be better.
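To make that concrete, here is a minimal sketch (names and sizes like `K`, `amp_obs_dim`, `push_step`, and `disc_obs` are made up for illustration, not the repo's identifiers) of how a multi-step discriminator observation is assembled from a rolling history, so the style reward at step t depends on the last `numAMPObsSteps` states rather than on a single transition:

```python
import torch

# Illustrative sketch: a rolling buffer of the last K per-step AMP features.
K = 10              # e.g. numAMPObsSteps
amp_obs_dim = 105   # per-step feature size (arbitrary here)

hist = torch.zeros(K, amp_obs_dim)  # hist[0] holds the newest step

def push_step(hist, step_obs):
    # Shift older steps back by one slot and insert the newest observation.
    hist = torch.roll(hist, shifts=1, dims=0)
    hist[0] = step_obs
    return hist

def disc_obs(hist):
    # The discriminator sees the flattened K-step window, so the style
    # reward at step t is a function of the whole window, not just the
    # current transition.
    return hist.flatten()
```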
Regarding the latents for the encoder: yes, the latent can change over the course of the rollout. This just means that it might be impossible for the encoder to correctly predict z for those timesteps. But it can still do a good job on the other timesteps where z is fixed. Since the latents are fixed for most timesteps, on average that might not be so bad.
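A rough sketch of that argument (all numbers and names here are illustrative; only the uniform(1, 150) resampling interval comes from your description below) that counts how many K-step encoder windows in an episode actually straddle a latent switch:

```python
import torch

episode_len = 300
K = 10                       # history length fed to the encoder
latent_dim = 64

def sample_z():
    z = torch.randn(latent_dim)
    return z / z.norm()      # keep the latent on the unit sphere

# Assign a latent to every step, resampling after a uniform(1, 150) duration.
z_per_step = []
t = 0
while t < episode_len:
    z = sample_z()
    hold = int(torch.randint(1, 151, (1,)))
    z_per_step.extend([z] * min(hold, episode_len - t))
    t += hold

# Count K-step windows whose steps were not all generated under the same z.
mixed, total = 0, 0
for t in range(K - 1, episode_len):
    window = z_per_step[t - K + 1 : t + 1]
    total += 1
    if any(not torch.equal(z, window[-1]) for z in window):
        mixed += 1
print(f"{mixed}/{total} windows span a latent switch")
```

With latent durations that are long relative to K, only a small fraction of windows end up mixed, which is the "on average not so bad" point above.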
Hello Jason,
I have some questions regarding the encoding of transitions in the latent space. The paper describes encoding transitions between states at t and t+1. In practice, however, you use multiple steps for both AMP and the encoding. I understand that this helps with learning complex behaviors over long horizons (10 is the default here); for example, the humanoid in AMP cannot learn the backflip using only a transition of 2 steps. I think there might be two issues here, though:
The framework becomes non-Markovian with numAMPObsSteps > 2, as the reward depends on the past 9 steps as well as the current one, while the policy only takes the state at the current t.
The encoder also uses a sequence of numAMPObsSteps observations to encode into a latent z. This assumes that the policy was following the same z when producing them, but during training the latent z can be updated at resets or after some random latent_steps (sampled uniformly between 1 and 150), so some parts of the amp_observation could have been generated with a different latent than the one used at the current timestep.
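To illustrate this second concern, here is a toy sketch (the names `encoder` and `encoder_similarity`, and all sizes, are hypothetical, not the repo's code): the encoder maps the flattened numAMPObsSteps window to a latent prediction and is scored against the z active at the current step, so a window whose earlier steps were produced under a previous latent gives the encoder a partially mismatched input:

```python
import torch
import torch.nn as nn

# Hypothetical encoder sketch; sizes are arbitrary.
K, amp_obs_dim, latent_dim = 10, 105, 64

encoder = nn.Sequential(
    nn.Linear(K * amp_obs_dim, 512), nn.ReLU(),
    nn.Linear(512, latent_dim),
)

def encoder_similarity(window, z_t):
    # window: (K * amp_obs_dim,) flattened history ending at step t
    # z_t:    unit latent the policy is conditioned on at step t
    pred = encoder(window)
    pred = pred / pred.norm()   # keep the prediction on the unit sphere
    # Dot-product score between the prediction and the current latent;
    # steps in the window generated under an earlier latent give the
    # encoder conflicting evidence about z_t.
    return torch.dot(pred, z_t)
```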
Thank you