leggedrobotics / rsl_rl

Fast and simple implementation of RL algorithms, designed to run fully on GPU.

Handling of `timeouts` in Generalized Advantage Estimation #43

Open mohakbhardwaj opened 3 weeks ago


Hello,

Thank you for this great library! I have a question about how timeouts are handled when computing Generalized Advantage Estimation, specifically in the following line:

https://github.com/leggedrobotics/rsl_rl/blob/96393c41c55c0a905eff035875b97f3b837225fe/rsl_rl/algorithms/ppo.py#L346

If my understanding is correct, when a trajectory ends in a genuinely terminal state (i.e., a bad state such as the robot falling), that state is treated as absorbing and the TD error is simply `reward - value`. If the trajectory is instead truncated because the episode times out, the agent still needs to reason about the long-term value from the next state. In the line above, however, the reward at a timeout step is simply augmented with the value prediction for that same state multiplied by the discount factor, so the TD error at a timeout step becomes `r + gamma * V(s) - V(s)`, i.e., the bootstrap uses the value of the timed-out state itself rather than the value of the next state.
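
To make sure I am reading the code correctly, here is a minimal NumPy sketch of the computation as I understand it. This is not the library's actual code; the variable names, shapes, and the assumption that `done` is also set at timeouts are mine:

```python
# Minimal sketch of my reading of the timeout handling -- not the library's code.
import numpy as np

gamma, lam = 0.99, 0.95
T = 4                                          # steps in the rollout

rewards = np.array([1.0, 1.0, 1.0, 1.0])
values = np.array([10.0, 10.0, 10.0, 10.0])    # V(s_t) for the stored states
dones = np.array([0, 0, 0, 1], dtype=bool)     # done also set on timeout? (question 2)
timeouts = np.array([0, 0, 0, 1], dtype=bool)  # episode truncated at t = 3
last_value = 10.0                              # V(s_T) after the final step

# Step 1: at timeout steps the reward is augmented with gamma * V(s_t),
# i.e. the value of the *current* state, not the next one.
rewards = rewards + gamma * values * timeouts

# Step 2: standard GAE, where done masks out the bootstrapped next value.
advantages = np.zeros(T)
gae = 0.0
for t in reversed(range(T)):
    next_value = last_value if t == T - 1 else values[t + 1]
    not_done = 1.0 - float(dones[t])
    delta = rewards[t] + gamma * next_value * not_done - values[t]
    gae = delta + gamma * lam * not_done * gae
    advantages[t] = gae

# At t = 3 (the timeout step): delta = r + gamma * V(s_3) - V(s_3),
# since done masks the next value and the reward was pre-bootstrapped.
print(advantages)
```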

  1. Could you please explain, intuitively or mathematically, the rationale behind this handling of timeouts in the GAE computation?
  2. When designing an environment, should `done` be returned as `True` for both termination and timeout? (The snippet after this list shows the convention I currently assume.)
  3. Should we interpret `done` and `timeout` as corresponding to the next environment state (i.e., after the physics step) or the current state (before the physics step)?
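
For concreteness, here is the `step()` convention I currently assume when asking (2) and (3). The shapes and the `time_outs` key in `infos` are my assumptions about the expected interface, so please correct me if they don't match what the library expects:

```python
# The convention I am assuming when asking (2) and (3) -- the shapes and the
# "time_outs" key are my assumptions, not something I have confirmed in the code.
import torch

# obs, rewards, dones, infos = env.step(actions)
#   rewards: (num_envs,) reward for the transition just simulated
#   dones:   (num_envs,) True wherever env i is reset after this step
#   infos["time_outs"]: (num_envs,) True only where the reset is due to
#                       hitting the max episode length (truncation)
dones = torch.tensor([False, False, True, True])
time_outs = torch.tensor([False, False, False, True])

# Under this reading, "real" terminations (e.g. the robot falling) would be:
real_terminations = dones & ~time_outs   # tensor([False, False, True, False])
```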

Hope the above questions make sense, and happy to clarify more!