Hi, thank you for sharing this amazing code.
Recently, I've been looking into the detailed implementation of the code in relation to the paper "Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning."
From my understanding of the paper, the reward is bootstrapped in the event of a time-out, and this bootstrapping should use the value of the subsequent state, following the formula

$r_{\text{new}} = r + v(s')$,

where $s'$ is the state resulting from the current step. However, in the current implementation, the computation appears to be

$r_{\text{new}} = r + v(s)$,

where $s$ is the state in which the current step was taken.
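To make the distinction concrete, here is a minimal PyTorch sketch of the two variants. This is my own illustration, not the repository's code; `rewards`, `values`, `next_values`, and `time_outs` are hypothetical per-environment tensors for a single rollout step.

```python
import torch

# Hypothetical per-environment tensors for one rollout step (illustration only).
rewards = torch.tensor([1.0, 0.5])      # r: step reward
values = torch.tensor([10.0, 8.0])      # v(s): critic value of the state the action was taken in
next_values = torch.tensor([9.5, 7.5])  # v(s'): critic value of the state resulting from the step
time_outs = torch.tensor([1.0, 0.0])    # 1.0 where the episode ended due to the time limit

# Any discount factor is omitted here to match the formulas above.

# What I understand the paper to describe: bootstrap with the value of the next state.
rewards_bootstrap_next = rewards + next_values * time_outs

# What the current implementation appears to do: bootstrap with the value of the current state.
rewards_bootstrap_current = rewards + values * time_outs
```

The difference only matters on steps flagged as time-outs; everywhere else both variants reduce to the unmodified reward.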
Could you please clarify whether my understanding aligns with the intended design? I am curious whether this implementation choice was deliberate for specific reasons or whether it might be an oversight.
Thank you for your time and assistance.