Closed: yueyang130 closed this issue 9 months ago
Correct! I agree that this could be made clearer in the code documentation. This is because episodes contain one more observation than actions and rewards: $a$ and $r$ are set to filler random/NaN values at the first time index of an episode (at `env.reset()`), here:
which creates a TensorDict with filler values for action and reward when those arguments are not provided.
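For reference, here is a minimal sketch of what that conversion could look like (the name `to_td` and `action_dim=6` are illustrative assumptions, not necessarily the exact code in the repo):

```python
import torch
from tensordict import TensorDict

def to_td(obs, action=None, reward=None):
    # At env.reset() there is no preceding action/reward, so both are
    # stored as NaN placeholders (action_dim=6 is assumed for this sketch).
    if action is None:
        action = torch.full((6,), float('nan'))
    if reward is None:
        reward = torch.tensor(float('nan'))
    return TensorDict(
        dict(obs=obs, action=action, reward=reward),
        batch_size=(),
    )
```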
Perhaps this should also be made more consistent by setting both actions and rewards to NaN values.
Hi @yueyang130, I just wanted to let you know that commit https://github.com/nicklashansen/tdmpc2/commit/1f6c7771b92edd8d5502f910d5582ebf8ee88675 has now changed the default behavior to assign `nan` values to all filler indices (along with various other code improvements). I'm closing this issue, but feel free to reopen it or open another issue if you have any other questions.
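As a rough illustration (not code from the commit itself), downstream consumers can now detect filler steps uniformly, e.g. with a hypothetical helper like:

```python
import torch

def valid_mask(reward: torch.Tensor) -> torch.Tensor:
    # True for real transitions, False at NaN filler indices
    # (e.g., the first index of every episode under the new convention).
    return ~torch.isnan(reward)
```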
Hi Nicklas, got it!
Thanks for the update!
Hi @nicklashansen,

Thanks for your wonderful work!

https://github.com/nicklashansen/tdmpc2/blob/f3139291e2dc8e47480184a4a1bce05e8980caa3/tdmpc2/common/buffer.py#L28

I read your code for loading the offline multitask dataset and have a question. Specifically, as shown above, you shift action and reward by one timestep relative to obs. Does this mean that $a_{t+1}$ and $r_{t+1}$ correspond to $s_t$ in your data organization, and that the first action and reward in an episode are therefore unused (they have no corresponding state)?
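Concretely, I mean an alignment like the following sketch (all dimensions are made up for illustration):

```python
import torch

T = 3                           # number of environment steps (illustrative)
obs = torch.randn(T + 1, 8)     # T+1 observations s_0 ... s_T (obs_dim=8 is made up)
action = torch.randn(T + 1, 6)  # action_dim=6 is made up
reward = torch.randn(T + 1)
action[0] = float('nan')        # filler at the reset index
reward[0] = float('nan')

# Transition t pairs (s_t, a_{t+1}, r_{t+1}, s_{t+1}):
for t in range(T):
    s_t, a, r, s_next = obs[t], action[t + 1], reward[t + 1], obs[t + 1]
```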