Closed: yueyang130 closed this issue 9 months ago
Correct! I agree that this could be made clearer in the code documentation. This is because episodes contain one more observation than actions and rewards: $a$ and $r$ are set to filler random/NaN values at the first time index of an episode (at `env.reset()`), here:
which creates a TensorDict with filler values for action and reward when those arguments are not provided.
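For reference, here is a minimal sketch of what that conversion could look like (the name `to_td` and `action_dim=6` are illustrative assumptions, not necessarily the exact code in the repo):

```python
import torch
from tensordict import TensorDict

def to_td(obs, action=None, reward=None):
    # At env.reset() there is no preceding action/reward, so both are
    # stored as NaN placeholders (action_dim=6 is assumed for this sketch).
    if action is None:
        action = torch.full((6,), float('nan'))
    if reward is None:
        reward = torch.tensor(float('nan'))
    return TensorDict(
        dict(obs=obs, action=action, reward=reward),
        batch_size=(),
    )
```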
Perhaps this should also be made more consistent by setting both actions and rewards to NaN values.
Hi @yueyang130, I just wanted to let you know that commit https://github.com/nicklashansen/tdmpc2/commit/1f6c7771b92edd8d5502f910d5582ebf8ee88675 has now changed the default behavior to assign `nan` values to all filler indices (along with various other code improvements). I'm closing this issue, but feel free to reopen it or open another issue if you have any other questions.
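As a rough illustration (not code from the commit itself), downstream consumers can now detect filler steps uniformly, e.g. with a hypothetical helper like:

```python
import torch

def valid_mask(reward: torch.Tensor) -> torch.Tensor:
    # True for real transitions, False at NaN filler indices
    # (e.g., the first index of every episode under the new convention).
    return ~torch.isnan(reward)
```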
Hi Nicklas, got it!
Thanks for the update!
Hi @nicklashansen,

Thanks for your wonderful work!

https://github.com/nicklashansen/tdmpc2/blob/f3139291e2dc8e47480184a4a1bce05e8980caa3/tdmpc2/common/buffer.py#L28

I read your code for loading the offline multitask dataset and have a question. Specifically, as shown above, you shift action and reward by one timestep relative to obs. Does this mean that $a_{t+1}$ and $r_{t+1}$ correspond to $s_t$ in your data organization, and that the first action and reward in an episode are therefore unused (they have no corresponding state)?
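Concretely, I mean an alignment like the following sketch (all dimensions are made up for illustration):

```python
import torch

T = 3                           # number of environment steps (illustrative)
obs = torch.randn(T + 1, 8)     # T+1 observations s_0 ... s_T (obs_dim=8 is made up)
action = torch.randn(T + 1, 6)  # action_dim=6 is made up
reward = torch.randn(T + 1)
action[0] = float('nan')        # filler at the reset index
reward[0] = float('nan')

# Transition t pairs (s_t, a_{t+1}, r_{t+1}, s_{t+1}):
for t in range(T):
    s_t, a, r, s_next = obs[t], action[t + 1], reward[t + 1], obs[t + 1]
```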