Closed: anh-nn01 closed this issue 2 weeks ago
**anh-nn01:**

Hi @nicklashansen,

My question is about lines 101-102 in the `OnlineTrainer.py` file. My understanding is that you take a step in the environment using the current `action`, and get back a new `obs` and `reward`. The experience is then buffered as `TensorDict(obs, action, reward)`.

It seems to me that in line 102, `obs` is the next state (updated in line 101), which is not the current state associated with `action`. In other words, what is actually buffered is `obs[t+1], action[t], reward[t]` instead of `obs[t], action[t], reward[t]`. Am I misunderstanding something? Would you mind clarifying this part of the implementation? Thank you very much!
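For readers without the repo open, here is a minimal self-contained sketch of the step-and-buffer pattern the question describes. `DummyEnv` and all shapes are stand-ins, not the repo's actual code; only the ordering of the two "lines" matters:

```python
import torch
from tensordict import TensorDict

class DummyEnv:
    """Toy environment used as a stand-in (assumption, for illustration only)."""
    def step(self, action):
        next_obs = torch.randn(4)      # obs[t+1]
        reward = torch.tensor(0.0)     # reward[t]
        return next_obs, reward

env = DummyEnv()
obs = torch.randn(4)                   # obs[t]
action = torch.zeros(2)                # action[t], chosen from obs[t]

obs, reward = env.step(action)         # "line 101": obs now holds obs[t+1]

# "line 102": the buffered entry pairs obs[t+1] with action[t] and reward[t],
# which is exactly the one-step offset the question asks about.
td = TensorDict({'obs': obs, 'action': action, 'reward': reward}, batch_size=[])
```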
**nicklashansen:**

Your understanding of the code is correct! We account for this offset when sampling from the replay buffer: https://github.com/nicklashansen/tdmpc2/blob/5f6fadec0fec78304b4b53e8171d348b58cac486/tdmpc2/common/buffer.py#L73-L82
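A minimal sketch of the idea behind the linked lines: because each stored entry pairs the *resulting* observation with the action/reward that produced it, dropping the first action and reward of a sampled window re-aligns `action[i]` with the `obs[i]` it was taken from. Shapes, names, and the horizon below are illustrative assumptions, not the verbatim buffer.py code:

```python
import torch
from tensordict import TensorDict

# A sampled window of consecutive buffer entries. Entry k stores
# (obs[k], action[k-1], reward[k-1]): the obs that resulted from the
# stored action, per the step loop discussed above.
H = 3  # sampled horizon (illustrative)
seq = TensorDict({
    'obs': torch.randn(H + 1, 4),      # obs[k], ..., obs[k+H]
    'action': torch.randn(H + 1, 2),   # action[k-1], ..., action[k+H-1]
    'reward': torch.randn(H + 1),      # reward[k-1], ..., reward[k+H-1]
}, batch_size=[H + 1])

# Undo the one-step offset by discarding the first action/reward of the
# window: afterwards, action[i] is the action taken from obs[i].
obs = seq['obs']                           # obs[k..k+H]
action = seq['action'][1:]                 # action[k..k+H-1]
reward = seq['reward'][1:].unsqueeze(-1)   # reward[k..k+H-1], shape (H, 1)
```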
**anh-nn01:**

Thank you! The offset in buffer.py clears my concern!