nicklashansen / tdmpc2

Code for "TD-MPC2: Scalable, Robust World Models for Continuous Control"
https://www.tdmpc2.com
MIT License

Possible implementation error in buffering TD experience in OnlineTrainer (lines 101-102)? #34

Closed: anh-nn01 closed this issue 2 weeks ago

anh-nn01 commented 2 weeks ago

Hi @nicklashansen ,

My question is about lines 101-102 in the OnlineTrainer.py file:

obs, reward, done, info = self.env.step(action)
self._tds.append(self.to_td(obs, action, reward))

My understanding is that you take a step in the environment using the current action, and then get back the new observation obs and the reward. You then buffer the experience in a TensorDict as (obs, action, reward).

It seems to me that in line 102, obs is the next state (updated in line 101) rather than the state at which action was taken. In other words, the buffered experience is (obs[t+1], action[t], reward[t]) instead of (obs[t], action[t], reward[t]). Am I misunderstanding something? Would you mind clarifying this part of the implementation?

Thank you very much!

nicklashansen commented 2 weeks ago

Your understanding of the code is correct! We account for this offset when sampling from the replay buffer: https://github.com/nicklashansen/tdmpc2/blob/5f6fadec0fec78304b4b53e8171d348b58cac486/tdmpc2/common/buffer.py#L73-L82
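
For reference, here is a minimal sketch of how such an offset correction can look at sampling time (illustrative only, not the repo's exact code; the function and variable names are assumptions). Each stored TensorDict at index t holds obs[t] together with action[t-1] and reward[t-1], and the first entry of an episode holds obs[0] with dummy action and reward. Dropping the first action and reward of a sampled sequence realigns the tuples as (obs[t], action[t], reward[t]):

import torch
from tensordict import TensorDict

def prepare_batch(seq: TensorDict):
    # seq has batch size (T+1, B): index 0 holds obs_0 with dummy action/reward,
    # index t >= 1 holds obs_t paired with action_{t-1} and reward_{t-1}.
    obs = seq['obs']                          # obs_0 ... obs_T
    action = seq['action'][1:]                # a_0 ... a_{T-1}, aligned with obs[:-1]
    reward = seq['reward'][1:].unsqueeze(-1)  # r_0 ... r_{T-1}
    return obs, action, reward

# Tiny usage example with random data:
T, B, obs_dim, act_dim = 4, 2, 3, 1
seq = TensorDict({
    'obs': torch.randn(T + 1, B, obs_dim),
    'action': torch.randn(T + 1, B, act_dim),  # index 0 is the dummy action
    'reward': torch.randn(T + 1, B),           # index 0 is the dummy reward
}, batch_size=(T + 1, B))
obs, action, reward = prepare_batch(seq)
print(obs.shape, action.shape, reward.shape)   # (5, 2, 3), (4, 2, 1), (4, 2, 1)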

anh-nn01 commented 2 weeks ago

Thank you! The offset handling in buffer.py clears up my concern.