kzl / decision-transformer

Official codebase for Decision Transformer: Reinforcement Learning via Sequence Modeling.

Confusion over shape of returns_to_go in get_batch #38

Open DaveyBiggers opened 2 years ago

DaveyBiggers commented 2 years ago

Hi, I'm trying to understand the following code in get_batch() in gym/experiment.py:

rtg.append(discount_cumsum(traj['rewards'][si:], gamma=1.)[:s[-1].shape[1] + 1].reshape(1, -1, 1))
if rtg[-1].shape[1] <= s[-1].shape[1]:
    rtg[-1] = np.concatenate([rtg[-1], np.zeros((1, 1, 1))], axis=1)
...
tlen = s[-1].shape[1]

( from https://github.com/kzl/decision-transformer/blob/master/gym/experiment.py#:~:text=rtg.append(discount_cumsum,1))%5D%2C%20axis%3D1) )

As far as I can understand it, it's creating a sequence of (tlen + 1) rtg values, then checking whether the resulting length is <= tlen, and padding it with an extra zero if so. (I'm struggling to see how this situation would ever arise.) A few lines later, the general padding code is applied, pre-padding everything with 0s to length max_len, except for rtg, which will now be length max_len + 1.
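To make the shapes concrete, here's a toy standalone reproduction of what I think is happening (plain numpy, made-up sizes, and my own copy of discount_cumsum rather than the repo's):

import numpy as np

def discount_cumsum(x, gamma):
    # reverse cumulative sum of rewards, as in gym/experiment.py
    out = np.zeros_like(x)
    out[-1] = x[-1]
    for t in reversed(range(x.shape[0] - 1)):
        out[t] = x[t] + gamma * out[t + 1]
    return out

max_len = 5
rewards = np.ones(20, dtype=np.float32)  # toy trajectory
si = 3                                    # sampled start index
tlen = max_len                            # the states slice has tlen steps here

# the rtg slice grabs tlen + 1 values -- one more than the states slice
rtg = discount_cumsum(rewards[si:], gamma=1.)[:tlen + 1].reshape(1, -1, 1)
print(rtg.shape)  # (1, 6, 1) -> tlen + 1

# the later padding step pre-pads to max_len, so rtg stays one step longer
rtg = np.concatenate([np.zeros((1, max_len - tlen, 1)), rtg], axis=1)
print(rtg.shape)  # (1, 6, 1) -> max_len + 1, while states end up (1, max_len, state_dim)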

I don't understand the purpose of this extra value, especially since it seems to get stripped anyway by the SequenceTrainer:

state_preds, action_preds, reward_preds = self.model.forward(
    states, actions, rewards, rtg[:,:-1], timesteps, attention_mask=attention_mask,
)
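Just to spell out the shapes at that point (made-up sizes, only to show the one-step offset):

import torch

batch_size, max_len, state_dim = 4, 20, 11
states = torch.zeros(batch_size, max_len, state_dim)
rtg = torch.zeros(batch_size, max_len + 1, 1)    # one extra timestep from get_batch

assert rtg[:, :-1].shape[1] == states.shape[1]   # the slice just trims it back to max_len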

Am I missing something? Thanks!