kzl / decision-transformer

Official codebase for Decision Transformer: Reinforcement Learning via Sequence Modeling.

Why does the training loss only consider actions and not rewards or state #76

Closed · LMCAV closed this issue 2 months ago

LMCAV commented 2 months ago

Hello, this is excellent work. However, I have a small question: why does the training loss only consider actions and not rewards or states? I think that if rewards or states were also predicted, the model would better understand the environment and the corresponding feedback, which might help performance. Is that correct? What problems or impacts could this bring? Could you please provide an explanation? Thank you very much!

kzl commented 2 months ago

Hi, as per Section 3 of the paper:

Training. We are given a dataset of offline trajectories. We sample minibatches of sequence length K from the dataset. The prediction head corresponding to the input token s_t is trained to predict a_t – either with cross-entropy loss for discrete actions or mean-squared error for continuous actions – and the losses for each timestep are averaged. We did not find predicting the states or returns-to-go to improve performance, although it is easily permissible within our framework (as shown in Section 5.4) and would be an interesting study for future work.

It's possible it could be beneficial, but it wasn't for our environments.
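For reference, a minimal sketch of that action-only objective for continuous actions (MSE on the predicted actions, averaged over valid timesteps). The function and variable names here are illustrative, not necessarily the exact ones used in this repo:

```python
import torch
import torch.nn.functional as F

def action_only_loss(action_preds, action_targets, attention_mask):
    """Compute MSE on predicted actions only; state and return predictions are ignored.

    action_preds, action_targets: (batch, K, act_dim)
    attention_mask: (batch, K), 1 for real timesteps, 0 for padding
    """
    act_dim = action_preds.shape[-1]
    # Drop padded timesteps before averaging the per-timestep losses.
    mask = attention_mask.reshape(-1) > 0
    preds = action_preds.reshape(-1, act_dim)[mask]
    targets = action_targets.reshape(-1, act_dim)[mask]
    return F.mse_loss(preds, targets)
```

For discrete actions, the same structure applies with a cross-entropy loss over action logits instead of MSE.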

Best, Kevin