kzl / decision-transformer

Official codebase for Decision Transformer: Reinforcement Learning via Sequence Modeling.

Why does the training loss only consider actions and not rewards or state #76

Closed · LMCAV closed this issue 2 months ago

LMCAV commented 2 months ago

Hello, this is excellent work. However, I have a small question: why does the training loss only consider actions and not rewards or states? I think that if rewards or states were also predicted, the model would better understand the environment and the corresponding feedback, which might help performance. Is that correct? What problems or impacts could this bring? Could you please provide an explanation? Thank you very much!

kzl commented 2 months ago

Hi, as per Section 3 of the paper:

Training. We are given a dataset of offline trajectories. We sample minibatches of sequence length K from the dataset. The prediction head corresponding to the input token s_t is trained to predict a_t – either with cross-entropy loss for discrete actions or mean-squared error for continuous actions – and the losses for each timestep are averaged. We did not find predicting the states or returns-to-go to improve performance, although it is easily permissible within our framework (as shown in Section 5.4) and would be an interesting study for future work.

It's possible it could be beneficial, but it wasn't for our environments.
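For reference, a minimal sketch of that action-only objective for continuous actions (MSE on the predicted actions, averaged over valid timesteps). The function and variable names here are illustrative, not necessarily the exact ones used in this repo:

```python
import torch
import torch.nn.functional as F

def action_only_loss(action_preds, action_targets, attention_mask):
    """Compute MSE on predicted actions only; state and return predictions are ignored.

    action_preds, action_targets: (batch, K, act_dim)
    attention_mask: (batch, K), 1 for real timesteps, 0 for padding
    """
    act_dim = action_preds.shape[-1]
    # Drop padded timesteps before averaging the per-timestep losses.
    mask = attention_mask.reshape(-1) > 0
    preds = action_preds.reshape(-1, act_dim)[mask]
    targets = action_targets.reshape(-1, act_dim)[mask]
    return F.mse_loss(preds, targets)
```

For discrete actions, the same structure applies with a cross-entropy loss over action logits instead of MSE.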

Best, Kevin