From my understanding, after setting an initial target return for the entire episode, each time we receive a reward from the environment after taking a step, we update it as

new_target_reward = target_reward - received_reward_from_env

Is this correct?
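For concreteness, here is a minimal sketch of that update inside a rollout loop. The environment choice and the `select_action` helper are placeholders, not part of any particular Decision Transformer implementation:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

def select_action(obs, return_to_go):
    # Placeholder for the Decision Transformer forward pass: a real
    # model would condition on (return_to_go, obs, past context).
    return env.action_space.sample()

obs, _ = env.reset()
target_return = 200.0  # initial desired return, picked by hand
done = False
while not done:
    action = select_action(obs, target_return)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # The update asked about: subtract the reward just received.
    target_return -= reward
```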
Multi-Game Decision Transformers (https://arxiv.org/abs/2205.15241) looks into learning the return-to-go the way a value function is learned, removing the need to specify it manually.
I understand that during training, at each time step, the transformer is fed the return-to-go. During inference, though, how would we compute the return-to-go, which needs to be supplied before each action? Do we do "desired reward" / episode_length?
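For what it's worth, in the original Decision Transformer setup the training-time return-to-go is not an average over the episode: it is the undiscounted sum of the remaining rewards in the logged trajectory, and at inference the initial value is a manually chosen target that is then decremented as in the formula above. A minimal sketch of the training-side computation (`returns_to_go` is just an illustrative helper name):

```python
import numpy as np

def returns_to_go(rewards):
    # RTG at step t is the sum of rewards from t to the end:
    # rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[-1]
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(returns_to_go(rewards))  # [4. 3. 3. 1.]
```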