kzl / decision-transformer

Official codebase for Decision Transformer: Reinforcement Learning via Sequence Modeling.
MIT License

Potential bug: Attention mask allows access to future tokens? #30

Closed: donthomasitos closed this issue 2 years ago

donthomasitos commented 2 years ago

Thanks for sharing your code, great work. There is one detail I don't understand; maybe you can help:

During the batch data generation, you fill the masks with ones wherever valid history trajectory data is available: https://github.com/kzl/decision-transformer/blob/f04280e3668a992c41b38bdfb6b6181d61b4dc52/gym/experiment.py#L154

Isn't it common practice in autoregressive models to use a triangular (causal) mask? In your training code, you consider all actions where the mask is > 0. Doesn't this allow actions early in the sequence to attend to subsequent tokens? https://github.com/kzl/decision-transformer/blob/5605e40ce763bb27ff0cf5beb06210d08771e047/gym/decision_transformer/training/seq_trainer.py#L18

Thanks!
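
For context, the masking in seq_trainer.py that this question refers to boils down to roughly the following self-contained sketch (paraphrased, with made-up shapes; not the exact source):

```python
import torch

# Hypothetical shapes for illustration: 4 trajectories, context length 20,
# 6-dimensional continuous actions.
batch_size, seq_len, act_dim = 4, 20, 6
action_preds = torch.randn(batch_size, seq_len, act_dim)
action_target = torch.randn(batch_size, seq_len, act_dim)

# Padding mask from the batch sampler: 1 for real timesteps, 0 for the
# left-padding that fills out short histories.
attention_mask = torch.ones(batch_size, seq_len)
attention_mask[:, :5] = 0  # pretend the first 5 steps are padding

# Only positions with mask > 0 contribute to the MSE loss, i.e. the mask here
# removes padded timesteps from the loss; it is not a causality constraint.
keep = attention_mask.reshape(-1) > 0
preds = action_preds.reshape(-1, act_dim)[keep]
targets = action_target.reshape(-1, act_dim)[keep]
loss = torch.mean((preds - targets) ** 2)
print(loss.item())
```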

aleSuglia commented 2 years ago

Thanks @donthomasitos, I think you're right. This looks like a bug from my point of view. I hope we're wrong :)

kzl commented 2 years ago

See the documentation for attention mask here:

"attention_mask (torch.FloatTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked."

This is not the same as the triangular matrix used when computing the attention values.
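
To make the distinction concrete, here is a small illustrative sketch (not the Hugging Face source) of how the two masks differ and how they get combined inside self-attention:

```python
import torch

seq_len = 6
# Padding attention_mask passed in by the caller: 0 for left-padded positions,
# 1 for real tokens. This is what experiment.py builds.
attention_mask = torch.tensor([0., 0., 1., 1., 1., 1.])

# Causal (triangular) mask: query position i may only attend to keys j <= i.
# GPT-2 applies this internally, independently of attention_mask.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                    # raw query-key scores
scores = scores.masked_fill(~causal_mask, float("-inf"))  # enforce causality
scores = scores + (1.0 - attention_mask)[None, :] * -1e9  # drop padded keys (additive, HF-style)
weights = torch.softmax(scores, dim=-1)

# Rows belonging to padded query positions are meaningless, but the trainer's
# loss mask (> 0) discards those positions anyway.
print(weights)
```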

lericson commented 4 months ago

Hello, sorry to dig up an old issue like this, but where exactly is this triangular attention mask applied? We traced the execution path from trainer.py all the way to Attention.forward and could not find any triangular attention masking.

kzl commented 4 months ago

You have to continue going inside the forward of attention :)

The GPT-2 transformer in huggingface implements the causal triangular attention mask there. We don't implement it ourselves.
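
For anyone else tracing this: inside GPT2Attention, the causal mask comes from a lower-triangular buffer registered at init and applied in the attention computation, roughly along these lines (a paraphrased sketch; names and details vary by transformers version):

```python
import torch

# Paraphrase of the relevant pieces of GPT2Attention (not the exact source).
max_positions = 1024
# Registered once at init as a buffer (usually named "bias"):
causal_bias = torch.tril(
    torch.ones((max_positions, max_positions), dtype=torch.bool)
).view(1, 1, max_positions, max_positions)

def apply_causal_mask(attn_weights: torch.Tensor) -> torch.Tensor:
    """Mask out attention to future keys, as done inside the attention forward."""
    query_length, key_length = attn_weights.size(-2), attn_weights.size(-1)
    mask = causal_bias[:, :, key_length - query_length:key_length, :key_length]
    mask_value = torch.finfo(attn_weights.dtype).min
    return torch.where(mask, attn_weights,
                       torch.tensor(mask_value, dtype=attn_weights.dtype))

# Toy usage: batch 1, 1 head, 4 query positions, 4 key positions.
attn_weights = torch.randn(1, 1, 4, 4)
print(apply_causal_mask(attn_weights))
```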

lericson commented 3 months ago

We followed the code all the way to the actual MHA computations. At any rate, it turns out that the d3rlpy implementation used in https://github.com/takuseno/d3rlpy/blob/master/reproductions/offline/decision_transformer.py does apply causal attention (using torch.tril).

Sincerely, Ludvig
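
The explicit approach (building the triangular mask yourself and passing it into attention) looks roughly like this; an illustrative sketch using PyTorch's built-in multi-head attention, not the d3rlpy source:

```python
import torch
import torch.nn as nn

# Illustrative sketch of explicit causal masking, in the spirit of a
# torch.tril/triu-based implementation; not the d3rlpy source.
embed_dim, num_heads, seq_len, batch_size = 64, 4, 10, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(batch_size, seq_len, embed_dim)

# Boolean attn_mask for nn.MultiheadAttention: True marks positions that are
# NOT allowed to attend, so a causal mask is the strict upper triangle.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

out, _ = mha(x, x, x, attn_mask=causal_mask, need_weights=False)
print(out.shape)  # (2, 10, 64)
```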
