kzl / decision-transformer

Official codebase for Decision Transformer: Reinforcement Learning via Sequence Modeling.
MIT License

Padding / attention_mask questions #36

Closed DaveyBiggers closed 2 years ago

DaveyBiggers commented 2 years ago

Hi, thanks for making your code available. I'm trying to wrap my head around the padding in your gym implementation.

In this code: https://github.com/kzl/decision-transformer/blob/c9e6ac0b75895cef3e7c06cd309fd398ec9ceef5/gym/experiment.py#L154 you are padding your inputs on the left, and creating an attention_mask so that the model will ignore the padding.
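For concreteness, here is a minimal sketch of what that padding scheme does (simplified from the linked get_batch code; the shapes here are illustrative, not the repo's exact values):

```python
import numpy as np

# Hypothetical shapes for illustration: K is the context length; a trajectory
# segment of length tlen <= K is padded on the LEFT up to K.
K, state_dim, tlen = 20, 17, 12
states = np.random.randn(1, tlen, state_dim)           # real (unpadded) segment

pad = np.zeros((1, K - tlen, state_dim))               # left padding
states_padded = np.concatenate([pad, states], axis=1)  # shape (1, K, state_dim)

# attention_mask matches the layout above: 0 over padding, 1 over real tokens
attention_mask = np.concatenate(
    [np.zeros((1, K - tlen)), np.ones((1, tlen))], axis=1
)
```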

According to this (possibly out-of-date?) comment on the Hugging Face repo, GPT should ideally be padded on the right, and then the causal masking will take care of making sure nothing is conditioned on the padding values, making the attention_mask unnecessary:

GPT-2 is a model with absolute position embeddings (like Bert) so you should always pad on the right to get best performances for this model (will add this information to the doc_string).

As it's a causal model (only attend to the left context), also means that the model will not attend to the padding tokens (which are on the right) for any real token anyway.

So in conclusion, no need to take special care of avoiding attention on padding.

Just don't use the output of the padded tokens for anything as they don't contain any reliable information (which is obvious I hope).

(see https://github.com/huggingface/transformers/issues/808#issuecomment-522932583 )

Can you explain the rationale behind the padding scheme? Or am I just getting the wrong end of the stick? Cheers!

kzl commented 2 years ago

Since we remove the absolute position encodings implemented in the original GPT-2 code and instead add our own based on the episodic timestep here, the first point is irrelevant.
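Roughly, the idea is the following (a simplified sketch, not the exact model code; the sizes are placeholders):

```python
import torch
import torch.nn as nn

hidden_size, max_ep_len, state_dim, act_dim = 128, 1000, 17, 6  # placeholder sizes

embed_timestep = nn.Embedding(max_ep_len, hidden_size)  # learned episodic time encoding
embed_state = nn.Linear(state_dim, hidden_size)
embed_action = nn.Linear(act_dim, hidden_size)
embed_return = nn.Linear(1, hidden_size)

def embed_tokens(states, actions, returns_to_go, timesteps):
    # Each modality gets the SAME timestep embedding added, based on the
    # episodic timestep (a LongTensor), rather than GPT-2's absolute position
    # within the padded window.
    t = embed_timestep(timesteps)
    return (embed_state(states) + t,
            embed_action(actions) + t,
            embed_return(returns_to_go) + t)
```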

The choice to pad left or right is arbitrary, with only minor implementation effects, so you could change the padding to the right if you want.

As to the second and third points, you still need some kind of padding if you want to process inputs in a batch, or else the tensors in the batch will have different lengths.
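For instance (a minimal illustration, not the repo's code):

```python
import torch

# Three trajectory segments of different lengths cannot be stacked directly:
segments = [torch.randn(length, 17) for length in (5, 12, 20)]
# torch.stack(segments)  # would raise an error: sizes along dim 0 differ

# Left-pad each segment to a common length so they form a single batch tensor,
# and build a matching attention mask (0 = padding, 1 = real token).
max_len = max(seg.shape[0] for seg in segments)
batch = torch.stack([
    torch.cat([torch.zeros(max_len - seg.shape[0], 17), seg]) for seg in segments
])
mask = torch.stack([
    torch.cat([torch.zeros(max_len - seg.shape[0]), torch.ones(seg.shape[0])])
    for seg in segments
])
```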

CeyaoZhang commented 1 year ago

(quoting DaveyBiggers's original question above)

At first glance, padding from the left also seemed strange to me, but then I realized that it doesn't matter whether you pad from the left or the right. The key point is to be consistent with the mask. This mask is the attention_mask input to GPT2Model, where 1 means the token is not masked and 0 means it is masked. This attention_mask input is different from the causal (lower-triangular) attention matrix built into the GPT-2 model.
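For example, a minimal sketch using Hugging Face's GPT2Model directly (the repo uses its own modified GPT-2, so this only illustrates the mask convention; the config sizes are placeholders):

```python
import torch
from transformers import GPT2Config, GPT2Model

# Tiny config just for illustration; the real model uses its own hyperparameters.
model = GPT2Model(GPT2Config(n_embd=32, n_layer=1, n_head=2))

inputs_embeds = torch.randn(1, 20, 32)                  # (batch, seq_len, hidden)
attention_mask = torch.cat([torch.zeros(1, 8),          # 0 = padding (ignored)
                            torch.ones(1, 12)], dim=1)  # 1 = real tokens

# attention_mask only hides the padded positions; the causal (triangular)
# mask that prevents attending to future positions is applied internally by GPT-2.
out = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```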