kaistAI / LangBridge

[ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision
https://arxiv.org/abs/2401.10695

Padding side of encoder and LM? #2

Closed fahadh4ilyas closed 4 months ago

fahadh4ilyas commented 5 months ago

What is the padding side of the encoder tokenizer and the LM tokenizer? I guess the padding side is left for the encoder and right for the LM. But then the bos token from the LM sequence is placed at the very left of the concatenated embeddings, which seems odd because the bos token's position relative to the rest of the sequence becomes inconsistent. Is there any issue with generation if we generate in batches?
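To make my concern concrete, here is a toy illustration (made-up tokens, not the actual LangBridge code):

# Toy illustration of my concern (made-up tokens, not the actual code)
enc_tokens = ["e1", "e2", "<pad>", "<pad>"]   # encoder side, padded
lm_tokens  = ["<bos>", "t1", "t2", "<pad>"]   # LM side, padded

# If <bos> is moved to the very left of the concatenated sequence,
# the encoder padding ends up between <bos> and the LM tokens:
concat = [lm_tokens[0]] + enc_tokens + lm_tokens[1:]
print(concat)  # ['<bos>', 'e1', 'e2', '<pad>', '<pad>', 't1', 't2', '<pad>']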

rahular commented 5 months ago

IIRC both tokenizers have right padding. You can check that as follows

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained('kaist-ai/langbridge_encoder_tokenizer')
>>> tok.padding_side
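You can run the same check on the LM tokenizer you pair it with (the path below is just a placeholder for whichever LM you use):

>>> lm_tok = AutoTokenizer.from_pretrained('your-lm-tokenizer-path')  # placeholder path
>>> lm_tok.padding_side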
fahadh4ilyas commented 5 months ago

> IIRC both tokenizers have right padding. You can check that as follows
>
> >>> from transformers import AutoTokenizer
> >>> tok = AutoTokenizer.from_pretrained('kaist-ai/langbridge_encoder_tokenizer')
> >>> tok.padding_side

So if we train the model in batches with inputs of different lengths, some sequences will have random padding in the middle (from the encoder's padding)? Will that make the model inconsistent when generating in batches, and will short sequences suffer if they are somehow trained alongside long ones?

MattYoon commented 5 months ago

Hi @fahadh4ilyas, as @rahular said, they are both padded on the right.

Yes, during training some sequences will have padding in the middle. But those "middle paddings" are masked for the LM, so it never attends to them.
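Conceptually it looks something like this (a minimal sketch with made-up tensors, not the actual implementation):

import torch

# Minimal sketch (made-up tensors, not the actual LangBridge code):
# the encoder outputs and LM embeddings are concatenated along the sequence
# axis, and so are their attention masks, so padded encoder positions stay masked.
enc_mask = torch.tensor([[1, 1, 0, 0]])   # encoder side, right padded
lm_mask  = torch.tensor([[1, 1, 1, 0]])   # LM side, right padded

combined_mask = torch.cat([enc_mask, lm_mask], dim=1)
print(combined_mask)  # tensor([[1, 1, 0, 0, 1, 1, 1, 0]])
# Positions 2 and 3 are the "middle paddings"; their mask is 0,
# so the LM performs no attention over them.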

To address your concern about

> Will that make the model inconsistent when generating in batches, and will short sequences suffer if they are somehow trained alongside long ones?

We weren't able to find any problems with batch generation. The evaluation for HumanEval was run with a batch size of 32.

Just make sure to set use_dynamic_length to True during training. In our preliminary experiments, we found that training on input sequences of varying lengths is critical for robust generation at test time.
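Conceptually, the idea boils down to something like this toy sketch (a simplified illustration of varying the input length per example, not the exact code in the repo):

import random

# Toy sketch: instead of always padding/truncating to one fixed length,
# each example is truncated to a randomly sampled length, so the LM sees
# encoder prefixes of many different lengths during training.
def sample_length(min_len=8, max_len=128):
    return random.randint(min_len, max_len)

def truncate_ids(input_ids, max_len=128):
    return input_ids[: sample_length(max_len=max_len)]

# Example: the same 200-token input gets a different length each time.
ids = list(range(200))
print(len(truncate_ids(ids)))  # varies between runs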