IIRC both tokenizers have right padding. You can check that as follows:

>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained('kaist-ai/langbridge_encoder_tokenizer')
>>> tok.padding_side
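In case downloading the checkpoint isn't an option, here is what right padding means in practice, as a toy sketch with no model download (the `right_pad` helper is hypothetical, not part of transformers): pad ids are appended after the real tokens, and the attention mask marks them with 0.

```python
# Toy illustration of right padding: pad tokens are appended after the
# real tokens, and the attention mask flags them with 0.
PAD_ID = 0

def right_pad(batch, pad_id=PAD_ID):
    """Pad a batch of token-id lists on the right to the max length."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

ids, mask = right_pad([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```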
So if we train the model on a batch with different input lengths, some sequences will have random padding in the middle (from the encoder's padding)? Will it make the model inconsistent when generating in batch, and will short sequences suffer if they're trained alongside long sequences?
Hi @fahadh4ilyas, as @rahular said, they are both padded on the right.
Yes, during training some sequences will have padding in the middle. But those "middle paddings" are masked for the LM, so it doesn't attend to them.
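To make that concrete, here is a rough sketch (the `combine_masks` helper is hypothetical, not the repo's actual code) of how concatenating the encoder segment's attention mask with the LM segment's mask leaves the encoder's pad positions as zeros in the middle of the combined sequence, so the LM never attends to them:

```python
# Hypothetical sketch: concatenate encoder and LM attention masks so that
# pad positions inside the encoder segment stay masked (0) for the LM.
def combine_masks(enc_mask, lm_mask):
    """enc_mask / lm_mask: per-item lists of 0/1. The encoder segment comes
    first, so its right-padding ends up in the middle of the combined mask."""
    return [e + l for e, l in zip(enc_mask, lm_mask)]

# Item 1's encoder input is right-padded by one token; item 2's is not.
enc_mask = [[1, 1, 0], [1, 1, 1]]
lm_mask = [[1, 1, 1], [1, 1, 1]]
print(combine_masks(enc_mask, lm_mask))
# [[1, 1, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1]] -> the 0 sits in the middle
```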
To address your concern:

> Will it make the model inconsistent when generating in batch, and will short sequences suffer if they're trained alongside long sequences?

We weren't able to find any problems with batch generation. The evaluation for HumanEval was run with a batch size of 32.
Just make sure to set `use_dynamic_length` to `True` during training. In our preliminary experiments, we found that exposing the model to varied input sequence lengths is critical for robust generation at test time.
What is the padding side of the encoder tokenizer and of the LM tokenizer? I guess the padding side is left for the encoder and right for the LM. But then the bos token of the LM sequence would sit at the far left of the concatenated embedding, which seems weird because the bos token's position relative to the sequence would be inconsistent. Is there any issue when generating in batch?