tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Question about padding the input sequence #294

Open BaleChen opened 1 year ago

BaleChen commented 1 year ago

https://github.com/tatsu-lab/stanford_alpaca/blob/761dc5bfbdeeffa89b8bff5d038781a4055f796a/train.py#L90-L99

In this snippet of code, from what I understand, no padding is actually added, since using "longest" mode on a single sequence is equivalent to adding no padding, as per this doc. Is that right? So the padding for each prompt is added by the data collator instead of here.

I wonder if it would be clearer to just write `padding=False` here, or to add a comment about it.
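For reference, a minimal sketch (not the repo's code; `"gpt2"` is just a placeholder tokenizer) showing that `padding="longest"` on a single string adds no pad tokens, i.e. it behaves the same as `padding=False`:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer for illustration
tokenizer.pad_token = tokenizer.eos_token          # gpt2 has no pad token by default

text = "Below is an instruction that describes a task."

longest = tokenizer(text, padding="longest", return_tensors="pt")
no_pad = tokenizer(text, padding=False, return_tensors="pt")

# Identical ids: padding to the "longest" of a single sequence is a no-op.
assert torch.equal(longest["input_ids"], no_pad["input_ids"])
```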

srhthu commented 1 year ago

I think so. Actually, they do dynamic padding via `DataCollatorForSupervisedDataset`. My concern is: should the padding tokens be on the left rather than the right? The other repo, https://github.com/tloen/alpaca-lora, pads on the left, which makes sense for batch training.
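For context, a simplified sketch of the kind of per-batch (dynamic) padding that collator does (the real class is more involved; `IGNORE_INDEX` follows the usual -100 convention). `pad_sequence` appends padding on the right, which is why the resulting batches end up right-padded:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def collate(input_ids, labels, pad_token_id):
    # input_ids / labels: lists of 1-D tensors with varying lengths
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=pad_token_id)
    labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
    attention_mask = input_ids.ne(pad_token_id)  # 0 on pad positions
    return dict(input_ids=input_ids, labels=labels, attention_mask=attention_mask)
```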

maksimstw commented 1 year ago

Agree with @srhthu. I think left padding makes more sense, but train.py uses right padding instead. I think the code they use to train Alpaca is simply not correct for batch training. See the explanation here.
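For reference, the usual place left padding matters is batched generation with a decoder-only model, so that every prompt ends at its last real token and generation continues from there. A hedged sketch (`"gpt2"` is only a placeholder model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # pad on the left for batched inference

prompts = ["Tell me a joke.", "Summarize the following paragraph:"]
batch = tokenizer(prompts, padding=True, return_tensors="pt")
out = model.generate(**batch, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```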

BaleChen commented 1 year ago

Hey @maksimstw,

My previous understanding is that batch inference with decoder-only models requires left padding, but at the fine-tuning stage, right-side padding is fine as long as we set the attention mask correctly and set pad-token labels to -100 when computing the loss.
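A minimal sketch of that setup (`"gpt2"` is only a placeholder model): with right padding, pad positions get attention mask 0 and label -100, so they contribute nothing to the loss:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

batch = tokenizer(
    ["a short example", "a somewhat longer example sentence"],
    padding=True,            # right padding, the tokenizer default
    return_tensors="pt",
)
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100  # ignore pad positions in the loss

loss = model(**batch, labels=labels).loss  # loss only sees the real tokens
```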

Is it the case that we can simply use left padding for both training and inference in generation tasks?