
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Do we need a config to change `padding_side='left'` before the evaluation? #31672

Open gary-young opened 3 days ago

gary-young commented 3 days ago

Feature request

I am trying to train a Llama model (a decoder-only model). I want to evaluate my model not only with the loss but also with some generation-based metric. For example, an item in my eval dataset could be the string `1+2=`. I use the `Seq2SeqTrainer`, which provides a modified `prediction_step`, so I can get the model's predictions in the `EvalPrediction`. Then I write my eval code in a `compute_metrics` function and pass it to the `Seq2SeqTrainer`.

The problem is the `padding_side` of the tokenizer. For training, the tokenizer should use right padding (the default setting for Llama). However, for evaluation the tokenizer must switch to left padding, because the model needs to generate. I cannot find an easy way to do this without changing the source code of the trainer (for example, the `get_eval_dataloader` method of the `Trainer`).
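To make the issue concrete, here is a minimal sketch of how the two padding sides lay out a batch of prompts (the checkpoint name is only a placeholder; any Llama-style tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute whatever model you are training.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

batch = ["1+2=", "123+456="]

# Right padding (the training default): pad tokens come *after* the prompt,
# so generate() would have to continue from padding.
tokenizer.padding_side = "right"
right = tokenizer(batch, padding=True, return_tensors="pt")

# Left padding (needed for batched generation): every prompt ends at the
# final position, so generate() continues from the real last token.
tokenizer.padding_side = "left"
left = tokenizer(batch, padding=True, return_tensors="pt")

print(right["input_ids"][0])  # prompt ids first, pad ids at the end
print(left["input_ids"][0])   # pad ids first, prompt ends at the last slot
```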

My questions are:

  1. Is this the correct way to evaluate a decoder-only model with generation-based metrics? Should I use the `Seq2SeqTrainer`, or is there some other method I have not found? (Is there an example doc?)
  2. Can I train a model with right padding but evaluate it with left padding? If not, how should I evaluate models like Llama?
  3. If my evaluation process is correct, how can I change the `padding_side` to left at the beginning of evaluation and back to right afterwards? (I think the problem could be solved with separate training and evaluation data collators; see the sketch after this list. Is that possible with the current transformers `Trainer`, or is there another way to implement it?)
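For reference, the most direct workaround I can think of for question 3 would be to toggle the attribute manually around the calls. This is only a sketch: it assumes evaluation is triggered by hand (not during `trainer.train()`) and that padding happens lazily at collate time, e.g. with `DataCollatorWithPadding`:

```python
# Manual toggle: works because DataCollatorWithPadding reads
# tokenizer.padding_side at collate time, during the evaluation loop.
tokenizer.padding_side = "right"   # Llama's training default
trainer.train()

tokenizer.padding_side = "left"    # generation needs left padding
metrics = trainer.evaluate()

tokenizer.padding_side = "right"   # restore before any further training
```

This breaks down when evaluation runs inside `trainer.train()`, which is why separate collators would help.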

Motivation

Motivation: generation-based evaluation when training a decoder-only autoregressive model like Llama.

Your contribution

I am not sure how I can help.

zucchini-nlp commented 3 days ago

Hey!

Computing metrics with generation for decoder-only models does not currently work. See #26474 and the linked issues requesting the feature.

I am planning to work on it next week :)

gary-young commented 1 day ago

@zucchini-nlp Thank you! For now, I have implemented it as follows (see the sketch after this list):

  1. removing the answer part from the validation (and test) dataset,
  2. using the `Seq2SeqTrainer` instead of the `Trainer`, since it overrides the `prediction_step` function so that its output contains the predictions (it is a bit hacky because it replaces the original logits with the generated token ids),
  3. computing my metrics from those predictions in my own `compute_metrics` function.
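Roughly, the setup looks like this (`model`, `tokenizer`, and the datasets are assumed to be defined already; the metric is a placeholder):

```python
import numpy as np
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

def compute_metrics(eval_pred):
    # With predict_with_generate=True the predictions are generated
    # token ids rather than logits.
    preds = eval_pred.predictions
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # ... compare `decoded` against the reference answers here ...
    return {"exact_match": 0.0}  # placeholder

args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,  # makes prediction_step call generate()
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,   # answers already stripped (step 1)
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
```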

So far it seems to work. I solved the `padding_side` problem by also overriding the `get_eval_dataloader` and `get_test_dataloader` methods to swap the dataloader (more specifically, the data collator), roughly as sketched below.
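Roughly (the class name and the `eval_data_collator` argument are my own invention; the trick relies on the parent methods reading `self.data_collator` while they build the `DataLoader`):

```python
from transformers import Seq2SeqTrainer

class LeftPadEvalTrainer(Seq2SeqTrainer):
    """Sketch: train with the right-padding collator, but hand the eval and
    test dataloaders a separate left-padding collator."""

    def __init__(self, *args, eval_data_collator=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.eval_data_collator = eval_data_collator

    def get_eval_dataloader(self, eval_dataset=None):
        train_collator = self.data_collator
        self.data_collator = self.eval_data_collator  # swap in
        try:
            return super().get_eval_dataloader(eval_dataset)
        finally:
            # The built DataLoader keeps its own reference to the eval
            # collator, so restoring here does not affect it.
            self.data_collator = train_collator

    def get_test_dataloader(self, test_dataset):
        train_collator = self.data_collator
        self.data_collator = self.eval_data_collator
        try:
            return super().get_test_dataloader(test_dataset)
        finally:
            self.data_collator = train_collator
```

The left-padding collator can be, for example, a `DataCollatorWithPadding` built from a second tokenizer instance whose `padding_side` is `"left"`.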

I am not sure this is the correct way to implement generation-based evaluation, but it seems to work.

gary-young commented 1 day ago

@zucchini-nlp Oh, but my implementation has a new problem: because the input sequences have been manually truncated, the eval_loss does not make sense.