gary-young opened 3 days ago
Hey!
Compute metrics with generation for decoder-only models does not work currently. See #26474 and the linked issues requesting the feature.
I am planning to work on it next week :)
@zucchini-nlp Thank you! For now I implement it by:

1. Modifying the `prediction_step` function so that its output contains the prediction. (It is tricky because it replaces the original logits with the generated token_ids.)
2. Writing my eval code in the `compute_metrics` function.

So far it seems to work. I solve the padding_side problem by also modifying the `get_test_dataloader` and `get_eval_dataloader` functions to change the dataloader (more specifically, the data collator).
I am not sure this is the correct way to implement generation-based evaluation, but it seems to work; a minimal sketch of the idea is below.
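Here is a sketch of that approach, assuming a plain `Trainer` subclass. The class name `GenerationEvalTrainer`, the `eval_tokenizer` argument, and `max_new_tokens=32` are illustrative assumptions, not code from this thread:

```python
from transformers import DataCollatorWithPadding, Trainer


class GenerationEvalTrainer(Trainer):
    def __init__(self, *args, eval_tokenizer=None, **kwargs):
        super().__init__(*args, **kwargs)
        # eval_tokenizer: a copy of the training tokenizer with
        # padding_side="left", which batched decoder-only generation needs.
        self._eval_collator = DataCollatorWithPadding(eval_tokenizer)

    def get_eval_dataloader(self, eval_dataset=None):
        # Swap in the left-padding collator only while the dataloader is
        # built; the DataLoader keeps its own reference to the collate_fn,
        # so restoring the training collator afterwards is safe.
        train_collator = self.data_collator
        self.data_collator = self._eval_collator
        try:
            return super().get_eval_dataloader(eval_dataset)
        finally:
            self.data_collator = train_collator

    def prediction_step(self, model, inputs, prediction_loss_only, ignore_keys=None):
        # Compute the loss as usual, then replace the logits with the
        # generated token ids so they reach compute_metrics.
        loss, _, _ = super().prediction_step(
            model, inputs, prediction_loss_only=True, ignore_keys=ignore_keys
        )
        if prediction_loss_only:
            return loss, None, None
        generated = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=32,  # assumption: task-dependent budget
        )
        return loss, generated, inputs.get("labels")
```

A matching override of `get_test_dataloader` would handle `predict` the same way.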
@zucchini-nlp Oh, but my implementation has a new problem: because the input sequences have been manually truncated, the eval_loss does not make sense.
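To make the concern concrete, a tiny illustration (a sketch; `gpt2` is a stand-in for the actual model): when the eval examples are bare prompts, the labels the loss sees are the prompt tokens themselves.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumption: "gpt2" stands in for the model actually being evaluated
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

enc = tok("1+2=", return_tensors="pt")
# With prompt-only eval data, labels are just the prompt tokens, so the
# reported eval_loss scores next-token prediction on "1+2=" itself and
# says nothing about the generated answer.
out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
print(out.loss)
```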
Feature request
I am trying to train a Llama model (a decoder-only model). I want to evaluate my model not only with the loss but also with some generation-based metric. For example, an eval example could be a string such as `1+2=`, and I use the `Seq2SeqTrainer`, which provides the modified prediction step, so I can get the model's prediction in the `EvalPrediction`. Then I write my eval code in the `compute_metrics` function and pass it to the `Seq2SeqTrainer`.

The problem is the padding_side of the tokenizer. Because I need to train the model, the tokenizer should use right padding for the training dataset (that is the default setting for Llama). However, when I evaluate the model, the tokenizer should be switched to left padding because I need the model to generate. I have not found an easy way to do this without changing the source code of the trainer (for example, the `get_eval_dataloader` method of the `Trainer`).

My question is: is there an easy way to switch the tokenizer to left padding for evaluation without overriding the trainer?
Motivation
Generation-based evaluation when training a decoder-only autoregressive model like Llama.
Your contribution
I am not sure how I can help.