Open steremma opened 9 months ago
Related issue: https://github.com/huggingface/trl/issues/1222
For causal LM fine-tuning with instruction tuning (i.e., completion-only training) I use the SFTTrainer from the trl library, and it suffers from the same problem. Perhaps the solution I suggested there would provide some useful insights.
I don't know why this issue was closed; this feature is really necessary to avoid wasting computation and to easily select the best possible model. We should have something that works well across different generative models.
At the moment, most causal models generate predictions whose shape differs from the labels, which causes an error when computing various metrics.
cc @gante @zucchini-nlp @muellerzr
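To make the shape problem concrete, here is a minimal sketch with synthetic arrays (no real model; `align_for_metrics` and `pad_id=0` are assumptions for illustration): generated sequences are padded to roughly prompt length plus `max_new_tokens`, while labels are padded to the training sequence length, so a naive element-wise comparison in `compute_metrics` fails. One common workaround is to pad both sides to a common width first:

```python
import numpy as np

def align_for_metrics(preds, labels, pad_id=0):
    """Right-pad predictions and labels to a common width so token-level
    metrics can be computed. `pad_id` is an assumed pad token id; real code
    should use tokenizer.pad_token_id and also strip the prompt tokens."""
    labels = np.where(labels == -100, pad_id, labels)  # -100 label masks -> pad
    width = max(preds.shape[1], labels.shape[1])
    pad = lambda a: np.pad(a, ((0, 0), (0, width - a.shape[1])),
                           constant_values=pad_id)
    return pad(preds), pad(labels)

# Synthetic example: generated sequences are longer than the padded labels.
preds = np.random.randint(1, 100, size=(2, 26))   # prompt + max_new_tokens
labels = np.random.randint(1, 100, size=(2, 24))  # training sequence length
p, l = align_for_metrics(preds, labels)
assert p.shape == l.shape == (2, 26)
```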
Same issue as #31462, and I am very much in favor of having this as a feature. After trying to tune VLMs with the HF Trainer, I noticed several points that could be improved.
The main issue here, as I see it, is that decoder-only models need different inputs for generation and for loss calculation. I can propose two options:
1. Derive the generation inputs by cropping `input_ids`, since I think passing both the whole `input_ids` and a separate `generation_input_ids` is kinda double work. Maybe this can be part of the `DataCollatorForLanguageModeling`, activated by a flag `predict_with_generate`.
2. Add a new collator, e.g. a `DataCollatorForCompletionTask`, that will return `generation_input_ids` along with the usual inputs.

Let me know what you think :)
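The cropping idea could be sketched roughly like this (all names here are hypothetical, not an existing transformers API). It relies on the usual convention that prompt tokens carry the label `-100`, and assumes for simplicity that the labels contain no padding:

```python
import numpy as np

def build_generation_inputs(input_ids, labels, pad_token_id=0):
    """Sketch of the proposed cropping (hypothetical helper, not a real
    transformers API): slice the prompt out of each example using the -100
    label mask, then LEFT-pad, since decoder-only models generate from the
    right edge of the prompt. Assumes labels contain no padding."""
    prompts = [ids[: int((lab == -100).sum())]
               for ids, lab in zip(input_ids, labels)]
    width = max(len(p) for p in prompts)
    out = np.full((len(prompts), width), pad_token_id, dtype=np.int64)
    for i, p in enumerate(prompts):
        out[i, width - len(p):] = p  # left padding
    return out

batch_ids = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
batch_labels = np.array([[-100, -100, 3, 4], [-100, 6, 7, 8]])
gen_ids = build_generation_inputs(batch_ids, batch_labels)
assert gen_ids.tolist() == [[1, 2], [0, 5]]  # prompts of len 2 and 1, left-padded
```

A collator flag like the proposed `predict_with_generate` could then simply attach this array to the batch as `generation_input_ids`.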
@zucchini-nlp asking our internal fine-tune folks for suggestions here would be great, I'm sure they have opinions about it (given the related thread in TRL)!
btw, let me know if you have bandwidth to take this issue. Since there are a LOT of people requesting this, I'm happy to take it if you're low on bandwidth :)
@gante definitely, I will post on our internal Slack with other related issues I encountered on my way to tuning VLMs. And yes, I can take this on and open a PR next week (or after I come back in July) :)
(@zucchini-nlp assigning to you as per your comment above 🤗 )
Feature request
Besides loss, users often need to report additional metrics throughout training in order to drive decision making and communicate results. For Seq2Seq models this is elegantly done with the `compute_metrics` argument of the `Trainer`: generative metrics fit this framework easily by setting `predict_with_generate=True`. The same is much less straightforward with a causal underlying LM. The only "working" approach I found is this: https://github.com/huggingface/transformers/blob/5e11d72d4d0939138fbabfebe9a69d2061519547/examples/pytorch/language-modeling/run_clm.py#L578

But I think this is an erroneous calculation: the `logits.argmax(dim=-1)` call does not really generate in inference mode; it "cheats" because of teacher forcing, and therefore any metric computed that way is probably inflated. Ideally, the argument passed to `compute_metrics` would include a `predictions` attribute that has been properly generated using the trainer's generation config.
Motivation
I am always frustrated when I can't observe the learning trajectory of my generative metric (say BLEU/ROUGE) when using a causal LM, even though it is trivial to do with a Seq2Seq model.
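To illustrate why teacher-forced `argmax` inflates metrics relative to true generation, here is a toy deterministic next-token "model" (pure illustration, no real LM): under teacher forcing each step conditions on the gold prefix, so a single error stays local, while in free-running generation the model conditions on its own output and the error compounds:

```python
def toy_next_token(prefix):
    """Toy deterministic 'model': predicts last token + 1,
    but makes a single mistake whenever the last token is 3."""
    return 0 if prefix[-1] == 3 else prefix[-1] + 1

reference = [1, 2, 3, 4, 5, 6]
token_acc = lambda p, r: sum(a == b for a, b in zip(p, r)) / len(r)

# Teacher forcing (what argmax over training-time logits measures):
# each step sees the *gold* prefix, so the one mistake stays local.
tf_preds = [toy_next_token(reference[:i]) for i in range(1, len(reference))]

# Free-running generation (what .generate() measures):
# each step sees the model's *own* prefix, so the mistake compounds.
gen = [reference[0]]
for _ in range(len(reference) - 1):
    gen.append(toy_next_token(gen))

assert token_acc(tf_preds, reference[1:]) == 0.8  # inflated by teacher forcing
assert token_acc(gen[1:], reference[1:]) == 0.4   # true generative quality
```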
Your contribution
If you confirm that this is an issue important enough to justify a fix, I may be able to make a PR, but I can't promise it.