Open steremma opened 9 months ago
Related issue: https://github.com/huggingface/trl/issues/1222
For causal LM fine-tuning with instruction tuning (i.e., completion-only training) I use the SFTTrainer from the trl library, and it suffers from the same problem. Perhaps the solution I suggested there would provide some useful insights.
I don't know why this issue was closed; this feature is really necessary to avoid wasting computation and to easily select the best possible model. We should have something that works well across different generative models.
At the moment, most causal models generate predictions whose shape differs from the labels, which causes an error when computing various metrics.
cc @gante @zucchini-nlp @muellerzr
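To make the shape problem concrete, here is a minimal sketch with synthetic arrays (no real model; `align_for_metrics` and `pad_id=0` are assumptions for illustration): generated sequences are padded to roughly prompt length plus `max_new_tokens`, while labels are padded to the training sequence length, so a naive element-wise comparison in `compute_metrics` fails. One common workaround is to pad both sides to a common width first:

```python
import numpy as np

def align_for_metrics(preds, labels, pad_id=0):
    """Right-pad predictions and labels to a common width so token-level
    metrics can be computed. `pad_id` is an assumed pad token id; real code
    should use tokenizer.pad_token_id and also strip the prompt tokens."""
    labels = np.where(labels == -100, pad_id, labels)  # -100 label masks -> pad
    width = max(preds.shape[1], labels.shape[1])
    pad = lambda a: np.pad(a, ((0, 0), (0, width - a.shape[1])),
                           constant_values=pad_id)
    return pad(preds), pad(labels)

# Synthetic example: generated sequences are longer than the padded labels.
preds = np.random.randint(1, 100, size=(2, 26))   # prompt + max_new_tokens
labels = np.random.randint(1, 100, size=(2, 24))  # training sequence length
p, l = align_for_metrics(preds, labels)
assert p.shape == l.shape == (2, 26)
```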
Same issue as #31462, and I am very much in favor of having this as a feature. After trying to tune VLMs with the HF Trainer, I noticed several points that could be improved.
The main issue here, as I see it, is that decoder-only models need different inputs for generation and for loss calculation. I can propose two options:
1. Derive the generation inputs by cropping `input_ids`, since I think passing both the whole `input_ids` and a separate `generation_input_ids` is kinda double work. Maybe this can be part of the `DataCollatorForLanguageModeling`, activated by a flag `predict_with_generate`.
2. Add a new collator, e.g. a `DataCollatorForCompletionTask`, that will return `generation_input_ids` along with the usual inputs.

Let me know what you think :)
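The cropping idea could be sketched roughly like this (all names here are hypothetical, not an existing transformers API). It relies on the usual convention that prompt tokens carry the label `-100`, and assumes for simplicity that the labels contain no padding:

```python
import numpy as np

def build_generation_inputs(input_ids, labels, pad_token_id=0):
    """Sketch of the proposed cropping (hypothetical helper, not a real
    transformers API): slice the prompt out of each example using the -100
    label mask, then LEFT-pad, since decoder-only models generate from the
    right edge of the prompt. Assumes labels contain no padding."""
    prompts = [ids[: int((lab == -100).sum())]
               for ids, lab in zip(input_ids, labels)]
    width = max(len(p) for p in prompts)
    out = np.full((len(prompts), width), pad_token_id, dtype=np.int64)
    for i, p in enumerate(prompts):
        out[i, width - len(p):] = p  # left padding
    return out

batch_ids = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
batch_labels = np.array([[-100, -100, 3, 4], [-100, 6, 7, 8]])
gen_ids = build_generation_inputs(batch_ids, batch_labels)
assert gen_ids.tolist() == [[1, 2], [0, 5]]  # prompts of len 2 and 1, left-padded
```

A collator flag like the proposed `predict_with_generate` could then simply attach this array to the batch as `generation_input_ids`.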
@zucchini-nlp asking our internal fine-tune folks for suggestions here would be great, I'm sure they have opinions about it (given the related thread in TRL)!
btw, let me know if you have bandwidth to take this issue. Since there are a LOT of people requesting this, I'm happy to take it if you're low on bandwidth :)
@gante definitely, I will post on our internal Slack with other related issues I encountered on my way to tuning VLMs. And yes, I can take this on and open a PR next week (or after I come back in July) :)
(@zucchini-nlp assigning to you as per your comment above 🤗 )
Feature request
Besides loss, users often need to report additional metrics throughout training in order to drive decision making and communicate results. For Seq2Seq models this is elegantly done with the `compute_metrics` argument of the `Trainer`: generative metrics fit this framework easily by setting `predict_with_generate=True`. The same is much less straightforward with a causal underlying LM. The only "working" approach I found is this: https://github.com/huggingface/transformers/blob/5e11d72d4d0939138fbabfebe9a69d2061519547/examples/pytorch/language-modeling/run_clm.py#L578

But I think this is an erroneous calculation: the `logits.argmax(dim=-1)` call does not really generate in inference mode; it "cheats" because of teacher forcing, and therefore any metric computed that way is probably inflated. Ideally, the argument passed to `compute_metrics` would include a `predictions` attribute that has been properly generated using the trainer's generation config.
Motivation
I am always frustrated when I can't observe the learning trajectory of my generative metric (say BLEU/ROUGE) when using a causal LM, even though it is trivial to do with a Seq2Seq model.
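To illustrate why teacher-forced `argmax` inflates metrics relative to true generation, here is a toy deterministic next-token "model" (pure illustration, no real LM): under teacher forcing each step conditions on the gold prefix, so a single error stays local, while in free-running generation the model conditions on its own output and the error compounds:

```python
def toy_next_token(prefix):
    """Toy deterministic 'model': predicts last token + 1,
    but makes a single mistake whenever the last token is 3."""
    return 0 if prefix[-1] == 3 else prefix[-1] + 1

reference = [1, 2, 3, 4, 5, 6]
token_acc = lambda p, r: sum(a == b for a, b in zip(p, r)) / len(r)

# Teacher forcing (what argmax over training-time logits measures):
# each step sees the *gold* prefix, so the one mistake stays local.
tf_preds = [toy_next_token(reference[:i]) for i in range(1, len(reference))]

# Free-running generation (what .generate() measures):
# each step sees the model's *own* prefix, so the mistake compounds.
gen = [reference[0]]
for _ in range(len(reference) - 1):
    gen.append(toy_next_token(gen))

assert token_acc(tf_preds, reference[1:]) == 0.8  # inflated by teacher forcing
assert token_acc(gen[1:], reference[1:]) == 0.4   # true generative quality
```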
Your contribution
If you confirm that this is an issue important enough to justify a fix, I may be able to make a PR, but I can't promise it.