huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

A question about the SFTTrainer (also a theoretical question about SFT in general) #1083

Closed PradeepKadubandi closed 8 months ago

PradeepKadubandi commented 9 months ago

I have a general question about Supervised Fine Tuning (SFT) for Dialogue applications.

Should the SFT process use the same LM objective (next-token prediction) that is used in pre-training a language model?

The "Dialogue" task is predicting "assistant" tokens, right? Shouldn't the objective be predicting only those tokens? Is one way to do this is to set labels for only assistant tokens and ignore the labels on others?

The SFTTrainer implementation does not set labels - as far as I understand, this leads to the labels being cloned from the input_ids and shifted right (within the transformers code), i.e. the plain next-token prediction objective over the whole sequence.
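
To make the question concrete, here is a minimal sketch of the masking I have in mind (the token span indices are made up for illustration, nothing from TRL):

```python
import torch

# Toy example: 10 tokens, of which positions 6-9 are the assistant's reply.
input_ids = torch.arange(10)
assistant_spans = [(6, 10)]  # hypothetical (start, end) token ranges

# Default behaviour (no labels set): labels are a copy of input_ids, so the
# loss is computed on every token -- standard next-token prediction / CLM.
labels = input_ids.clone()

# The variant I am asking about: keep labels only on assistant tokens and
# set everything else to -100, which the cross-entropy loss ignores.
masked_labels = torch.full_like(input_ids, -100)
for start, end in assistant_spans:
    masked_labels[start:end] = input_ids[start:end]

print(masked_labels)  # tensor([-100, -100, -100, -100, -100, -100, 6, 7, 8, 9])
```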

More on a philosophical note - if the same objective as pre-training is used for SFT, why shouldn't that be called "Fine Tuning" the model (on a dialogue dataset, of course) rather than "Supervised Fine Tuning"? What am I missing? Is there a reference paper that explains this well, and what is the right approach to SFT for Dialogue applications?

PradeepKadubandi commented 9 months ago

It is not obvious, hence the question. For example, the InstructGPT paper mentions SFT but mainly points to the (seemingly) first attempt at SFT in this paper, which talks about a "Summarization" task but not a "Dialogue" task.

In that paper, where human labelers are asked to summarize and "Behavioral Cloning" is used to finetune the LLM to adapt to the task, I'd imagine that only the "Summary" section is treated as the label, not the entire prompt/document. Following that principle, for "Dialogue" tasks, I'd intuitively expect that only the "assistant" turns should be part of the labels.

lvwerra commented 9 months ago

We offer both options: doing "vanilla" CLM or masking out the user queries: https://huggingface.co/docs/trl/sft_trainer#advanced-usage
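
Roughly following the linked docs, the second option looks something like this (the model, dataset and "### Human:"/"### Assistant:" templates below are illustrative placeholders, not a prescription):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Illustrative model/dataset; the guanaco dataset marks turns with "### Human:" / "### Assistant:".
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

# Masks out everything except the assistant completions, so only those tokens
# contribute to the loss. Omit the collator entirely for the "vanilla" CLM objective.
collator = DataCollatorForCompletionOnlyLM(
    instruction_template="### Human:",
    response_template="### Assistant:",
    tokenizer=tokenizer,
    mlm=False,
)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    data_collator=collator,  # drop this line for vanilla CLM
)
trainer.train()
```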

I don't think there is a systematic distinction between fine-tuning, supervised fine-tuning or even instruction tuning. Just terms people use to essentially describe the same thing :)

PradeepKadubandi commented 9 months ago

Thank you for the pointer! DataCollatorForCompletionOnlyLM is good to know (and what I was looking for in a sense :-))

About the terms, yeah, I can see that they can be loosely interchangeable. Based on my literature reading, I have a view of how they are similar and how they are (or should be) different - but perhaps everyone has their own view/interpretation :-)

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.