Update: DPO doesn't even work on a plain code completion task (i.e., one where neither the input nor the output includes FIM special tokens) with the base model. As an example, here is the output generated by Qwen/Qwen2.5-Coder-0.5B for the following input:
```java
// Input:
protected RouteBuilder createRouteBuilder()throws Exception {
    return new RouteBuilder() {
// Output:
        @Override
        public void configure() throws Exception {
            from("direct:hello")
                .to("mock:hello");
        }
    };
}<|endoftext|>
```
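For reference, a completion like the one above can be reproduced with a plain `generate` call along these lines (the decoding settings are illustrative, not necessarily the exact ones used):

```python
# Minimal sketch for reproducing the base-model completion above.
# The prompt is the "// Input:" snippet; decoding settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    "protected RouteBuilder createRouteBuilder()throws Exception {\n"
    "    return new RouteBuilder() {\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, keeping special tokens visible so
# that stray <|fim_middle|> / <|endoftext|> markers show up in the text.
completion = tokenizer.decode(
    output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)
print(completion)
```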
And here is the output of the same model after applying DPO on about 3,000 instances, where the prompt is the input and chosen/rejected are the correct/wrong completions:
```java
// Input:
protected RouteBuilder createRouteBuilder()throws Exception {
    return new RouteBuilder() {
// Output:
public void configure() throws Exception {
<|fim_middle|>
<|fim_middle|>
<|fim_middle|><|endoftext|>
```
The model is completely broken after applying DPO.
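To make the setup concrete: each preference pair uses the code to complete as the prompt and a correct/wrong completion as chosen/rejected. The snippet below is a made-up illustration of that structure, not an actual instance from the dataset:

```python
# Illustrative structure of one preference pair (values are made up,
# not actual dataset instances): the prompt is the code to complete,
# chosen/rejected are a correct and a wrong completion.
pair = {
    "prompt": (
        "protected RouteBuilder createRouteBuilder()throws Exception {\n"
        "    return new RouteBuilder() {\n"
    ),
    "chosen": (
        "        @Override\n"
        "        public void configure() throws Exception {\n"
        "            from(\"direct:hello\").to(\"mock:hello\");\n"
        "        }\n"
        "    };\n"
        "}"
    ),
    "rejected": (
        "        public void configure() {\n"  # e.g. a non-compiling variant
        "    }\n"
        "}"
    ),
}
```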
System Info

Information

Tasks

An officially supported task in the examples folder

Reproduction
The script that I'm using is a slightly modified version of the official `dpo.py` example script. The task that I'm trying to train the model on is FIM, not a chat-related task, therefore I'm using a base model, not an instruct one.
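Stripped down, the setup follows the usual TRL DPO recipe; the sketch below is simplified and is not the exact script (the dataset path and hyperparameters are placeholders, and argument names may differ slightly across TRL versions):

```python
# Simplified sketch of the training setup (not the exact script; the
# dataset path, hyperparameters, and TRL argument names are assumptions).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-Coder-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference dataset with "prompt"/"chosen"/"rejected" columns,
# as in the instance examples below (the file path is a placeholder).
train_dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

training_args = DPOConfig(
    output_dir="qwen2.5-coder-0.5b-dpo",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```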
Command:

Instance examples from the dataset:
Outputs:
The training works fine, but the model is completely broken after training with DPO. It doesn't even generate syntactically valid completions, whereas the pre-trained model, or one fine-tuned with the same data, works just fine. For example, given the first input above, the model generates the following output:
And given the second input, the model generates the following output (note that replacing `<|fim_middle|>` in the input with the output does not produce compilable code):

Expected behavior
The resulting model trained with DPO should, at least, produce compilable code, without extra special tokens in the output. Needless to say, it should also improve performance.
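Concretely, a check along these lines should pass on the generated completions (the token list covers the standard Qwen2.5-Coder FIM markers; only `<|fim_middle|>` actually appears in the broken output above):

```python
# Sketch of the property the DPO-trained model should satisfy:
# completions should contain no FIM special tokens, since neither the
# prompts nor the training completions contain them.
FIM_TOKENS = ("<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>")

def has_stray_fim_tokens(completion: str) -> bool:
    return any(tok in completion for tok in FIM_TOKENS)

print(has_stray_fim_tokens('from("direct:hello").to("mock:hello");'))  # False
print(has_stray_fim_tokens("<|fim_middle|>"))                          # True
```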
Checklist