unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

How to use this as the reference policy? #1167

Open serendipity800 opened 2 weeks ago

serendipity800 commented 2 weeks ago

I'm trying to use unsloth, PPOTrainer, and vLLM to write an RLHF trainer. The problem is that calling model.eval() vs. model.train() actually gives different logprobs (it can't be dropout, since the model in training mode gives stable logprobs across repeated forward passes, and temperature isn't the issue). This can cause problems like a negative KL.
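
Roughly, the check I mean looks like the sketch below (not my actual trainer; the model name and prompt are placeholders): run the same forward pass with the model in train() and then eval() mode and compare the per-token logprobs.

```python
# Minimal repro sketch (placeholder checkpoint; the real PPO trainer is more involved).
# The same forward pass in train() vs eval() mode should give identical logprobs
# when dropout is off, but here they differ.
import torch
from unsloth import FastLanguageModel

model, tok = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",  # placeholder checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

inputs = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").to("cuda")

def token_logprobs(m):
    # Log-prob of each realized next token; no sampling, so temperature is irrelevant.
    with torch.no_grad():
        logits = m(**inputs).logits.float()
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    next_ids = inputs["input_ids"][:, 1:]
    return torch.gather(logp, -1, next_ids.unsqueeze(-1)).squeeze(-1)

model.train()
lp_train = token_logprobs(model)
model.eval()
lp_eval = token_logprobs(model)

# Expected to be ~0 when dropout is disabled; a visible gap here is what
# ends up producing the negative KL in the PPO loop.
print("max |train - eval| logprob diff:", (lp_train - lp_eval).abs().max().item())
```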

serendipity800 commented 2 weeks ago

In theory, vllm.generate, the unsloth model with training = True, and the unsloth model with training = False should all give the same logits under the same policy and the same generation settings (plus top_k = 0), but they are actually different (I copy the policy weights to vLLM before every rollout).
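
The comparison I'm doing is roughly the sketch below (illustrative only; it reuses `model` and `tok` from the previous snippet, assumes vLLM holds the same weights as the policy, and the exact structure of vLLM's logprob output varies by version):

```python
# Sketch of the rollout-vs-policy logprob comparison (assumes `model` and `tok`
# from the previous snippet, and that vLLM has been loaded with the same weights
# as the current policy).
import torch
from vllm import LLM, SamplingParams

prompt = "The quick brown fox"
llm = LLM(model="unsloth/Llama-3.2-1B-Instruct")  # placeholder; weights assumed synced with the policy
params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=32, logprobs=1)

out = llm.generate([prompt], params)[0].outputs[0]
gen_ids = list(out.token_ids)
# Recent vLLM versions return a list of {token_id: Logprob} dicts; older ones
# return plain floats, so adjust the extraction to your version.
vllm_lp = torch.tensor([d[t].logprob for d, t in zip(out.logprobs, gen_ids)], device="cuda")

# Recompute logprobs of the same generated tokens under the training-side model.
full_ids = tok(prompt, return_tensors="pt").input_ids[0].tolist() + gen_ids
ids = torch.tensor([full_ids], device="cuda")
with torch.no_grad():
    logits = model(input_ids=ids).logits.float()
logp = torch.log_softmax(logits[:, :-1], dim=-1)
policy_lp = torch.gather(logp, -1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0, -len(gen_ids):]

# With identical weights and full-vocabulary sampling (top_k disabled), these
# should agree up to numerical noise; in my runs they do not.
print("max |policy - vllm| logprob diff:", (policy_lp - vllm_lp).abs().max().item())
```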