Open serendipity800 opened 1 month ago
In theory, vllm.generate, the unsloth model with training = True, and the unsloth model with training = False should all give the same logits under the same policy and generation conditions (with top_k = 0), but in practice they differ (I copy the policy weights to vLLM before every rollout).
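To make the comparison concrete, here is roughly what I mean (a simplified sketch with a placeholder model, prompt, and settings, not my actual trainer): sample with vLLM and record its per-token logprobs, then re-score the exact same tokens with the policy's own forward pass and diff them.

```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
llm = LLM(model=model_name, dtype="bfloat16", gpu_memory_utilization=0.5)

# temperature=1.0 / top_p=1.0 / top_k=-1 so no sampling transform should reshape the distribution
params = SamplingParams(temperature=1.0, top_p=1.0, top_k=-1, max_tokens=32, logprobs=1)
prompt = "Explain RLHF in one sentence."
out = llm.generate([prompt], params)[0].outputs[0]
gen_ids = list(out.token_ids)
# depending on the vLLM version, logprob entries are Logprob objects or plain floats
vllm_lp = torch.tensor(
    [lp[t].logprob if hasattr(lp[t], "logprob") else lp[t] for lp, t in zip(out.logprobs, gen_ids)]
)

# re-score the same continuation with the policy itself (eval mode, no grad)
policy.eval()
prompt_ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
full_ids = torch.cat([prompt_ids, torch.tensor([gen_ids], device="cuda")], dim=1)
with torch.no_grad():
    logits = policy(full_ids).logits.float()
lp = torch.log_softmax(logits[:, :-1], dim=-1)
policy_lp = lp[0, prompt_ids.shape[1] - 1 :].gather(
    -1, torch.tensor(gen_ids, device="cuda").unsqueeze(-1)
).squeeze(-1)

print((vllm_lp - policy_lp.cpu()).abs().max())  # in theory ~0, in practice it is not
```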
I'm trying to write an RLHF trainer using unsloth, PPOTrainer, and vLLM. The problem is that calling model.eval() versus model.train() actually gives different logprobs (it can't be dropout, since the model in training mode gives stable logprobs across repeated calls, and temperature isn't the issue). This can cause problems such as a negative KL.
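The eval/train discrepancy itself can be checked by scoring the same sequence in both modes; below is a minimal sketch of that check (a plain transformers model as a stand-in for the unsloth policy, placeholder model name and prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids.to("cuda")

def token_logprobs(m):
    # log-prob of each token given its prefix, gradients disabled
    with torch.no_grad():
        logits = m(ids).logits.float()
    lp = torch.log_softmax(logits[:, :-1], dim=-1)
    return lp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

model.eval()
lp_eval = token_logprobs(model)
model.train()
lp_train = token_logprobs(model)  # stable across repeated calls here, so dropout is not the cause

print((lp_eval - lp_train).abs().max())  # expected ~0, but I observe a real gap
```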