Open serendipity800 opened 1 month ago
In theory, vllm.generate, the unsloth model with training = True, and the unsloth model with training = False should all give the same logits under the same policy and generation conditions (with top_k = 0), but in practice they differ (I copy the policy weights to vLLM before every rollout).
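To make the comparison concrete, here is roughly what I mean (a simplified sketch with a placeholder model, prompt, and settings, not my actual trainer): sample with vLLM and record its per-token logprobs, then re-score the exact same tokens with the policy's own forward pass and diff them.

```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
llm = LLM(model=model_name, dtype="bfloat16", gpu_memory_utilization=0.5)

# temperature=1.0 / top_p=1.0 / top_k=-1 so no sampling transform should reshape the distribution
params = SamplingParams(temperature=1.0, top_p=1.0, top_k=-1, max_tokens=32, logprobs=1)
prompt = "Explain RLHF in one sentence."
out = llm.generate([prompt], params)[0].outputs[0]
gen_ids = list(out.token_ids)
# depending on the vLLM version, logprob entries are Logprob objects or plain floats
vllm_lp = torch.tensor(
    [lp[t].logprob if hasattr(lp[t], "logprob") else lp[t] for lp, t in zip(out.logprobs, gen_ids)]
)

# re-score the same continuation with the policy itself (eval mode, no grad)
policy.eval()
prompt_ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
full_ids = torch.cat([prompt_ids, torch.tensor([gen_ids], device="cuda")], dim=1)
with torch.no_grad():
    logits = policy(full_ids).logits.float()
lp = torch.log_softmax(logits[:, :-1], dim=-1)
policy_lp = lp[0, prompt_ids.shape[1] - 1 :].gather(
    -1, torch.tensor(gen_ids, device="cuda").unsqueeze(-1)
).squeeze(-1)

print((vllm_lp - policy_lp.cpu()).abs().max())  # in theory ~0, in practice it is not
```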
I'm trying to write an RLHF trainer using unsloth, PPOTrainer, and vLLM. The problem is that calling model.eval() versus model.train() actually gives different logprobs (it can't be dropout, since the model in training mode gives stable logprobs across repeated calls, and temperature isn't the issue). This can cause problems such as a negative KL.
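The eval/train discrepancy itself can be checked by scoring the same sequence in both modes; below is a minimal sketch of that check (a plain transformers model as a stand-in for the unsloth policy, placeholder model name and prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids.to("cuda")

def token_logprobs(m):
    # log-prob of each token given its prefix, gradients disabled
    with torch.no_grad():
        logits = m(ids).logits.float()
    lp = torch.log_softmax(logits[:, :-1], dim=-1)
    return lp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

model.eval()
lp_eval = token_logprobs(model)
model.train()
lp_train = token_logprobs(model)  # stable across repeated calls here, so dropout is not the cause

print((lp_eval - lp_train).abs().max())  # expected ~0, but I observe a real gap
```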