huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

DPO rewards stuck at zero #1311

Closed: pankayaraj closed this issue 2 months ago

pankayaraj commented 6 months ago

While fine-tuning Llama with DPO from an SFT model trained with a LoRA config, I get the following behavior: both rewards stay at 0 and the loss never goes down.

    15 {'loss': 0.6932, 'learning_rate': 1.3724429223744293e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -93.45967864990234, 'logps/chosen': -76.77323150634766, 'logits/rejected': -1.6269360780715942, 'logits/chosen': -1.6115689277648926, 'epoch': 0.38}
    16 {'loss': 0.6932, 'learning_rate': 1.3670776255707763e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -96.17804718017578, 'logps/chosen': -74.35616302490234, 'logits/rejected': -1.6258790493011475, 'logits/chosen': -1.5984680652618408, 'epoch': 0.41}
    17 {'loss': 0.6932, 'learning_rate': 1.3617123287671234e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -98.59229278564453, 'logps/chosen': -73.85718536376953, 'logits/rejected': -1.6299934387207031, 'logits/chosen': -1.606400489807129, 'epoch': 0.45}
    18 {'loss': 0.6932, 'learning_rate': 1.3563470319634702e-05, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/rejected': -93.72036743164062, 'logps/chosen': -79.21975708007812, 'logits/rejected': -1.6140862703323364, 'logits/chosen': -1.5989283323287964, 'epoch': 0.49}

I used the following training arguments.

I tried with both fp16 and bf16.

    training_args = TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        remove_unused_columns=False,
        num_train_epochs=epochs,
        output_dir=save_dir,
        save_steps=1500,
        logging_first_step=True,
        logging_steps=5,
        learning_rate=1.41e-5,
        optim="rmsprop",
        warmup_steps=0,
        bf16=True,
        # fp16=True,
    )
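
For context, here is a minimal sketch of how arguments like these are typically wired into DPOTrainer with a LoRA adapter. The checkpoint path, dataset path, and LoRA hyperparameters below are illustrative placeholders, not values from this issue; with a peft_config, ref_model can be left as None and TRL uses the model with the adapters disabled as the reference.

    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOTrainer

    # Placeholder names: substitute your own SFT checkpoint and preference dataset.
    model_name = "path/to/llama-sft-checkpoint"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # common default; see the pad-token discussion later in this thread

    peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
    train_dataset = load_dataset("path/to/preference_dataset", split="train")  # needs prompt/chosen/rejected columns

    trainer = DPOTrainer(
        model,
        ref_model=None,          # with peft_config set, TRL disables the adapters to recover the reference model
        args=training_args,      # the TrainingArguments shown above
        beta=0.1,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
        max_length=512,
        max_prompt_length=256,
    )
    trainer.train()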
younesbelkada commented 6 months ago

Hmm, that's a bit weird, cc @kashif if you have any idea. Not sure if this is a duplicate of https://github.com/huggingface/trl/issues/1236 - might be a hyperparameter issue? (That issue, though, is about the loss, not the reward.)

AlexiaJM commented 6 months ago

@pankayaraj Did you find a solution? I have the same bug.

Mine is at loss=0.6931 instead of 0.6932, and logps/rejected goes down to crazy numbers near the end ('logps/rejected': -36880.9296875). I'm using Mistral 7B.

AlexiaJM commented 6 months ago

I solved the issue on my side by installing the dev build of trl from GitHub and the latest pip release of datasets.

kashif commented 6 months ago

Great! Let me know how it goes!

0nutation commented 6 months ago

@pankayaraj Did you find a solution? I have the same bug. My DPO training loss is stuck at 0.6931.

0nutation commented 6 months ago

@AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

jayachandrakalakutagar commented 6 months ago

> @AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

Hi, were you able to solve it? I'm pretty sure we are making some mistake, because on my first iteration I got good results.

0nutation commented 6 months ago

> @AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

> Hi, were you able to solve it? I'm pretty sure we are making some mistake, because on my first iteration I got good results.

I haven't solved it yet. What do you mean by "on my first iteration I got good results"? Why does that indicate a mistake? Where do you think the mistake might come from?

0nutation commented 6 months ago

> @AlexiaJM I have the same bug, but installing the dev build of trl from GitHub and the latest pip version of datasets doesn't work for me.

> Hi, were you able to solve it? I'm pretty sure we are making some mistake, because on my first iteration I got good results.

Did you solve it?

fc2869 commented 5 months ago

I had the same issue, getting the exact same loss as the OP (0.6932) for a couple of epochs. I resolved it by upgrading trl from 0.7.4 to 0.7.11 and also lowering the learning rate.

WeiXiongUST commented 5 months ago

Hi, I was working on reward modeling with Mistral over the past few weeks and encountered the same issue. The problem in my case was that the standard chat template prevents the model from handling multi-turn chat if you set the pad token to the eos token. Instead, the following modification solved the problem.

    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

Hope this helps!
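
In fuller context, that fix looks roughly like the sketch below; the Mistral checkpoint name is only an illustrative assumption.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative checkpoint; substitute the model you are actually training.
    model_name = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Add a dedicated [PAD] token instead of reusing the eos token,
    # then resize the embedding matrix to cover the new token id.
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))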

FangHainannn commented 4 months ago

Maybe you should check whether your ref_model is a static reference copy of model. There is the following code snippet in the DPOTrainer __init__():

        if ref_model:
            self.ref_model = ref_model
        elif self.is_peft_model or precompute_ref_log_probs:
            # The `model` with adapters turned off will be used as the reference model
            self.ref_model = None
        else:
            self.ref_model = create_reference_model(model) 

I fixed this problem by following what create_reference_model() does.


def create_reference_model(
    model: PreTrainedModelWrapper, num_shared_layers: int = None, pattern: str = None
) -> PreTrainedModelWrapper:
    """
    Creates a static reference copy of a model. Note that model will be in `.eval()` mode.

    Args:
        model (`PreTrainedModelWrapper`): The model to be copied.
        num_shared_layers (`int`, *optional*): The number of initial layers that are shared between both models and kept frozen.
        pattern (`str`, *optional*): The shared layers are selected with a string pattern
            (e.g. "transformer.h.{layer}" for GPT2) and if a custom pattern is necessary it can be passed here.

    Returns
        `PreTrainedModelWrapper`
    """
    if is_deepspeed_zero3_enabled():
        raise ValueError(
            "DeepSpeed ZeRO-3 is enabled and is not compatible with `create_reference_model()`. Please instantiate your reference model directly with `AutoCausalLM.from_pretrained()`."
        )

    parameter_names = [n for n, _ in model.named_parameters()]
    ref_model = deepcopy(model)

    # if no layers are shared, return copy of model
    if num_shared_layers is None:
        for param_name in parameter_names:
            param = ref_model.get_parameter(param_name)
            param.requires_grad = False
        return ref_model.eval()

    # identify layer name pattern
    if pattern is not None:
        pattern = pattern.format(layer=num_shared_layers)
    else:
        for pattern_candidate in LAYER_PATTERNS:
            pattern_candidate = pattern_candidate.format(layer=num_shared_layers)
            if any([pattern_candidate in name for name in parameter_names]):
                pattern = pattern_candidate
                break

    if pattern is None:
        raise ValueError("Layer pattern could not be matched.")

    # divide parameters in shared and unshared parameter lists
    shared_param_list = []
    unshared_param_list = []

    shared_parameter = True
    for name, param in model.named_parameters():
        if pattern in name:
            shared_parameter = False
        if shared_parameter:
            shared_param_list.append(name)
        else:
            unshared_param_list.append(name)

    # create reference of the original parameter if they are shared
    for param_name in shared_param_list:
        param = model.get_parameter(param_name)
        param.requires_grad = False

        ref_param = ref_model.get_parameter(param_name)  # noqa
        ref_param = param  # noqa

    # for all other parameters just make sure they don't use gradients
    for param_name in unshared_param_list:
        param = ref_model.get_parameter(param_name)
        param.requires_grad = False

    if pattern is not None and len(unshared_param_list) == 0:
        logging.warning("Pattern passed or found, but no layers matched in the model. Check for a typo.")

    return ref_model.eval()

younesbelkada commented 3 months ago

From the discussions above and internally, this could be solved by tweaking hyper-parameters. Can you try to play a bit with the learning rate and let us know how it goes?
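
As a minimal sketch of that suggestion: keep the rest of the configuration the same and lower only the learning rate. The value below is an illustrative starting point, not one recommended in this thread; DPO runs are commonly trained with learning rates well below 1e-5.

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        remove_unused_columns=False,
        num_train_epochs=1,            # placeholder
        output_dir="./dpo_out",        # placeholder
        logging_steps=5,
        learning_rate=5e-7,            # illustrative: try values in the ~5e-7 to ~5e-6 range
        warmup_steps=0,
        bf16=True,
    )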

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

pppa2019 commented 4 weeks ago

> I solved the issue on my side by installing the dev build of trl from GitHub and the latest pip release of datasets.

Thanks! It also works on my machine :)

lixiaochuan2020 commented 3 weeks ago

@FangHainannn 's solution works. Whether we use create_reference_model(), copy.deepcopy(), or just leave the ref_model argument of DPOTrainer empty (if it is empty, TRL creates a copy for you to use as the reference model), all of these work.

The mistake for me was that I wrongly passed model to the ref_model parameter, causing model and ref_model to point to the same object.
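
To make the pitfall concrete, here is a minimal sketch (model, tokenizer, train_dataset, and training_args are assumed to be set up as in the earlier sketch in this thread). When model and ref_model are the same object, the policy and reference log-probs are identical, so the implicit rewards beta * (logp_policy - logp_ref) are exactly 0 for both chosen and rejected, and the loss sits at -log sigmoid(0) = log 2 ≈ 0.6931, which is precisely the behavior reported above.

    from trl import DPOTrainer, create_reference_model

    # WRONG: passing the policy itself as the reference makes the two log-probs
    # identical, so rewards are 0.0 and the loss never moves from ~0.6931.
    # trainer = DPOTrainer(model, ref_model=model, args=training_args, ...)

    # Correct options (any one of these):
    ref_model = create_reference_model(model)        # frozen copy provided by TRL
    # import copy; ref_model = copy.deepcopy(model)  # plain deep copy also works
    # ref_model = None                               # let DPOTrainer create the copy itself

    trainer = DPOTrainer(
        model,
        ref_model=ref_model,
        args=training_args,
        beta=0.1,
        train_dataset=train_dataset,
        tokenizer=tokenizer,
    )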