huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Weird reward_modeling.py training loss and accuracy #937

Closed seanexp closed 8 months ago

seanexp commented 1 year ago

Hi, thanks for maintaining this awesome project.

I slightly modified examples/scripts/reward_modeling.py and found that the tracked training loss and accuracy are quite weird.

Here is my modified script.

from dataclasses import dataclass, field

import tyro
from accelerate import Accelerator
from datasets import load_dataset
from tqdm import tqdm
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from trl import RewardConfig, RewardTrainer

tqdm.pandas()

@dataclass
class ScriptArguments:
    model_name: str = "mistralai/Mistral-7B-v0.1"
    """the model name"""
    dataset_name: str = "Anthropic/hh-rlhf"
    """the dataset name"""
    dataset_text_field: str = "text"
    """the text field of the dataset"""
    eval_split: str = "none"
    """the dataset split to evaluate on; default to 'none' (no evaluation)"""
    load_in_8bit: bool = False
    """load the model in 8 bits precision"""
    load_in_4bit: bool = False
    """load the model in 4 bits precision"""
    trust_remote_code: bool = True
    """Enable `trust_remote_code`"""
    reward_config: RewardConfig = field(
        default_factory=lambda: RewardConfig(
            output_dir="output",
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            num_train_epochs=3,
            gradient_accumulation_steps=4,
            gradient_checkpointing=True,
            learning_rate=1.41e-5,
            report_to="tensorboard",
            remove_unused_columns=False,
            optim="adamw_torch",
            logging_steps=10,
            eval_steps=0.1,
            save_steps=0.25,
            bf16=True,
            evaluation_strategy="steps",
            max_length=2048,
        )
    )
    save_path: str = "/path/to/save"

args = tyro.cli(ScriptArguments)

# Step 1: Load the model
if args.load_in_8bit and args.load_in_4bit:
    raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
elif args.load_in_8bit or args.load_in_4bit:
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=args.load_in_8bit, load_in_4bit=args.load_in_4bit
    )
    # Copy the model to each device
    device_map = {"": Accelerator().local_process_index}
else:
    device_map = None
    quantization_config = None

model = AutoModelForSequenceClassification.from_pretrained(
    args.model_name,
    quantization_config=quantization_config,
    device_map=device_map,
    trust_remote_code=args.trust_remote_code,
    num_labels=1,
)

# Step 2: Load the dataset and pre-process it
tokenizer = AutoTokenizer.from_pretrained(args.model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id

# Tokenize chosen/rejected pairs of inputs
# Adapt this section to your needs for custom datasets
def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen, truncation=True)
        tokenized_rejected = tokenizer(rejected, truncation=True)

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(
            tokenized_rejected["attention_mask"]
        )

    return new_examples

# Preprocess the dataset and filter out examples that are longer than args.max_length
with Accelerator().main_process_first():
    dataset = load_dataset(args.dataset_name)
    train_dataset = dataset["train"]

    train_dataset = train_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=4,
    )
    train_dataset = train_dataset.filter(
        lambda x: len(x["input_ids_chosen"]) <= args.reward_config.max_length
        and len(x["input_ids_rejected"]) <= args.reward_config.max_length
    )

    eval_dataset = dataset["test"]
    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=4,
    )
    eval_dataset = eval_dataset.filter(
        lambda x: len(x["input_ids_chosen"]) <= args.reward_config.max_length
        and len(x["input_ids_rejected"]) <= args.reward_config.max_length
    )

# Step 4: Define the Trainer
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args.reward_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

trainer.accelerator.save_model(model, args.save_path)

Below are the training loss and accuracy:

[Screenshots: training loss and accuracy curves]

The training loss converges to 0.691, which suggests that the reward model cannot tell the difference between chosen and rejected.

Meanwhile, the accuracy on the test dataset is close to 1, which suggests the model can almost perfectly tell the difference between chosen and rejected.

Is this expected behavior?

younesbelkada commented 1 year ago

Hi @seanexp Thanks for the issue and for your message. I think this might indicate a bug in your dataset. Also, are you doing full fine-tuning or are you using QLoRA?

seanexp commented 1 year ago

Hi @younesbelkada !

I was using Anthropic/hh-rlhf from the Hugging Face Hub.

Also, I was doing full fine-tuning.

I also observed a similar pattern with the CarperAI/openai_summarize_comparisons dataset.

zhengyanzhao1997 commented 1 year ago

Same issue: 'loss': 0.6914, 'eval_loss': 0.69140625, 'eval_accuracy': 1.0.

The output logits of the chosen and rejected samples both keep growing (up to 300+) during training.

zhengyanzhao1997 commented 1 year ago

@seanexp Buddy, have you solved the problem?

seanexp commented 1 year ago

@zhengyanzhao1997

Still, I haven't. I tried digging into compute_loss and prediction_step but couldn't find the cause.
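
For context, the pairwise loss that RewardTrainer's compute_loss optimizes is roughly of this form (a simplified sketch, not the exact implementation; model and inputs stand for the reward model and a collated batch from the script above):

import torch.nn.functional as F

# Simplified sketch: score the chosen and rejected sequences separately,
# then push the chosen score above the rejected one via -log(sigmoid(margin)).
rewards_chosen = model(input_ids=inputs["input_ids_chosen"],
                       attention_mask=inputs["attention_mask_chosen"]).logits
rewards_rejected = model(input_ids=inputs["input_ids_rejected"],
                         attention_mask=inputs["attention_mask_rejected"]).logits
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()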

zhengyanzhao1997 commented 1 year ago

Me too... I tried adding an L2-norm penalty on the output logits, but it didn't work 😢

seanexp commented 1 year ago

Well, my rough guess is that the problem has nothing to do with regularization.

But sadly, I have no other hypothesis :(

ZHangZHengEric commented 1 year ago

When all inputs are padded to the same length, the loss decreases normally. [My experimental results]

But I don't understand why this is happening either.

seanexp commented 1 year ago

Hi @zhengyanzhao1997

I think I found one potential cause of weird RM behavior.

Currently, RewardTrainer expects tokenizer.padding_side == "right", as can be seen below.

https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py#L1453-L1456

However, some models' default tokenizer.padding_side is "left" (Mistral-7B for example).

So we should set padding_side = "right" before train/eval.
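
For example (a minimal sketch, reusing the tokenizer setup from the script above):

# Force right padding before constructing the RewardTrainer; some tokenizers
# (e.g. Mistral-7B's) default to left padding.
tokenizer = AutoTokenizer.from_pretrained(args.model_name)
tokenizer.padding_side = "right"
tokenizer.pad_token_id = tokenizer.eos_token_id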

@younesbelkada If you think this modification makes sense, I'll send a PR so that RewardTrainer raises an error if tokenizer.padding_side == "left".

seanexp commented 1 year ago

I also found out that examples/scripts/reward_modeling.py works as expected, but simply replacing facebook/opt-350m with EleutherAI/pythia-1.4b shows weird training loss and accuracy.

Now I suspect the weird behavior is due to the tokenizer or model config.

zhengyanzhao1997 commented 1 year ago

Hi @zhengyanzhao1997

I think I found one potential cause of weird RM behavior.

Currently, RewardTrainer expects tokenizer.padding_side == "right", as can be seen below.

https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py#L1453-L1456

However, some models' default tokenizer.padding_side is "left" (Mistral-7B for example).

So we should set padding_side = "right" before train/eval.

@younesbelkada If you think this modification makes sense, I'll send a PR so that RewardTrainer raises an error if tokenizer.padding_side == "left".

After you made these modifications, was your training successful?

seanexp commented 1 year ago

No, it wasn't. Still, simply changing the facebook/opt-350m model to something else leads to weird behavior.

supermancmk commented 1 year ago

No, it wasn't. Still, simply changing the facebook/opt-350m model to something else leads to weird behavior.

I have the same issue. Have you solved the problem? I also use Mistral-7B. The training loss is also 0.6934, the chosen and rejected scores are nearly identical (around 195), and the eval accuracy is 1.0.

seanexp commented 1 year ago

So we should set padding_side = "right" before train/eval.

@supermancmk Did you change to padding_side = "right"? Mistral-7B's default padding_side is "left".

If training still doesn't go well afterwards, please leave a comment here.

supermancmk commented 1 year ago

So we should set padding_side = "right" before train/eval.

@supermancmk Did you change to padding_side = "right"? Mistral-7B's default padding_side is "left".

If training still doesn't go well afterwards, please leave a comment here.

Yes, I changed tokenizer.padding_side = "right", but it's still the same problem.

seanexp commented 1 year ago

This is very weird... @younesbelkada do you have anything in mind?

supermancmk commented 1 year ago

@seanexp Did you solve the problem with padding_side = "right", or in some other way? And did you try a Llama model?

seanexp commented 1 year ago

I tried Mistral-7B, Pythia-1.4b, and BLOOM-560m, and only BLOOM-560m seems to work. I couldn't solve the problem with padding_side = "right".

supermancmk commented 1 year ago

I tried Mistral-7B, Pythia-1.4b, and BLOOM-560m, and only BLOOM-560m seems to work. I couldn't solve the problem with padding_side = "right".

Thanks~

seanexp commented 1 year ago

@zhengyanzhao1997 @supermancmk

I think I found a solution, and it seems to work (at least for the early phase of training).

I added nn.init.zeros_(model.score.weight) after model loading, and here are the logged metrics.
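
In code this is just the following (a minimal sketch; model.score is the name of the sequence-classification head for Mistral/Llama-style models in transformers):

import torch.nn as nn

# Zero-initialize the scalar reward head so chosen/rejected scores start at 0
# rather than at large random projections of the hidden states.
nn.init.zeros_(model.score.weight)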

[Screenshots: training loss and accuracy curves]

The experiments were done using the CarperAI/openai_summarize_comparisons dataset.

Please let me know if this solution works in your case.

zhengyanzhao1997 commented 1 year ago

@seanexp We successfully trained our 7B reward model with ZeRO-1 (we just switched DeepSpeed from ZeRO-3 to ZeRO-1).

supermancmk commented 1 year ago

@seanexp We successfully trained our 7B reward model with ZeRO-1 (we just switched DeepSpeed from ZeRO-3 to ZeRO-1).

Did you try DeepSpeed ZeRO-3? Does it work?

zhengyanzhao1997 commented 1 year ago

@seanexp We successfully trained our 7B reward model with ZeRO-1 (we just switched DeepSpeed from ZeRO-3 to ZeRO-1).

Did you try DeepSpeed ZeRO-3? Does it work?

NO😭

seanexp commented 1 year ago

@supermancmk

I tried DeepSpeed ZeRO-2. Well, I can't see any connection between the DeepSpeed ZeRO stage and this weird behavior...

ftgreat commented 1 year ago

I hit the same issue with ZeRO-3 using the deepspeed-chat step training scripts: loss = 0.69140625.

After nn.init.zeros_(model.score.weight), the issue persisted.

The direct cause is that the input to the sigmoid is zero, so the loss equals 0.69140625 when the dtype is bfloat16; it is not related to the ZeRO stage settings. In my case the issue came from data processing: the chosen and rejected fields were identical. With the data fixed, the loss decreased as expected.
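
For reference, a quick sanity check in PyTorch (a minimal sketch, not part of any of the training scripts above) shows where 0.69140625 comes from:

import torch
import torch.nn.functional as F

# If chosen and rejected get identical rewards, the margin is 0 and the
# pairwise loss is -log(sigmoid(0)) = log(2) ≈ 0.6931.
margin = torch.tensor(0.0)
loss = -F.logsigmoid(margin)
print(loss.item())  # ~0.6931

# log(2) rounded to bfloat16 is exactly 0.69140625, the value seen in the logs.
print(loss.to(torch.bfloat16).item())  # 0.69140625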

paraGONG commented 11 months ago

I hit the same issue with ZeRO-3 using the deepspeed-chat step training scripts: loss = 0.69140625. After nn.init.zeros_(model.score.weight), the issue persisted.

The direct cause is that the input to the sigmoid is zero, so the loss equals 0.69140625 when the dtype is bfloat16; it is not related to the ZeRO stage settings. In my case the issue came from data processing: the chosen and rejected fields were identical. With the data fixed, the loss decreased as expected.

Hello! What do you mean by "the chosen and rejected fields were identical"?

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Reason-Wang commented 10 months ago

I have the same issue, with loss = 0.6914. Has anyone solved the problem?

vwxyzjn commented 10 months ago

The training loss converges to 0.691, which suggests that the reward model cannot tell the difference between chosen and rejected.

How is this inferred?

I added nn.init.zeros_(model.score.weight) after model loading, and here are the logged metrics.

OpenAI's model init is the following, if you want to give it a try:

https://github.com/openai/summarize-from-feedback/blob/700967448d10004279f138666442bf1497d0e705/summarize_from_feedback/query_response_model.py#L105-L108
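
Something along these lines (a hedged sketch of that style of init, i.e. a small normal init with std 1/sqrt(hidden_size + 1); model.score is assumed to be the classification head, as for Mistral/Llama-style models):

import torch

# Normal init of the reward head with std = 1 / sqrt(hidden_size + 1),
# in the spirit of the linked summarize-from-feedback code.
hidden_size = model.config.hidden_size
torch.nn.init.normal_(model.score.weight, std=1.0 / (hidden_size + 1) ** 0.5)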

Meanwhile, the accuracy on the test dataset is close to 1, which suggests the model can almost perfectly tell the difference between chosen and rejected.

This doesn't sound very likely. Can you try evaluating the model manually on, say, 100 samples to see how it actually does?
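
For instance, something like this (a rough sketch reusing the eval_dataset built in the script at the top of the thread):

import torch

# Manually score 100 chosen/rejected pairs and count how often chosen wins.
model.eval()
n, correct = 100, 0
for example in eval_dataset.select(range(n)):
    chosen = torch.tensor([example["input_ids_chosen"]], device=model.device)
    rejected = torch.tensor([example["input_ids_rejected"]], device=model.device)
    with torch.no_grad():
        r_chosen = model(input_ids=chosen).logits[0, 0]
        r_rejected = model(input_ids=rejected).logits[0, 0]
    correct += int(r_chosen > r_rejected)
print(f"manual pairwise accuracy: {correct / n:.2f}")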

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

linyongver commented 9 months ago

Hi, I only achieve about 60% accuracy on the HH eval set using the example reward modeling script (I have tried facebook/opt-350m and Mistral-7B) as the reward model. Does anyone know why?

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

chongxiaoc commented 8 months ago

I hit the same issue here with DeepSpeed stage 3.

I have the same issue, with loss = 0.6914. Has anyone solved the problem?

hank0316 commented 6 months ago

I have the same issue. I trained Pythia models (410m, 1.4b) with RewardTrainer, but the loss is just stuck at 0.69. I printed the rewards of the chosen and rejected completions and they are really close, which may indicate that the model did not learn to distinguish between them. I've checked that the chosen and rejected completions are not the same. Does anybody know exactly why this happens?

@younesbelkada do you have any thought about this issue?

DSKSD commented 6 months ago

Hello, I've fixed the issue by specifying model.config.pad_token_id = tokenizer.pad_token_id. (Make sure your model.config.pad_token_id is not None.)

If model.config.pad_token_id is None, the model uses the last position of the (padded) sequence for the classification head instead of the last non-padding token. Please take a look at the internal code of LlamaForSequenceClassification: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1414
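
As a minimal sketch of that fix (reusing the tokenizer and model names from the script at the top of the thread):

# Make sure the model config knows the real padding token id; otherwise the
# *ForSequenceClassification head falls back to the last position of the
# (possibly padded) sequence when pooling the final hidden state.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token  # common fallback for causal LMs
model.config.pad_token_id = tokenizer.pad_token_id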

scottsuk0306 commented 2 months ago

Some updates on @DSKSD's comments!

If model.config.pad_token_id is None, the model uses the last position of the (padded) sequence for the classification head instead of the last non-padding token. Please take a look at the internal code of LlamaForSequenceClassification: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1414

Use this link (I fixed the commit id): https://github.com/huggingface/transformers/blob/0a7af19f4dc868bafc82f35eb7e8d13bac87a594/src/transformers/models/llama/modeling_llama.py#L1390

Also, I verified that this can be the explanation behind https://github.com/huggingface/trl/issues/937#issuecomment-1803332120 👀