huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

trl/examples/scripts/reward_modeling.py fails with streaming=True (for an IterableDataset) #1054

Closed: elliotttruestate closed this issue 9 months ago

elliotttruestate commented 11 months ago

Hi,

I am trying to apply reward modelling to an IterableDataset and am running into a failure mode that I am struggling to debug. I can reproduce the same stack trace with the reward_modeling.py example below by making the following changes:

- load_dataset -> add streaming=True
- train_dataset.map -> remove num_proc, since it is not valid for an IterableDataset
- ScriptArguments.reward_config -> add max_steps=10000, which is required for an IterableDataset

The code gives the error 'TypeError: Can only concatenate tensors but got <class 'bool'>'. I have attached the code and the stack trace below, but it is difficult to debug inside the trainer.train() call. I thought it might be the lazy evaluation creating new columns and confusing the trainer, but even after forcing the new columns to be created at initialization, training gives the same error.

If anyone has any ideas on why this happens, please let me know. Thank you!

from dataclasses import dataclass, field
from typing import Optional

import tyro
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig

from trl import RewardConfig, RewardTrainer, is_xpu_available

tqdm.pandas()

@dataclass
class ScriptArguments:
    # model_name: str = "facebook/opt-350m"
    model_name: str = "bert-base-cased"
    """the model name"""
    dataset_name: str = "Anthropic/hh-rlhf"
    """the dataset name"""
    dataset_text_field: str = "text"
    """the text field of the dataset"""
    eval_split: str = "none"
    """the dataset split to evaluate on; default to 'none' (no evaluation)"""
    load_in_8bit: bool = False
    """load the model in 8 bits precision"""
    load_in_4bit: bool = False
    """load the model in 4 bits precision"""
    trust_remote_code: bool = True
    """Enable `trust_remote_code`"""
    reward_config: RewardConfig = field(
        default_factory=lambda: RewardConfig(
            output_dir="output",
            per_device_train_batch_size=64,
            num_train_epochs=1,
            max_steps=10000,  # needed for an IterableDataset, which has no __len__
            gradient_accumulation_steps=16,
            gradient_checkpointing=True,
            gradient_checkpointing_kwargs={"use_reentrant": False},
            learning_rate=1.41e-5,
            report_to="tensorboard",
            remove_unused_columns=False,
            optim="adamw_torch",
            logging_steps=500,
            evaluation_strategy="no",
            max_length=512,
        )
    )
    use_peft: bool = False
    """whether to use peft"""
    peft_config: Optional[LoraConfig] = field(
        default_factory=lambda: LoraConfig(
            r=16,
            lora_alpha=16,
            bias="none",
            task_type="SEQ_CLS",
            modules_to_save=["scores"],
        ),
    )

# args = tyro.cli(ScriptArguments)
args = ScriptArguments()
args.reward_config.evaluation_strategy = "steps" if args.eval_split != "none" else "no"

# Step 1: Load the model
if args.load_in_8bit and args.load_in_4bit:
    raise ValueError("You can't load the model in 8 bits and 4 bits at the same time")
elif args.load_in_8bit or args.load_in_4bit:
    quantization_config = BitsAndBytesConfig(load_in_8bit=args.load_in_8bit, load_in_4bit=args.load_in_4bit)
    # Copy the model to each device
    device_map = (
        {"": f"xpu:{Accelerator().local_process_index}"}
        if is_xpu_available()
        else {"": Accelerator().local_process_index}
    )
else:
    device_map = None
    quantization_config = None

model = AutoModelForSequenceClassification.from_pretrained(
    args.model_name,
    quantization_config=quantization_config,
    device_map=device_map,
    trust_remote_code=args.trust_remote_code,
    num_labels=1,
)

# Step 2: Load the dataset and pre-process it
tokenizer = AutoTokenizer.from_pretrained(args.model_name)
train_dataset = load_dataset(args.dataset_name, split="train", streaming=True)

# Tokenize chosen/rejected pairs of inputs
# Adapt this section to your needs for custom datasets
def preprocess_function(examples):
    new_examples = {
        "input_ids_chosen": [],
        "attention_mask_chosen": [],
        "input_ids_rejected": [],
        "attention_mask_rejected": [],
    }
    for chosen, rejected in zip(examples["chosen"], examples["rejected"]):
        tokenized_chosen = tokenizer(chosen)
        tokenized_rejected = tokenizer(rejected)

        new_examples["input_ids_chosen"].append(tokenized_chosen["input_ids"])
        new_examples["attention_mask_chosen"].append(tokenized_chosen["attention_mask"])
        new_examples["input_ids_rejected"].append(tokenized_rejected["input_ids"])
        new_examples["attention_mask_rejected"].append(tokenized_rejected["attention_mask"])

    return new_examples

# Preprocess the dataset and filter out examples that are longer than args.max_length
train_dataset = train_dataset.map(
    preprocess_function,
    batched=True,
    # num_proc is omitted here: IterableDataset.map() does not accept it
)
train_dataset = train_dataset.filter(
    lambda x: len(x["input_ids_chosen"]) <= args.reward_config.max_length
    and len(x["input_ids_rejected"]) <= args.reward_config.max_length
)

if args.eval_split == "none":
    eval_dataset = None
else:
    eval_dataset = load_dataset(args.dataset_name, split=args.eval_split)

    eval_dataset = eval_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=4,
    )
    eval_dataset = eval_dataset.filter(
        lambda x: len(x["input_ids_chosen"]) <= args.reward_config.max_length
        and len(x["input_ids_rejected"]) <= args.reward_config.max_length
    )

# Step 4: Define the LoraConfig
if args.use_peft:
    peft_config = args.peft_config
else:
    peft_config = None

# Step 5: Define the Trainer
trainer = RewardTrainer(
    model=model,
    tokenizer=tokenizer,
    args=args.reward_config,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
)

trainer.train()
TypeError                                 Traceback (most recent call last)
Cell In[93], line 176
    166 # Step 5: Define the Trainer
    167 trainer = RewardTrainer(
    168     model=model,
    169     tokenizer=tokenizer,
   (...)
    173     peft_config=peft_config,
    174 )
--> 176 trainer.train()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1553         hf_hub_utils.enable_progress_bars()
   1554 else:
-> 1555     return inner_training_loop(
   1556         args=args,
   1557         resume_from_checkpoint=resume_from_checkpoint,
   1558         trial=trial,
   1559         ignore_keys_for_eval=ignore_keys_for_eval,
   1560     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1838, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1835     rng_to_sync = True
   1837 step = -1
-> 1838 for step, inputs in enumerate(epoch_iterator):
   1839     total_batched_samples += 1
   1840     if rng_to_sync:

File /opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py:642, in DataLoaderDispatcher.__iter__(self)
    640 self._stop_iteration = False
    641 first_batch = None
--> 642 next_batch, next_batch_info = self._fetch_batches(main_iterator)
    643 batch_index = 0
    644 while not stop_iteration:

File /opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py:605, in DataLoaderDispatcher._fetch_batches(self, iterator)
    603     for _ in range(self.state.num_processes):
    604         batches.append(next(iterator))
--> 605     batch = concatenate(batches, dim=0)
    606 # In both cases, we need to get the structure of the batch that we will broadcast on other
    607 # processes to initialize the tensors with the right shape.
    608 # data_structure, stop_iteration
    609 batch_info = [get_data_structure(batch), False]

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:519, in concatenate(data, dim)
    517     return honor_type(data[0], (concatenate([d[i] for d in data], dim=dim) for i in range(len(data[0]))))
    518 elif isinstance(data[0], Mapping):
--> 519     return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
    520 elif not isinstance(data[0], torch.Tensor):
    521     raise TypeError(f"Can only concatenate tensors but got {type(data[0])}")

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:519, in <dictcomp>(.0)
    517     return honor_type(data[0], (concatenate([d[i] for d in data], dim=dim) for i in range(len(data[0]))))
    518 elif isinstance(data[0], Mapping):
--> 519     return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
    520 elif not isinstance(data[0], torch.Tensor):
    521     raise TypeError(f"Can only concatenate tensors but got {type(data[0])}")

File /opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py:521, in concatenate(data, dim)
    519     return type(data[0])({k: concatenate([d[k] for d in data], dim=dim) for k in data[0].keys()})
    520 elif not isinstance(data[0], torch.Tensor):
--> 521     raise TypeError(f"Can only concatenate tensors but got {type(data[0])}")
    522 return torch.cat(data, dim=dim)

TypeError: Can only concatenate tensors but got <class 'bool'>
elliotttruestate commented 11 months ago

If I add a print(batches) just before the concatenate call (data_loader.py, around line 604), I get the output below. Should the parameter 'return_loss': True be here, and should the function be concatenating across dictionaries like this? A minimal sketch that reproduces the failure follows the output.

[{'input_ids_chosen': tensor([[ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0],
        ...,
        [ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0]]), 'attention_mask_chosen': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'input_ids_rejected': tensor([[ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0],
        ...,
        [ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0],
        [ 101, 4243,  131,  ...,    0,    0,    0]]), 'attention_mask_rejected': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'return_loss': True}]
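
A minimal, self-contained sketch of the mechanism (illustrative only, not part of the training script): accelerate's concatenate() in accelerate/utils/operations.py recurses into every dictionary value and only knows how to handle tensors (and nested mappings/sequences of tensors), so the plain Python bool stored under 'return_loss' is what raises the TypeError.

import torch
from accelerate.utils.operations import concatenate

# Two per-process batches shaped like the collator output printed above:
# every value is a tensor except the extra 'return_loss' flag.
batches = [
    {"input_ids_chosen": torch.zeros(2, 4, dtype=torch.long), "return_loss": True},
    {"input_ids_chosen": torch.ones(2, 4, dtype=torch.long), "return_loss": True},
]

# concatenate() recurses per key; for "return_loss" it receives [True, True],
# which is neither a tensor nor a mapping/sequence it can descend into.
concatenate(batches, dim=0)
# TypeError: Can only concatenate tensors but got <class 'bool'>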
elliotttruestate commented 11 months ago

Commenting out the return_loss key on the batch dictionary in RewardDataCollatorWithPadding (trainer/utils.py line 256) resolves this error. However, I have no idea what this parameter does, so I don't know whether this should become a pull request. If someone more involved understands how this function is meant to work, please let me know. I'll keep this issue open in the meantime in case anyone else comes across it.
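
For anyone who wants to try the same workaround without patching the installed package, here is a hedged sketch (assuming the RewardDataCollatorWithPadding and RewardTrainer interfaces referenced above): wrap the collator and drop the key before the batch reaches accelerate. Whether removing return_loss has side effects inside RewardTrainer is exactly the open question, so treat this as a debugging aid rather than a fix.

from trl.trainer.utils import RewardDataCollatorWithPadding

class RewardCollatorWithoutReturnLoss(RewardDataCollatorWithPadding):
    """Same padding logic as the default collator, minus the plain-bool
    'return_loss' entry that accelerate's concatenate() cannot handle."""

    def __call__(self, features):
        batch = super().__call__(features)
        batch.pop("return_loss", None)
        return batch

# Usage: pass it explicitly when constructing the trainer, e.g.
# trainer = RewardTrainer(..., data_collator=RewardCollatorWithoutReturnLoss(tokenizer=tokenizer))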

Thank you.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Trangle commented 10 months ago

Might be the same issue as TypeError: Can only concatenate tensors but got <class 'bool'> when using dpo_trainer: the map to token ids does not remove the original columns! (Sketch below.)
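
A sketch of that suggestion applied to the streaming script above (my assumption of what is meant; note it is a separate concern from the return_loss bool): pass remove_columns to map() so the raw text columns never reach the collator.

from datasets import load_dataset

train_dataset = load_dataset("Anthropic/hh-rlhf", split="train", streaming=True)

# remove_columns drops the original string columns after tokenization, so only
# the tokenized columns are left for the data collator.
train_dataset = train_dataset.map(
    preprocess_function,  # the tokenizing function from the script above
    batched=True,
    remove_columns=["chosen", "rejected"],
)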

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

armsp commented 5 months ago

I think this issue should be reopened, because reward modelling and DPO still do not work with an IterableDataset; the error persists.