huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

[bug]: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 #229

Closed: tontan1998 closed this issue 1 year ago

tontan1998 commented 1 year ago

Hello! I tried to use gpt2-sentiment_peft.py and changed the model from edbeeching/gpt-neo-125M-imdb to edbeeching/gpt-neo-1.3B-imdb, but it is not working.

# CUDA_VISIBLE_DEVICES=0 python -i gpt2-sentiment_peft.py

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/mpi/lib64'), PosixPath('/usr/local/nvidia/lib')}
  warn(msg)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/local/lib:/usr/local/mpi/lib:/usr/local/mpi/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8888'), PosixPath('http'), PosixPath('//5dd70494425c')}
  warn(msg)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('America/Los_Angeles')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
Found cached dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-b13b5100b4cec2ba.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-3ba5836fcade7393.arrow
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
trainable params: 3147777 || all params: 1318723585 || trainable%: 0.23869877173691406
0it [00:00, ?it/s]/usr/local/lib/python3.8/dist-packages/transformers/models/gpt_neo/modeling_gpt_neo.py:195: UserWarning: where received a uint8 condition tensor. This behavior is deprecated and will be removed in a future version of PyTorch. Use a boolean condition instead. (Triggered internally at ../aten/src/ATen/native/TensorCompare.cpp:413.)
  attn_weights = torch.where(causal_mask, attn_weights, mask_value)
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "gpt2-sentiment_peft.py", line 227, in <module>
    response = ppo_trainer.generate(query, **generation_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/trl/trainer/ppo_trainer.py", line 365, in generate
    response = self.accelerator.unwrap_model(self.model).generate(
  File "/usr/local/lib/python3.8/dist-packages/trl/models/modeling_value_head.py", line 195, in generate
    return self.pretrained_model.generate(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/peft/peft_model.py", line 580, in generate
    return self.base_model.generate(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2504, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
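
For reference, the failing call is torch.multinomial on a probability tensor that contains NaN/Inf; a minimal illustration (made-up values) triggers the same message:

import torch

probs = torch.tensor([[0.5, float("nan"), 0.25]])
torch.multinomial(probs, num_samples=1)
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0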

source code

# coding=utf-8
# Copyright 2022 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from dataclasses import dataclass, field
from typing import Optional

import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, pipeline

from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer, set_seed
from trl.core import LengthSampler

tqdm.pandas()

########################################################################
# This is a fully working simple example to use trl with accelerate.
#
# This example fine-tunes a GPT2 model on the IMDB dataset using PPO
# (proximal policy optimization) in any of the following settings (with the
# same script):
#   - single CPU or single GPU
#   - multi GPUs (using PyTorch distributed mode)
#   - multi GPUs (using DeepSpeed ZeRO-Offload stages 1 & 2)
#   - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, first initialize the accelerate
# configuration with `accelerate config`
#
########################################################################

########################################################################
# NOTE: to train with an 8-bit model, a more recent version of
# transformers is required. Full dependencies for this example:
# pip install  bitsandbytes datasets accelerate loralib
# pip install  git+https://github.com/huggingface/transformers.git@main
# pip install git+https://github.com/huggingface/peft.git
########################################################################

# We first define the configuration of the experiment, defining the model, the dataset,
# the training parameters, and the PPO parameters.
# Check the default arguments in the `PPOConfig` class for more details.
# If you want to log with tensorboard, add the kwarg
# `accelerator_kwargs={"logging_dir": PATH_TO_LOGS}` to the PPOConfig.

# Define and parse arguments.
@dataclass
class ScriptArguments:
    """
    The name of the causal LM model we wish to fine-tune with PPO.
    """

    # NOTE: gpt2 models use Conv1D instead of Linear layers, which are not yet supported in 8-bit mode;
    # models like the gpt-neo* family are more suitable.
    model_name: Optional[str] = field(default="edbeeching/gpt-neo-1.3B-imdb", metadata={"help": "the model name"})
    log_with: Optional[str] = field(default=None, metadata={"help": "use 'wandb' to log with wandb"})
    learning_rate: Optional[float] = field(default=1.41e-5, metadata={"help": "the learning rate"})
    mini_batch_size: Optional[int] = field(default=16, metadata={"help": "the PPO minibatch size"})
    batch_size: Optional[int] = field(default=256, metadata={"help": "the batch size"})
    gradient_accumulation_steps: Optional[int] = field(
        default=1, metadata={"help": "the number of gradient accumulation steps"}
    )

parser = HfArgumentParser(ScriptArguments)
script_args = parser.parse_args_into_dataclasses()[0]

config = PPOConfig(
    model_name=script_args.model_name,
    learning_rate=script_args.learning_rate,
    log_with=script_args.log_with,
    mini_batch_size=script_args.mini_batch_size,
    batch_size=script_args.batch_size,
    gradient_accumulation_steps=script_args.gradient_accumulation_steps,
)

# We then define the arguments to pass to the sentiment analysis pipeline.
# We set `return_all_scores` to True to get the score for every sentiment class, not just the top one.
sent_kwargs = {"return_all_scores": True, "function_to_apply": "none", "batch_size": config.mini_batch_size}

# Below is an example function to build the dataset. In our case, we use the IMDB dataset
# from the `datasets` library. One should customize this function to train the model on
# its own dataset.
def build_dataset(config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8):
    """
    Build dataset for training. This builds the dataset from `load_dataset`, one should
    customize this function to train the model on its own dataset.

    Args:
        dataset_name (`str`):
            The name of the dataset to be loaded.

    Returns:
        dataset (`datasets.Dataset`):
            The dataset to be used for training.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # load imdb with datasets
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200, batched=False)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize, batched=False)
    ds.set_format(type="torch")
    return ds

# We retrieve the dataset by calling the `build_dataset` function.
dataset = build_dataset(config)

def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

# set seed before initializing value head for deterministic eval
set_seed(config.seed)

# Now let's build the main base model! We'll use the `AutoModelForCausalLM` class and load the model in 8 bit mode.
current_device = Accelerator().process_index

pretrained_model = AutoModelForCausalLM.from_pretrained(
    config.model_name, load_in_8bit=True, device_map={"": current_device}
)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# Apply LoRA
# Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`.
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

pretrained_model = prepare_model_for_int8_training(pretrained_model, layer_norm_names=[])
pretrained_model = get_peft_model(pretrained_model, lora_config)

model = AutoModelForCausalLMWithValueHead.from_pretrained(pretrained_model)
print_trainable_parameters(model)
model.train()

# The GPT-2 tokenizer does not define a pad token by default, so we set it to the eos_token
# (only needed for this model family).
tokenizer.pad_token = tokenizer.eos_token
# We then build the PPOTrainer, passing the model, the reference model, the tokenizer
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer, dataset=dataset, data_collator=collator)

# We then build the sentiment analysis pipeline, passing the model name and the
# sentiment analysis pipeline arguments. Let's also make sure to set the device
# to the same device as the PPOTrainer.
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = current_device if torch.cuda.is_available() else "cpu"  # to avoid a `pipeline` bug
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb", device=device)

# We then define the arguments to pass to the `generate` function. These arguments
# are passed to the `generate` function of the PPOTrainer, which is a wrapper around
# the `generate` function of the trained model.
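# Note: `top_k=0.0` and `top_p=1.0` disable top-k and nucleus filtering, so we
# sample from the full distribution; `eos_token_id=-1` is an invalid id, so
# generation does not stop early at EOS and runs for `max_new_tokens` tokens.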
generation_kwargs = {
    "min_length": -1,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.eos_token_id,
    "eos_token_id": -1,
}
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # cache and gradient checkpointing are not compatible, so we switch them on and off here
    model.gradient_checkpointing_disable()
    model.pretrained_model.config.use_cache = True
    # Get response from Causal LM
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        response_tensors.append(response.squeeze()[-gen_len:])
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in response_tensors]

    # Compute sentiment score
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
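    # each pipeline output is a list of {label, score} dicts; index 1 is the POSITIVE class of lvwerra/distilbert-imdb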
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Run PPO step
    model.gradient_checkpointing_enable()
    model.pretrained_model.config.use_cache = False

    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

model.push_to_hub(f"{script_args.model_name}-ppo-sentiment")
tontan1998 commented 1 year ago

Here are the libraries: (screenshot of installed package versions, not preserved)

lvwerra commented 1 year ago

Looks like the model collapsed. Have you tried changing the LR for the larger model? Training logs would also be useful. Maybe @edbeeching knows a good setting for the larger models.
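
Since the script parses its hyperparameters with `HfArgumentParser`, the learning rate can be lowered from the command line without editing the code, e.g. (values illustrative):

CUDA_VISIBLE_DEVICES=0 python gpt2-sentiment_peft.py --learning_rate 1e-6 --batch_size 64 --mini_batch_size 4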

akk-123 commented 1 year ago

@tontan1998 did you solve it? I also get this error

younesbelkada commented 1 year ago

Hi @akk-123 @tontan1998, I did not manage to reproduce the issue with edbeeching/gpt-neo-1.3B-imdb. Can you try with the most up-to-date script, as it has changed a bit? I will also share with you the output of pip freeze.

younesbelkada commented 1 year ago
accelerate==0.18.0
torch==2.0.0
transformers==4.28.0.dev0
peft==0.3.0.dev0
lvwerra commented 1 year ago

Closing the issue for now, feel free to re-open if there's an update.

Opdoop commented 1 year ago
accelerate==0.18.0
torch==2.0.0
transformers==4.28.0.dev0
peft==0.3.0.dev0

I met the same problem in a Docker image environment with these package versions. The bug went away after switching to a local machine. The code, model, and package versions are all the same, but in the Docker image the code hits the bug.

WakingHours-GitHub commented 1 year ago

I have the same issue; I get the same error when I run it. (screenshot not preserved)

glsoon commented 1 year ago

I have the same issue using GLM with trl; outside of trl it works well.

suzyahyah commented 1 year ago

I found that the model weights go to NaN after some training steps, causing model(input_ids) to output NaN.

This means that the gradients being added to the model weights contain NaNs. I simply changed the default option max_grad_norm=None to max_grad_norm=1 in PPOConfig.

My PPOConfig looks like

PPOConfig(batch_size=args.generator.batch_size, seed=args.seed, max_grad_norm=1)
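
To confirm the diagnosis, a minimal sketch (hypothetical helper, not part of trl) that flags parameters which have gone non-finite, e.g. after each ppo_trainer.step(...):

import torch

def find_nonfinite_params(model):
    # return the names of parameters that contain NaN or Inf values
    return [
        name
        for name, param in model.named_parameters()
        if not torch.isfinite(param).all()
    ]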

w32zhong commented 1 year ago

I have encountered the same issue. My solution is to reduce the learning rate and disable the "score scaling" feature; the latter uses the std directly in the denominator, which is very risky imo. We should add an epsilon there? @lvwerra @younesbelkada
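
For illustration, the risky operation is standardizing rewards by their running std; a sketch of the guarded form (epsilon value illustrative):

import torch

def scale_scores(scores: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # standardize rewards, guarding against a (near-)zero std in the denominator
    return (scores - scores.mean()) / (scores.std() + eps)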

lvwerra commented 1 year ago

Would be a viable option for me. What do you think @zfang?

zfang commented 1 year ago

Would be a viable option for me. What do you think @zfang?

@lvwerra Created https://github.com/huggingface/trl/pull/727

pearlmary commented 5 months ago
accelerate==0.18.0
torch==2.0.0
transformers==4.28.0.dev0
peft==0.3.0.dev0

I met the same problem in a Docker image environment with these package versions. The bug went away after switching to a local machine. The code, model, and package versions are all the same, but in the Docker image the code hits the bug.

How did you solve it in the Docker environment?

surak commented 1 month ago

I have this error running Mistral-Nemo-2407 on two RTX 3090 24 GB GPUs with FastChat.

If I use only one GPU with CPU offloading, this problem does not come up, but I get OOMs with long requests.
If I use only 1 gpu and cpu offloading, this problem does not come up, but I have OOMs with long requests.