meta-llama / llama

Inference code for Llama models

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 #380

Open Liyan06 opened 1 year ago

Liyan06 commented 1 year ago
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

inputs = ...
inputs = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)

model.generate(**inputs, **generate_kwargs)

RuntimeError: probability tensor contains either inf, nan or element < 0

I got this error while doing inference for text generation, in particular when the batch size is greater than 1. I do not get this error, and generation works correctly, when the batch size is set to 1.

Does anyone see the same issue?
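
For reference, here is a minimal end-to-end sketch of the batched call pattern described above; the prompts, the pad-token handling, and the generation arguments are illustrative placeholders, not the reporter's actual values:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Llama tokenizers ship without a pad token, so padded batches need one set explicitly.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Hello, how are you?", "Summarize the plot of Hamlet in one sentence."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))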

Feng-Jay commented 2 months ago

Here's my previous code; when it runs, this error is reported:

RuntimeError: probability tensor contains either inf, nan or element < 0

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()

# here is the code for batch inference
# ...

I modified model.half() to model.bfloat16() and the error was solved. I'm guessing that there are some problems with Llama 2's weights at FP16...

It works! But I'm still curious about the cause of this error and why this change fixes it.
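
A likely explanation: fp16 overflows above roughly 65504, while bf16 shares fp32's exponent range, so activations or logits that overflow to inf under .half() turn into nan probabilities after softmax. Below is a minimal sketch (not the commenter's exact code) of loading the weights directly in bfloat16; the checkpoint path is a placeholder, as in the snippet above:

import torch
from transformers import LlamaForCausalLM

# Load the weights straight into bfloat16 instead of converting with model.half() afterwards.
model = LlamaForCausalLM.from_pretrained(
    "PATH_TO_CONVERTED_WEIGHTS",  # placeholder path
    torch_dtype=torch.bfloat16,
).cuda()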

xiaoxin83121 commented 1 month ago

I find if I set the num_beams > 1, both llama and llama2 suffer from the mentioned error.

same for me (working with llama3.1 and llava), weird

hhnqqq commented 1 month ago

> I find if I set the num_beams > 1, both llama and llama2 suffer from the mentioned error.
>
> same for me (working with llama3.1 and llava), weird

The transformers code and weights have bugs for fp16 and bf16. You can try using bf16, and modify the code below

def _prepare_4d_causal_attention_mask_with_cache_position(
    attention_mask: torch.Tensor,
    sequence_length: int,
    target_length: int,
    dtype: torch.dtype,
    device: torch.device,
    min_dtype: float,
    cache_position: torch.Tensor,
    batch_size: int,
):
    """
    Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
    `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.

    Args:
        attention_mask (`torch.Tensor`):
            A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
        sequence_length (`int`):
            The sequence length being processed.
        target_length (`int`):
            The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
        dtype (`torch.dtype`):
            The dtype to use for the 4D attention mask.
        device (`torch.device`):
            The device to place the 4D attention mask on.
        min_dtype (`float`):
            The minimum value representable with the dtype `dtype`.
        cache_position (`torch.Tensor`):
            Indices depicting the position of the input sequence tokens in the sequence.
        batch_size (`torch.Tensor`):
            Batch size.
    """
    if attention_mask is not None and attention_mask.dim() == 4:
        # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
        causal_mask = attention_mask
    else:
        causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
        if sequence_length != 1:
            causal_mask = torch.triu(causal_mask, diagonal=1)
        causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
        causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
        if attention_mask is not None:
            causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
            mask_length = attention_mask.shape[-1]
            padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
            padding_mask = padding_mask == 0
            causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
                padding_mask, min_dtype
            )

    return causal_mask

to

def _prepare_4d_causal_attention_mask_with_cache_position(
    attention_mask: torch.Tensor,
    sequence_length: int,
    target_length: int,
    dtype: torch.dtype,
    device: torch.device,
    min_dtype: float,
    cache_position: torch.Tensor,
    batch_size: int,
):
    """
    Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
    `(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.

    Args:
        attention_mask (`torch.Tensor`):
            A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
        sequence_length (`int`):
            The sequence length being processed.
        target_length (`int`):
            The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
        dtype (`torch.dtype`):
            The dtype to use for the 4D attention mask.
        device (`torch.device`):
            The device to place the 4D attention mask on.
        min_dtype (`float`):
            The minimum value representable with the dtype `dtype`.
        cache_position (`torch.Tensor`):
            Indices depicting the position of the input sequence tokens in the sequence.
        batch_size (`torch.Tensor`):
            Batch size.
    """
    if attention_mask is not None and attention_mask.dim() == 4:
        # In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
        causal_mask = attention_mask
    else:
        causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, device=device)
        if sequence_length != 1:
            causal_mask = torch.triu(causal_mask, diagonal=1)
        causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
        causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1).to(dtype)
        if attention_mask is not None:
            causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
            mask_length = attention_mask.shape[-1]
            padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
            padding_mask = padding_mask == 0
            causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
                padding_mask, min_dtype
            )

    return causal_mask

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py (around line 59)
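
The only functional difference between the two versions is where the cast to the low-precision dtype happens: the original builds the mask directly in fp16/bf16, while the modified version builds it in the default fp32 (min_dtype is a Python float) and casts only at the end. A self-contained sketch of just that difference, with illustrative values standing in for the real arguments:

import torch

dtype = torch.float16
device = "cpu"
sequence_length = target_length = 4
min_dtype = torch.finfo(dtype).min  # a Python float

# Original version: the mask is created directly in the low-precision dtype,
# so the subsequent mask arithmetic happens in fp16.
mask_lowp = torch.full((sequence_length, target_length), fill_value=min_dtype,
                       dtype=dtype, device=device)
mask_lowp = torch.triu(mask_lowp, diagonal=1)

# Suggested change: create the mask in the default fp32, do the arithmetic there,
# and cast to the target dtype only at the end.
mask_fp32 = torch.full((sequence_length, target_length), fill_value=min_dtype,
                       device=device)
mask_fp32 = torch.triu(mask_fp32, diagonal=1)
mask_cast = mask_fp32.to(dtype)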

MilenaCCNlab commented 1 month ago

I am having this issue with llama 3.1 models, I've tried just about everything suggested in the prior comments and nothing fixes the issue. Has anyone managed to resolve this?

joann-alvarez commented 1 month ago

> @chuanbinp Can you increase the temperature and top_p / top_k?
>
> If that works, great; if it doesn't, we may need somebody else's help 🤗

Even if that were to have the effect of avoiding the bug:

joann-alvarez commented 1 month ago

@wukaixingxp > I am having this issue with llama 3.1 models, I've tried just about everything suggested in the prior comments and nothing fixes the issue. Has anyone managed to resolve this?

changing to bfloat16 fixed it for me

wukaixingxp commented 1 month ago

@joann-alvarez I am glad that you found a solution. What did you do? Did you just change model.half() to model.bfloat16()?

joann-alvarez commented 1 month ago

> @joann-alvarez I am glad that you found a solution. What did you do? Did you just change model.half() to model.bfloat16()?

@wukaixingxp Since I am using quantization, I changed this within the configuration for BitsAndBytesConfig:

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )

The change is from bnb_4bit_compute_dtype=torch.float16 to bnb_4bit_compute_dtype=torch.bfloat16.
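
A minimal sketch of how a config like this is typically passed when loading the model; the model id is a placeholder, not necessarily the one used here:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # was torch.float16 before the change
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)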

netsafe commented 1 month ago

Looks like I solved it! Here is my text-generation pipe. After a transformers update, and after trying Llama-3.2, even previously working Llama-3 inference runs started to fail. I have not had a chance to test on 3.1 because I don't have it.

outputs = pipe(
    messages,
    max_new_tokens=1536,
    do_sample=True,
    temperature=0.2,
    top_p=0.55,
    top_k=30,
)

my CUDA configuration is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      Off |   00000000:82:00.0 Off |                  Off |
| N/A   29C    P0             43W /  250W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla P40                      Off |   00000000:84:00.0 Off |                  Off |
| N/A   30C    P0             50W /  250W |       0MiB /  24576MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
MilenaCCNlab commented 1 month ago

I tried setting bfloat16, increasing temperature values, and changing the tokenizer specs. Setting do_sample=False removed the error, but it introduced other problems, so it's not really a viable solution.

I tried it literally with the simplest HF example below (to make sure it wasn't due to my inputs to the model) and I still get this error.

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
netsafe commented 1 month ago

> I tried setting bfloat16, increasing temperature values, and changing the tokenizer specs. Setting do_sample=False removed the error, but it introduced other problems, so it's not really a viable solution.
>
> I tried it literally with the simplest HF example below (to make sure it wasn't due to my inputs to the model) and I still get this error.

import transformers
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

This code works fine for Llama v3 and, with the pipeline bfloat16 difference, for v3.2 - try adding the parameters to the outputs = pipe(...) call as I wrote. My OS is Debian 11 stable; CUDA and the NVIDIA driver are installed NOT from the apt repository but from the binary run files downloaded from nvidia.com. In an empty venv I have installed this:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip3 install transformers
pip3 install nltk
pip3 install 'accelerate>=0.26.0'

and the Python version is:

(.venv) ss666@ai:~/trans$ python3 -V
Python 3.11.2

nltk is optional here - I need it for my own code. The working pirate example from the documentation is:

import torch
from transformers import pipeline

model_id = "meta-llama/Llama-3.2-3B-Instruct"
#model_id = "meta-llama/Llama-3-8B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
#    model_kwargs={"torch_dtype": torch.bfloat16},
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipe(
    messages,
    max_new_tokens=1536,
    do_sample=True,
    temperature=0.2,
    top_p=0.55,
    top_k=30,
)

print(outputs[0]["generated_text"][-1]["content"])
sswam commented 1 month ago

I'm using this hack for the moment, with Llama 3.1 8B. It's better than crashing. I tried a retry loop first, but it always seems to fail again after it has failed once, and that makes things slow.

diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 35ca292d9..0714d59e1 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -3166,7 +3166,10 @@ class GenerationMixin:
             if do_sample:
                 probs = nn.functional.softmax(next_token_scores, dim=-1)
                 # TODO (joao): this OP throws "skipping cudagraphs due to ['incompatible ops']", find solution
-                next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
+                try:
+                    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
+                except RuntimeError as e:
+                    next_tokens = torch.argmax(next_token_scores, dim=-1)
             else:
                 next_tokens = torch.argmax(next_token_scores, dim=-1)
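
In the same spirit as the patch above, here is a standalone sketch of the check it effectively performs, using a hypothetical batch-by-vocab logits tensor rather than the real next_token_scores:

import torch

# Hypothetical logits standing in for the model's next-token scores.
logits = torch.randn(2, 32000, dtype=torch.float16)

probs = torch.nn.functional.softmax(logits.float(), dim=-1)
bad = torch.isnan(probs).any() or torch.isinf(probs).any() or (probs < 0).any()
if bad:
    # Fall back to greedy decoding for this step, as the patch does.
    next_tokens = torch.argmax(logits, dim=-1)
else:
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
print(next_tokens)
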
MilenaCCNlab commented 1 month ago

So I don't know if the Python version is the issue (it doesn't seem like it should be?), BUT I created a new virtual environment with Python 3.11.2 (my previous version was >3.12), did the torch, transformers, etc. pip installations from scratch, and now I don't get the error anymore.

Note that just creating a new environment with the same Python version is unlikely to be the explanation, because I did try that before with Python 3.12 and still kept getting the same error.

So if anyone is having this issue and is out of ideas for debugging it: maybe try a different Python version; it might solve the issue for you.

Cypress98765 commented 1 month ago

Hell yea

artkpv commented 3 weeks ago

Same problem here. The only thing that has helped so far is do_sample=False. model = model.bfloat16() did not help.

I use:

accelerate==1.0.1
torch==2.5.0
torchaudio==2.5.0
torchvision==0.20.0
Python 3.10.13
nvidia-cublas-cu12==12.4.5.8

I tried this on two GPU environments, with CUDA 12.1 and 12.5.

one:

    root@C.13215517:~$ nvidia-smi
    Fri Oct 18 08:15:41 2024
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.1     |

two:


root@C.13217001:~$ nvidia-smi
Fri Oct 18 08:33:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04              Driver Version: 555.52.04      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  |   00000000:05:00.0 Off |                  Off |
| 30%   39C    P8             26W /  300W |       2MiB /  49140MiB |      0%      Default |
> pip freeze

accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
async-timeout==4.0.3
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
datasets==3.0.1
dill==0.3.8
distro==1.9.0
docker-pycreds==0.4.0
docstring_parser==0.16
exceptiongroup==1.2.2
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
h11==0.14.0
httpcore==1.0.6
httpx==0.27.2
huggingface-hub==0.25.2
idna==3.10
iniconfig==2.0.0
jaxtyping==0.2.34
Jinja2==3.1.4
jiter==0.6.1
jiwer==3.0.4
markdown-it-py==3.0.0
MarkupSafe==3.0.1
mdurl==0.1.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.1
numpy==2.1.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.52.0
packaging==24.1
pandas==2.2.3
parse==1.20.2
peft==0.13.2
pillow==11.0.0
platformdirs==4.3.6
pluggy==1.5.0
propcache==0.2.0
protobuf==5.28.2
psutil==6.1.0
pyarrow==17.0.0
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pytest==8.3.3
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
RapidFuzz==3.10.0
regex==2024.9.11
requests==2.32.3
rich==13.9.2
safetensors==0.4.5
sentencepiece==0.2.0
sentry-sdk==2.17.0
setproctitle==1.3.3
shtab==1.7.1
simple-parsing==0.1.6
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sympy==1.13.1
tokenizers==0.20.1
tomli==2.0.2
torch==2.5.0
torchaudio==2.5.0
torchvision==0.20.0
tqdm==4.66.5
transformers==4.45.2
triton==3.1.0
-e git+https://github.com/huggingface/trl.git@41fe228654005b721240f32716cedf2c1d03f6e1#egg=trl&subdirectory=../../../lib/trl
typeguard==2.13.3
typing_extensions==4.12.2
tyro==0.8.12
tzdata==2024.2
urllib3==2.2.3
wandb==0.18.5
xxhash==3.5.0
yarl==1.15.4
artkpv commented 3 weeks ago

> torch==2.5.0

Continuing the post above: I solved this by downgrading to torch 2.4 (in requirements.txt I pinned torch to 2.4) and it works as before (no "probability tensor contains either `inf`, `nan` or element < 0" error). They released torch 2.5.0 yesterday, 17 Oct 2024.