Liyan06 opened 1 year ago
Here's my previous code; when it runs, the following error was reported:
RuntimeError: probability tensor contains either inf, nan or element < 0
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...
I modified `model.half()` in it to `model.bfloat16()`, and the error was solved. I'm guessing that there are some problems with llama2's weights at FP16...
It works! But I'm still curious about the cause of this error and why this change fixes it.
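For reference, a minimal sketch of that workaround (the weights path is a placeholder): load the model in bfloat16 directly instead of casting it to fp16 with half().

import torch
from transformers import LlamaForCausalLM

PATH_TO_CONVERTED_WEIGHTS = "path/to/converted/llama2/weights"  # placeholder

# Load in bfloat16 instead of calling model.half(); bf16 keeps the fp32 exponent range,
# so activations that would overflow to inf in fp16 stay finite and softmax stays valid.
model = LlamaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS,
    torch_dtype=torch.bfloat16,
).cuda()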
I find that if I set num_beams > 1, both llama and llama2 suffer from the mentioned error.
same for me (working with llama3.1 and llava), weird
The code and weights from transformers have bugs for fp16 and bf16. You can try using bf16 and modifying the code below:
def _prepare_4d_causal_attention_mask_with_cache_position(
attention_mask: torch.Tensor,
sequence_length: int,
target_length: int,
dtype: torch.dtype,
device: torch.device,
min_dtype: float,
cache_position: torch.Tensor,
batch_size: int,
):
"""
Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
`(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
Args:
attention_mask (`torch.Tensor`):
A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
sequence_length (`int`):
The sequence length being processed.
target_length (`int`):
The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
dtype (`torch.dtype`):
The dtype to use for the 4D attention mask.
device (`torch.device`):
The device to place the 4D attention mask on.
min_dtype (`float`):
The minimum value representable with the dtype `dtype`.
cache_position (`torch.Tensor`):
Indices depicting the position of the input sequence tokens in the sequence.
batch_size (`int`):
Batch size.
"""
if attention_mask is not None and attention_mask.dim() == 4:
# In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
causal_mask = attention_mask
else:
causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
if sequence_length != 1:
causal_mask = torch.triu(causal_mask, diagonal=1)
causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
padding_mask = padding_mask == 0
causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
padding_mask, min_dtype
)
return causal_mask
to
def _prepare_4d_causal_attention_mask_with_cache_position(
attention_mask: torch.Tensor,
sequence_length: int,
target_length: int,
dtype: torch.dtype,
device: torch.device,
min_dtype: float,
cache_position: torch.Tensor,
batch_size: int,
):
"""
Creates a causal 4D mask of shape `(batch_size, 1, query_length, key_value_length)` from a 2D mask of shape
`(batch_size, key_value_length)`, or if the input `attention_mask` is already 4D, do nothing.
Args:
attention_mask (`torch.Tensor`):
A 2D attention mask of shape `(batch_size, key_value_length)` or a 4D attention mask of shape `(batch_size, 1, query_length, key_value_length)`.
sequence_length (`int`):
The sequence length being processed.
target_length (`int`):
The target length: when generating with static cache, the mask should be as long as the static cache, to account for the 0 padding, the part of the cache that is not filled yet.
dtype (`torch.dtype`):
The dtype to use for the 4D attention mask.
device (`torch.device`):
The device to place the 4D attention mask on.
min_dtype (`float`):
The minimum value representable with the dtype `dtype`.
cache_position (`torch.Tensor`):
Indices depicting the position of the input sequence tokens in the sequence.
batch_size (`int`):
Batch size.
"""
if attention_mask is not None and attention_mask.dim() == 4:
# In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
causal_mask = attention_mask
else:
causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, device=device)
if sequence_length != 1:
causal_mask = torch.triu(causal_mask, diagonal=1)
causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1).to(dtype)
if attention_mask is not None:
causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
padding_mask = padding_mask == 0
causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
padding_mask, min_dtype
)
return causal_mask
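The only functional difference between the two versions is where the cast to dtype happens. A minimal standalone sketch of that design choice (not the transformers code itself): do the mask arithmetic in torch.full's default float32 and cast to the half-precision compute dtype once, at the end.

import torch

dtype = torch.float16
min_dtype = torch.finfo(dtype).min  # -65504.0 for float16

# Build the additive mask in float32 (torch.full's default) and run the masking
# arithmetic there, as the modified function above does...
mask = torch.full((8, 8), fill_value=min_dtype)
mask = torch.triu(mask, diagonal=1)

# ...then cast to the compute dtype exactly once, at the end.
mask = mask.to(dtype)

assert not torch.isnan(mask).any() and not torch.isinf(mask).any()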
I am having this issue with Llama 3.1 models. I've tried just about everything suggested in the prior comments and nothing fixes the issue. Has anyone managed to resolve this?
@chuanbinp Can you try increasing the temperature and top_p / top_k?
If that works, great; if it doesn't, we may need somebody else's help 🤗
Even if that were to have the effect of avoiding the bug, it would only be working around it rather than fixing the underlying cause.
@wukaixingxp
> I am having this issue with llama 3.1 models, I've tried just about everything suggested in the prior comments and nothing fixes the issue. Has anyone managed to resolve this?
changing to bfloat16 fixed it for me
@joann-alvarez I am glad that you found a solution. What did you do? Did you just change `model.half()` to `model.bfloat16()`?
@wukaixingxp Since I am using quantization, I changed this within the configuration for BitsAndBytesConfig:
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
The change is from `bnb_4bit_compute_dtype=torch.float16` to `bnb_4bit_compute_dtype=torch.bfloat16`.
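For context, a minimal sketch of how such a config is usually passed to from_pretrained; the model id below is just an example, not from the original post.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # was torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model id
    quantization_config=bnb_config,
    device_map="auto",
)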
Looks like I solved it! Here is my text-generation pipe. After a transformers update and trying Llama-3.2, even previously working Llama-3 inference runs started to fail. I have not had a chance to test on 3.1 because I don't have it.
outputs = pipe(
messages,
max_new_tokens=1536,
do_sample=True,
temperature=0.2,
top_p=0.55,
top_k=30,
)
My CUDA configuration is:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla P40 Off | 00000000:82:00.0 Off | Off |
| N/A 29C P0 43W / 250W | 0MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla P40 Off | 00000000:84:00.0 Off | Off |
| N/A 30C P0 50W / 250W | 0MiB / 24576MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
I tried setting bfloat16, increasing temperature values, and changing the tokenizer specs. Setting do_sample=False removed the error, but it introduced other problems, so it's not really a viable solution.
I tried it literally with the simplest HF example below (to make sure it wasn't due to my inputs to the model) and I still get this error.
import transformers
import torch
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
model_kwargs={"torch_dtype": torch.bfloat16},
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
{"role": "user", "content": "Who are you?"},
]
outputs = pipeline(
messages,
max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
This code works fine for me with Llama v3 and - with the pipeline bfloat16 difference - with v3.2; try adding the output parameters as I wrote them. My OS is Debian 11 stable; CUDA and the nvidia driver are installed NOT from the apt repository but from the binary run-files downloaded from nvidia.com. In an empty venv I have installed this:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip3 install transformers
pip3 install nltk
pip3 install 'accelerate>=0.26.0'
and the python version is:
(.venv) ss666@ai:~/trans$ python3 -V
Python 3.11.2
nltk is optional here - I need it for my code. The working pirate code from the documentation is:
import torch
from transformers import pipeline
model_id = "meta-llama/Llama-3.2-3B-Instruct"
#model_id = "meta-llama/Llama-3-8B-Instruct"
pipe = pipeline(
"text-generation",
model=model_id,
# model_kwargs={"torch_dtype": torch.bfloat16},
torch_dtype=torch.bfloat16,
device_map="auto",
)
messages = [
{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
{"role": "user", "content": "Who are you?"},
]
outputs = pipe(
messages,
max_new_tokens=1536,
do_sample=True,
temperature=0.2,
top_p=0.55,
top_k=30,
)
print(outputs[0]["generated_text"][-1]["content"])
I'm using this hack for the moment, with Llama 3.1 8B; it's better than crashing. I tried a retry loop first, but it seems to always fail again after it has failed once, and that makes things slow.
diff --git a/src/transformers/generation/utils.py b/src/transformers/generation/utils.py
index 35ca292d9..0714d59e1 100644
--- a/src/transformers/generation/utils.py
+++ b/src/transformers/generation/utils.py
@@ -3166,7 +3166,10 @@ class GenerationMixin:
if do_sample:
probs = nn.functional.softmax(next_token_scores, dim=-1)
# TODO (joao): this OP throws "skipping cudagraphs due to ['incompatible ops']", find solution
- next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
+ try:
+ next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
+ except RuntimeError as e:
+ next_tokens = torch.argmax(next_token_scores, dim=-1)
else:
next_tokens = torch.argmax(next_token_scores, dim=-1)
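A user-level alternative to patching generation/utils.py, sketched below assuming a reasonably recent transformers version: generate() accepts a remove_invalid_values flag that replaces nan/inf scores before sampling (at some cost in speed). The model id and prompt are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=64,
    remove_invalid_values=True,  # scrub nan/inf scores so torch.multinomial doesn't crash
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))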
So I don't know if the python version is the issue (it doesn't seem like it should be?), BUT I created a new virtual environment with python version 3.11.2 (my previous version was >3.12) and did the torch, transformers, etc. pip installations from scratch, and now I don't get the error anymore.
Note that just creating a new environment with the same python version is an unlikely explanation because I did try that before with python 3.12 and still continued getting the same error.
So if anyone is having this issue and is out of ideas for debugging it: maybe try a different python version, it might solve the issue for you.
Hell yea
Same problem here. The only thing that has helped so far is `do_sample=False`. `model = model.bfloat16()` did not help.
I use:
accelerate==1.0.1
torch==2.5.0
torchaudio==2.5.0
torchvision==0.20.0
Python 3.10.13
nvidia-cublas-cu12==12.4.5.8
I tried it in two environments, with CUDA 12.1 and CUDA 12.5.
one:
root@C.13215517:~$ nvidia-smi
Fri Oct 18 08:15:41 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.1 |
two:
root@C.13217001:~$ nvidia-smi
Fri Oct 18 08:33:36 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.04 Driver Version: 555.52.04 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX 6000 Ada Gene... On | 00000000:05:00.0 Off | Off |
| 30% 39C P8 26W / 300W | 2MiB / 49140MiB | 0% Default |
> pip freeze
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
annotated-types==0.7.0
anyio==4.6.2.post1
async-timeout==4.0.3
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.4.0
click==8.1.7
datasets==3.0.1
dill==0.3.8
distro==1.9.0
docker-pycreds==0.4.0
docstring_parser==0.16
exceptiongroup==1.2.2
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
h11==0.14.0
httpcore==1.0.6
httpx==0.27.2
huggingface-hub==0.25.2
idna==3.10
iniconfig==2.0.0
jaxtyping==0.2.34
Jinja2==3.1.4
jiter==0.6.1
jiwer==3.0.4
markdown-it-py==3.0.0
MarkupSafe==3.0.1
mdurl==0.1.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.1
numpy==2.1.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
openai==1.52.0
packaging==24.1
pandas==2.2.3
parse==1.20.2
peft==0.13.2
pillow==11.0.0
platformdirs==4.3.6
pluggy==1.5.0
propcache==0.2.0
protobuf==5.28.2
psutil==6.1.0
pyarrow==17.0.0
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pytest==8.3.3
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
RapidFuzz==3.10.0
regex==2024.9.11
requests==2.32.3
rich==13.9.2
safetensors==0.4.5
sentencepiece==0.2.0
sentry-sdk==2.17.0
setproctitle==1.3.3
shtab==1.7.1
simple-parsing==0.1.6
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
sympy==1.13.1
tokenizers==0.20.1
tomli==2.0.2
torch==2.5.0
torchaudio==2.5.0
torchvision==0.20.0
tqdm==4.66.5
transformers==4.45.2
triton==3.1.0
-e git+https://github.com/huggingface/trl.git@41fe228654005b721240f32716cedf2c1d03f6e1#egg=trl&subdirectory=../../../lib/trl
typeguard==2.13.3
typing_extensions==4.12.2
tyro==0.8.12
tzdata==2024.2
urllib3==2.2.3
wandb==0.18.5
xxhash==3.5.0
yarl==1.15.4
torch==2.5.0
Continuing the above post: I solved this by downgrading to torch 2.4 (in requirements.txt I pinned torch to 2.4), and it works as before (no `probability tensor contains either inf, nan or element < 0` error). They released torch 2.5.0 yesterday, 17 Oct 2024.
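One way to apply that downgrade (the exact 2.4.x pins below are just an example, matching the cu124 index used earlier in the thread):

pip3 install 'torch==2.4.1' 'torchvision==0.19.1' 'torchaudio==2.4.1' --index-url https://download.pytorch.org/whl/cu124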
RuntimeError: probability tensor contains either inf, nan or element < 0

I got this error while doing inference for text generation, in particular when the batch size is greater than 1. I did not get this error, and generation works correctly, when the batch size is set to 1.
Does anyone see the same issue?