Liyan06 opened this issue 1 year ago
I'm encountering the same problem. The same error is reported when I run batch inference on the wmt22 test set with a model trained on top of llama2.
same problem
I have the same question!
`model.bfloat16()` can solve this problem.
Can you please explain it in detail?
Here's my previous code; when it ran, this error was reported:

```
RuntimeError: probability tensor contains either inf, nan or element < 0
```

```python
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()
# here is the code for batch inference
# ...
```

I changed `model.half()` to `model.bfloat16()` and the error was solved. I'm guessing there are some problems with llama2's weights at FP16...
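For reference, a minimal sketch of loading the weights directly in bfloat16 rather than casting afterwards; `PATH_TO_CONVERTED_WEIGHTS` is the same placeholder as above, and `torch_dtype` is the standard `from_pretrained` argument:

```python
import torch
from transformers import LlamaForCausalLM

# Load directly in bfloat16 instead of .half()-casting the checkpoint afterwards.
model = LlamaForCausalLM.from_pretrained(
    PATH_TO_CONVERTED_WEIGHTS,  # placeholder path, as in the snippet above
    torch_dtype=torch.bfloat16,
).cuda()
model.eval()
```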
I solved the problem by setting `pad_token` to `"[PAD]"` and `padding_side` to `"left"`, as suggested here.

```python
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
```
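In case it helps, a minimal batched-generation sketch with those two settings applied; the model name and prompts are just illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint, not from this thread
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

tokenizer.pad_token = "[PAD]"    # note: may resolve to the unk token id, see below
tokenizer.padding_side = "left"  # decoder-only models should be left-padded

prompts = ["Hello, how are you?", "Write one sentence about the sea."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

# Passing the attention mask lets generate() ignore the padded positions.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```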
Works with `torch.float16`.
I've been able to hit this issue with most of the suggested solutions: using bfloat16, using `unk_token` instead of `eos_token` as the `pad_token` (`"[PAD]"` is not in the vocab, so `tokenizer.pad_token = "[PAD]"` actually sets `tokenizer.pad_token_id` to `tokenizer.unk_token_id`). Some of these work okay for smaller batches and/or if the number of padding tokens added is limited (i.e., the difference of lengths in the batch is comparably small), but with enough length discrepancy or a sufficiently large batch size, the error still occurs.

So far, the only thing that seems to work reliably is left-padding with the `bos_token`.
> So far, the only thing that seems to work reliably is left-padding with the `bos_token`.

Works for me.
Unfortunately I have to report that even when using the `bos_token` for left-padding, the error sometimes occurs. In my case, this happened with an inference batch size of 8 and only after successfully generating 128+ responses (16+ batches). I didn't log more detailed statistics, unfortunately, but I suspect that some combination of the length difference between the longest and shortest prompt, as well as the absolute length of the longest prompt, might be at play.

The stopgap solution in my code is to catch the error and then switch to only batching together prompts with the same token length (prompts are already sorted by token length to minimize the number of padding tokens necessary). However, that means a considerable drop in responses/second once this error occurs.
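A rough sketch of that stopgap, grouping prompts that tokenize to the same length so batches need no padding at all (the function name and batch size are made up for illustration):

```python
from collections import defaultdict

def batches_by_token_length(prompts, tokenizer, max_batch_size=8):
    """Yield batches of prompts with identical token length, so no padding is needed."""
    groups = defaultdict(list)
    for prompt in prompts:
        n_tokens = len(tokenizer(prompt)["input_ids"])
        groups[n_tokens].append(prompt)
    for same_length in groups.values():
        for i in range(0, len(same_length), max_batch_size):
            yield same_length[i:i + max_batch_size]
```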
Well, try increasing the `temperature` value. I had a very low temperature along with other parameters such as `top_k` and `top_p`, which made the next-token distribution too steep; by beam search's logic you need multiple candidate tokens available, and at such a low temperature I couldn't get them (because we know how temperature works, right?). So I increased the temperature and it worked.

Try increasing the temperature value and it should just work, if there is no other complexity involved.
🌡 Have you tried increasing the temperature?
> Well, try increasing the `temperature` value. [...] So I increased the temperature and it worked.

Increasing the temperature to >0.5 works in my case.
> I changed `model.half()` to `model.bfloat16()` and the error was solved.

This works for me too.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# <Tried 3 different methods here>

sentence = "Hello, how are you?"
inputs = tokenizer(sentence, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output)}")
```

Tried these 3 methods as suggested above, but to no avail.

Method 1:

```python
model = model.bfloat16()
```

Method 2:

```python
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"
```

Method 3:

```python
tokenizer.pad_token = tokenizer.bos_token
tokenizer.padding_side = "left"
```

Anyone got an idea how to solve this issue?
> outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)

@chuanbinp `num_return_sequences` doesn't work if you haven't set `do_sample=True`. Try using that first and see what happens.
Still getting the same error:

```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[13], line 4
1 sentence = "Hello, how are you?"
3 inputs = tokenizer(sentence, return_tensors="pt", padding=True)
----> 4 outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7, do_sample=True)
5 for i, output in enumerate(outputs):
6 print(f"{i}: {tokenizer.decode(output)}")
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:1652, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1644 input_ids, model_kwargs = self._expand_inputs_for_generation(
1645 input_ids=input_ids,
1646 expand_size=generation_config.num_return_sequences,
1647 is_encoder_decoder=self.config.is_encoder_decoder,
1648 **model_kwargs,
1649 )
1651 # 13. run sample
-> 1652 return self.sample(
1653 input_ids,
1654 logits_processor=logits_processor,
1655 logits_warper=logits_warper,
1656 stopping_criteria=stopping_criteria,
1657 pad_token_id=generation_config.pad_token_id,
1658 eos_token_id=generation_config.eos_token_id,
1659 output_scores=generation_config.output_scores,
1660 return_dict_in_generate=generation_config.return_dict_in_generate,
1661 synced_gpus=synced_gpus,
1662 streamer=streamer,
1663 **model_kwargs,
1664 )
1666 elif generation_mode == GenerationMode.BEAM_SEARCH:
1667 # 11. prepare beam search scorer
1668 beam_scorer = BeamSearchScorer(
1669 batch_size=batch_size,
1670 num_beams=generation_config.num_beams,
(...)
1675 max_length=generation_config.max_length,
1676 )
File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:2770, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2768 # sample
2769 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2770 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
2772 # finished sentences should have their next token be a padding token
2773 if eos_token_id is not None:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
@chuanbinp Can you increase the temperature and top_p / k?
If that works, great; if it doesn't, we may need somebody else's help 🤗
> @chuanbinp Can you increase the temperature and top_p / k?

I tried with `temperature=0.99, do_sample=True, top_k=50, top_p=0.95` but am still facing the same error. Can someone help?

Update: Seems like my files were corrupted. Re-downloading the models worked!
If anyone is still having this problem, removing `do_sample=True` fixed it for me. 🤷

Edit: Although it now generates the 0th token after a couple of valid tokens, so the underlying problem wasn't solved. This is happening while using the bitsandbytes library with Mistral 7b.
> `model.bfloat16()` can solve this problem.

This solved my issue: changing from float to bfloat solved it.

I tried playing around with different values of temperature, do_sample=True/False, top_k, top_p, gradient_clip, and reducing the learning rate, but that didn't solve the issue. Finally, changing the model dtype from float to bfloat did.
Can somebody explain what the root cause of this is and why these workarounds work? Thanks a lot!
So, for me changing to `bfloat16` did not fix the issue. I didn't try removing `do_sample=True`, but working with batches of size 1 was the thing that reliably made the error go away.

Without having looked at the internals, the error message and the fact that changing the dtype works for some people but not for others sound like a numerical-instability problem, where an "almost zero" value happens to be slightly negative due to imprecision in the representation. So I wonder if changing the `probs` argument to `torch.multinomial` to something like `torch.maximum(probs, torch.zeros_like(probs))` would help, clamping all probabilities to at least 0. Might be worth a shot; the way I look at it, it can't be worse than crashing. But I'm not sure if there are any implications for performance, and it might also have unpredictable results if the underlying issue is something else.
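For what it's worth, a sketch of that clamping idea as a small wrapper around the sampling step; this is not the actual transformers code, just an illustration of the suggestion, and it will not rescue rows that are entirely NaN:

```python
import torch

def sample_with_clamped_probs(probs: torch.Tensor, num_samples: int = 1) -> torch.Tensor:
    """Clamp slightly negative probabilities to 0, renormalize, then sample."""
    probs = torch.clamp(probs, min=0.0)
    probs = probs / probs.sum(dim=-1, keepdim=True)
    return torch.multinomial(probs, num_samples=num_samples)
```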
mark
I find that if I set `num_beams` > 1, both llama and llama2 suffer from the mentioned error.
> I changed `model.half()` to `model.bfloat16()` and the error was solved. I'm guessing there are some problems with llama2's weights at FP16...

NB! For me, removing `do_sample` fixed it. However, I don't understand the root problem. In my case, I used a `transformers.BitsAndBytesConfig` object to quantize the LLM, which already has `bnb_4bit_compute_dtype=bfloat16`, and I was still encountering the issue.
In my case, for the llama2 model, `print(model.config.pad_token_id, "\n", model.config.eos_token_id)` returns `None` and `32000`, so changing `pad_token=model.config.pad_token_id` to `pad_token=model.config.eos_token_id` in the generation config works for me.
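A small sketch of that workaround, passing the eos token id as the pad token id at generation time; the model id here is a placeholder, and `pad_token_id` is a standard `generate` argument:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

print(model.config.pad_token_id, model.config.eos_token_id)  # the comment above saw None and 32000

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
# Fall back to the eos token when the config has no pad token id.
outputs = model.generate(**inputs, max_new_tokens=50,
                         pad_token_id=model.config.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```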
> `model.bfloat16()` can solve this problem. This solved my issue: changing from float to bfloat solved it. [...] Finally, changing the model dtype from float to bfloat did.

Same. bfloat works well.
Hack to get bs > 1 to work: modify https://github.com/huggingface/transformers/blob/772307be7649e1333a933cfaa229dc0dec2fd331/src/transformers/generation/utils.py#L2650C5-L2650C5

```python
probs = nn.functional.softmax(next_token_scores, dim=-1)
# Replace any row that contains NaNs with a one-hot distribution on the eos token.
nans = torch.isnan(probs)
if nans.any():
    idx = torch.argwhere(torch.sum(nans, 1))
    z = torch.zeros_like(probs[idx][0])
    z[0][2] = 1.
    probs[idx] = z
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
```

This replaces every row containing NaNs with probability 1.0 for the eos token (which has index 2).
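If you'd rather not patch the library source, transformers also has a `remove_invalid_values` generation flag (backed by `InfNanRemoveLogitsProcessor`, if I recall the name correctly) that replaces inf/nan scores before sampling. A hedged sketch, with the model id as a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).cuda()

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
# remove_invalid_values swaps nan/inf logits for safe values so multinomial doesn't crash,
# but it only hides the symptom; NaNs in the logits still point at a deeper problem.
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=50,
                         remove_invalid_values=True)
```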
I figured out mine. It was mainly because NaNs were introduced during fine-tuning. In my case, the quantized model output NaNs in the logits when running DPO.
> If anyone is still having this problem, removing `do_sample=True` fixed it for me.

Without `do_sample=True` it will simply ignore the `temperature`, `top_k` and `top_p` parameters.
> In my case, the quantized model output NaNs in the logits when running DPO.

How do you solve this error, then?
mark here
I got this error when running Llama2-7b-chat inference with nf4 quantization:

```python
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

I solved this by changing `torch.bfloat16` to `torch.float16`.
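For completeness, a sketch of loading with that changed compute dtype; the model id and `device_map` are assumptions on my side:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # changed from torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # assumed model id
    quantization_config=nf4_config,
    device_map="auto",
)
```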
I encountered this problem when I set `num_return_sequences` > 1. When it's 1, it works well.
I got this issue when I was messing around with Grammar Files and forgot I had one loaded up
Only setting `do_sample=False` solved my issue. I don't know the exact reason for this.
Okay, thank you.
I have a question: sometimes I can run the code successfully, but sometimes I can't, even after trying bfloat16 or adjusting the temperature.
Hi. I have encountered this error when running llama3 8b and llama 70b inference on a workstation with multiple GPUs (6*Mi250). However, I have not experienced this problem when using a single GPU (Mi250). I hope this can be of some help.
> For me, removing the do_sample fixed it. However, I don't understand the root problem. In my case, I used a transformers.BitsAndBytesConfig object to quantize the LLM, which already has bnb_4bit_compute_dtype=bfloat16. I was still encountering the issue.

@AnuraktKumar It sounds like you fine-tuned the model with that dtype (bfloat16), but did you also load the model with that BitsAndBytesConfig?
Has anybody figured out the reason behind this phenomenon?
Increase your pagefile.
> Only setting do_sample=False solved my issue.

+1
> I changed `model.half()` to `model.bfloat16()` and the error was solved.

That is good. It solved my problem! Thank you!
I encountered this problem with an 8-bit quantized Llama2-7b-hf model; that fix worked for me with batch_size=4.
> Unfortunately I have to report that even when using the `bos_token` for left-padding, the error sometimes occurs.

Only setting `tokenizer.pad_token = tokenizer.bos_token` solved my issue. I don't know the reason. Thanks everyone for discussing!

(After an hour) The error occurred again. I found that its occurrence depends on the data: it occurred when fed some inputs but disappeared when fed others. I don't understand why.

(After another hour)

> Only setting do_sample=False solved my issue. Don't know the exact reason about this.

Same as you. I don't know why, either.
Because the error is raised at the sampling step; with no sampling, it isn't raised:

```python
if do_sample:
    probs = nn.functional.softmax(next_token_scores, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=n_tokens_to_keep)
    next_token_scores = torch.gather(next_token_scores, -1, next_tokens)
    next_token_scores, _indices = torch.sort(next_token_scores, descending=True, dim=1)
    next_tokens = torch.gather(next_tokens, -1, _indices)
else:
    next_token_scores, next_tokens = torch.topk(
        next_token_scores, n_tokens_to_keep, dim=1, largest=True, sorted=True
    )
```
Mark
As mentioned by others, you can avoid this by simply disabling sampling with `do_sample=False`.

When you enable sampling, it looks at the probabilities of the tokens and does clever things to generate completions based on those probabilities (depending on your generation kwargs). You can imagine that for some values of the logits, if you convert those logit scores to probabilities, you might run into nonsense values, i.e. `inf`, `nan`, etc.

Of course, if you do need sampling, e.g. with PPO training where you want diverse completions, then you need to look carefully at your generation kwargs.
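A tiny standalone sketch of why that matters: a single non-finite logit makes softmax return NaNs for the whole row, which is exactly what `torch.multinomial` rejects:

```python
import torch

logits = torch.tensor([[2.0, 1.0, float("nan")]])
probs = torch.softmax(logits, dim=-1)
print(probs)  # tensor([[nan, nan, nan]]) -- one bad logit poisons the row

try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as err:
    print(err)  # probability tensor contains either `inf`, `nan` or element < 0
```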
> RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

I got this error while doing inference for text generation, in particular when the batch size is greater than 1. I did not get this error, and generation was correct, when the batch size was set to 1.

Does anyone see the same issue?