meta-llama / llama

Inference code for Llama models

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 #380

Open Liyan06 opened 1 year ago

Liyan06 commented 1 year ago
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

inputs = ...
inputs = tokenizer.batch_encode_plus(inputs, return_tensors="pt", padding=True)

model.generate(**inputs, **generate_kwargs)

RuntimeError: probability tensor contains either inf, nan or element < 0

I got this error while doing inference for text generation, specifically when the batch size is greater than 1. I do not get this error, and generation works correctly, when the batch size is set to 1.

Does anyone see the same issue?

YangZyyyy commented 1 year ago

I'm encountering the same problem. The same error is reported when I run batch inference on the WMT22 test set with a model trained on Llama 2.

ycjcl868 commented 1 year ago

same problem

hangzhang-nlp commented 1 year ago

I have the same question!

YangZyyyy commented 1 year ago

model.bfloat16() can solve this problem

Zayd-Jamadar commented 1 year ago

Can you please explain it in detail?

YangZyyyy commented 1 year ago

Here's my previous code; when it runs, this error is reported:

RuntimeError: probability tensor contains either inf, nan or element < 0

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()

# here is the code for batch inference
# ...

I changed model.half() to model.bfloat16() and the error was solved. I'm guessing there are some problems with Llama 2's weights at FP16...
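For reference, the change is roughly this (a sketch; PATH_TO_CONVERTED_WEIGHTS is a placeholder for a local checkpoint path):

from transformers import LlamaForCausalLM

PATH_TO_CONVERTED_WEIGHTS = "path/to/llama-2-checkpoint"  # placeholder

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)

# bfloat16 keeps the FP32 exponent range, so large activations/logits are much
# less likely to overflow to inf than with FP16 (model.half()).
model = model.bfloat16().cuda()

# Equivalently, load the weights directly in bfloat16:
# model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS, torch_dtype=torch.bfloat16).cuda()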

minwhoo commented 1 year ago

I solved the problem by setting pad_token to "[PAD]" and padding_side to "left", as suggested here.

tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"

This works with torch.float16.
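Roughly, the full setup is (a sketch; the model name follows the original post, and the generation kwargs are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
).cuda()

# Decoder-only models should be left-padded so every prompt ends right where
# generation begins. Note (as pointed out below) that "[PAD]" is not in the
# Llama 2 vocab, so this effectively resolves to the unk token id.
tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"

prompts = ["Hello, how are you?", "Write a haiku about the sea."]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))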

misberner commented 1 year ago

I've been able to hit this issue with most of the suggested solutions: using bfloat16, and using unk_token instead of eos_token as the pad_token ("[PAD]" is not in the vocab, so tokenizer.pad_token = "[PAD]" actually sets tokenizer.pad_token_id to tokenizer.unk_token_id). Some of these work okay for smaller batches and/or if the number of padding tokens added is limited (i.e., the difference of lengths in the batch is comparatively small), but with enough length discrepancy or a sufficiently large batch size, the error still occurs.

So far, the only thing that seems to work reliably is left-padding with the bos_token.
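In code, the setup is roughly this (a sketch; prompts and generation kwargs are whatever you already use):

tokenizer.pad_token = tokenizer.bos_token   # pad with <s> rather than an out-of-vocab token
tokenizer.padding_side = "left"             # keep every prompt flush against the generated tokens

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)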

shaoyangxu commented 1 year ago

I've been able to hit this issue with most of the suggested solutions: using bfloat16, and using unk_token instead of eos_token as the pad_token ("[PAD]" is not in the vocab, so tokenizer.pad_token = "[PAD]" actually sets tokenizer.pad_token_id to tokenizer.unk_token_id). Some of these work okay for smaller batches and/or if the number of padding tokens added is limited (i.e., the difference of lengths in the batch is comparatively small), but with enough length discrepancy or a sufficiently large batch size, the error still occurs.

So far, the only thing that seems to work reliably is left-padding with the bos_token.

work for me

misberner commented 1 year ago

Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs. In my case, this happened with an inference batch size of 8 and only after successfully generating 128+ responses (16+ batches). I didn't log more detailed statistics unfortunately, but I suspect that some combination of length difference between longest and shortest prompt as well as absolute length of longest prompt might be at play.

The stopgap solution in my code is to catch the error and then switch to only batching together prompts with the same token length (prompts are already sorted by token length to minimize the number of padding tokens necessary). However, that means a considerable drop in # responses/second once this error occurs.
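Roughly, the stopgap looks like this (a sketch; the grouping helper and names are mine, and it assumes a pad token is already configured):

from collections import defaultdict

def group_by_token_length(prompts, tokenizer):
    # Bucket prompts by tokenized length so each batch needs no padding at all.
    buckets = defaultdict(list)
    for prompt in prompts:
        buckets[len(tokenizer(prompt)["input_ids"])].append(prompt)
    return list(buckets.values())

def generate_with_fallback(model, tokenizer, prompts, **generate_kwargs):
    try:
        inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
        return [model.generate(**inputs, **generate_kwargs)]
    except RuntimeError:
        # Fall back to batching only same-length prompts (slower, but no padding tokens).
        outputs = []
        for batch in group_by_token_length(prompts, tokenizer):
            inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
            outputs.append(model.generate(**inputs, **generate_kwargs))
        return outputs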

AayushSameerShah commented 1 year ago

🌡 Have you tried increasing the temperature?

Well, try increasing the temperature value. I had a very low temperature, along with other parameters such as top_k and top_p, which made the next-token distribution too steep. By the beam search's logic you need multiple candidate tokens available, and at such a low temperature there weren't enough (we know how temperature works, right?).

So I increased the temperature and it worked.

Try increasing the temp value and it should just work, if there are no other complexities involved.
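For example (the values here are illustrative, not a recommendation):

outputs = model.generate(
    **inputs,
    do_sample=True,      # sampling must be on for temperature/top_k/top_p to matter
    temperature=0.7,     # a higher temperature flattens the next-token distribution
    top_k=50,
    top_p=0.95,
    max_new_tokens=64,
)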

Yueeeeeeee commented 1 year ago

🌡 Have you tried increasing the temperature?

Well, try increasing the temperature value. I had a very low temperature, along with other parameters such as top_k and top_p, which made the next-token distribution too steep. By the beam search's logic you need multiple candidate tokens available, and at such a low temperature there weren't enough (we know how temperature works, right?).

So I increased the temperature and it worked.

Try increasing the temp value and it should just work, if there are no other complexities involved.

Increasing the temp value to >0.5 works in my case.

tczbzb commented 1 year ago

Here's my previous code; when it runs, this error is reported:

RuntimeError: probability tensor contains either inf, nan or element < 0

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()

# here is the code for batch inference
# ...

I changed model.half() to model.bfloat16() and the error was solved. I'm guessing there are some problems with Llama 2's weights at FP16...

This works for me too.

chuanbinp commented 1 year ago
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

<Tried 3 different methods here>

sentence = "Hello, how are you?"
inputs = tokenizer(sentence, return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)
for i, output in enumerate(outputs):
    print(f"{i}: {tokenizer.decode(output)}")

Tried these 3 methods as suggested above, but to no avail.

Method 1:

model = model.bfloat16()

Method 2:

tokenizer.pad_token = "[PAD]"
tokenizer.padding_side = "left"

Method 3:

tokenizer.pad_token = tokenizer.bos_token
tokenizer.padding_side = "left"

Anyone got an idea how to solve this issue?

AayushSameerShah commented 1 year ago

outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7)

@chuanbinp num_return_sequences doesn't work if you haven't set do_sample=True. Try using that first and see what happens.
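i.e. something like (sketch):

outputs = model.generate(
    **inputs,
    do_sample=True,            # required for sampling multiple distinct sequences
    num_return_sequences=5,
    temperature=0.7,
    max_length=50,
)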

chuanbinp commented 1 year ago

Still getting the same error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[13], line 4
      1 sentence = "Hello, how are you?"
      3 inputs = tokenizer(sentence, return_tensors="pt", padding=True)
----> 4 outputs = model.generate(**inputs, max_length=50, num_return_sequences=5, temperature=0.7, do_sample=True)
      5 for i, output in enumerate(outputs):
      6     print(f"{i}: {tokenizer.decode(output)}")

File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:1652, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1644     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1645         input_ids=input_ids,
   1646         expand_size=generation_config.num_return_sequences,
   1647         is_encoder_decoder=self.config.is_encoder_decoder,
   1648         **model_kwargs,
   1649     )
   1651     # 13. run sample
-> 1652     return self.sample(
   1653         input_ids,
   1654         logits_processor=logits_processor,
   1655         logits_warper=logits_warper,
   1656         stopping_criteria=stopping_criteria,
   1657         pad_token_id=generation_config.pad_token_id,
   1658         eos_token_id=generation_config.eos_token_id,
   1659         output_scores=generation_config.output_scores,
   1660         return_dict_in_generate=generation_config.return_dict_in_generate,
   1661         synced_gpus=synced_gpus,
   1662         streamer=streamer,
   1663         **model_kwargs,
   1664     )
   1666 elif generation_mode == GenerationMode.BEAM_SEARCH:
   1667     # 11. prepare beam search scorer
   1668     beam_scorer = BeamSearchScorer(
   1669         batch_size=batch_size,
   1670         num_beams=generation_config.num_beams,
   (...)
   1675         max_length=generation_config.max_length,
   1676     )

File ~/mambaforge/envs/chuan-llama/lib/python3.11/site-packages/transformers/generation/utils.py:2770, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2768 # sample
   2769 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2770 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
   2772 # finished sentences should have their next token be a padding token
   2773 if eos_token_id is not None:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

AayushSameerShah commented 1 year ago

@chuanbinp Can you increase the temperature and top_p / k ?

If that works great, if it doesn't, may need somebody else's help 🤗

chuanbinp commented 1 year ago

@chuanbinp Can you increase the temperature and top_p / k ?

If that works great, if it doesn't, may need somebody else's help 🤗

I tried with temperature=0.99, do_sample=True, top_k=50, top_p=0.95 but am still facing the same error. Can someone help?

Update: Seems like my files were corrupted. Re-downloading the models worked!

ubaada commented 1 year ago

If anyone is still having this problem, removing do_sample=True fixed it for me. 🤷

Edit: Although it now generates the 0th token after a couple of valid tokens, so the underlying problem wasn't solved. This is happening while using the bitsandbytes library with Mistral 7B.

gangooteli commented 11 months ago

model.bfloat16() can solve this problem

This solved my issue. Changing from float to bfloat solved it.

I tried playing around with different values of temperature, do_sample=True/False, top_k, top_p, gradient_clip, and reducing the learning rate. But that didn't solve the issue.

Finally, changing the model dtype from float to bfloat solved the issue.

xiaoyaoyang commented 11 months ago

Can somebody explain what the root cause of this is and why these workarounds work? Thanks a lot!

misberner commented 11 months ago

So, for me changing to bfloat16 did not fix the issue. I didn't try removing do_sample=True, but working with batches of size 1 was the thing that reliably made the error go away.

Without having looked at the internals, the error message, and the fact that changing the dtype works for some people but not for others, sound like a numerical instability problem, where an "almost zero" value happens to be slightly negative due to imprecision in the representation. So I wonder if changing the probs argument to torch.multinomial to something like

torch.maximum(probs, torch.zeros_like(probs))

would help, clamping all probabilities to at least 0. Might be worth a shot; the way I look at it, it can't be worse than crashing. But I'm not sure whether there are any implications for performance, and it might also have unpredictable results if the underlying issue is something else.
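As a sketch (next_token_scores is whatever transformers computes just before sampling; torch.clamp(probs, min=0.0) is equivalent to the torch.maximum call above):

probs = nn.functional.softmax(next_token_scores, dim=-1)

# Clamp tiny negative values (floating-point rounding artifacts) up to zero before
# sampling. torch.multinomial normalizes the weights itself, so no rescaling is
# needed. Note this does not help if the row contains inf or nan.
probs = torch.clamp(probs, min=0.0)

next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)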

callanwu commented 11 months ago

mark

Bingo-W commented 10 months ago

I find that if I set num_beams > 1, both LLaMA and Llama 2 suffer from the mentioned error.

yaokunkun commented 10 months ago

Here's my previous code; when it runs, this error is reported:

RuntimeError: probability tensor contains either inf, nan or element < 0

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()

# here is the code for batch inference
# ...

I changed model.half() to model.bfloat16() and the error was solved. I'm guessing there are some problems with Llama 2's weights at FP16...

NB!

AnuraktKumar commented 10 months ago

For me, removing the do_sample fixed it. However, I don't understand the root problem. In my case, I used a transformers.BitsAndBytesConfig object to quantize the LLM, which already has bnb_4bit_compute_dtype=bfloat16. I was still encountering the issue.

sayhellotoAI2 commented 10 months ago

In my case, for the Llama 2 model, print(model.config.pad_token_id, "\n", model.config.eos_token_id) returns None and 32000,

so changing pad_token=model.config.pad_token_id to pad_token=model.config.eos_token_id in the generation config works for me.
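In other words, something along these lines (a sketch; the generation_config attributes follow the standard transformers API):

print(model.config.pad_token_id, model.config.eos_token_id)   # None, 32000 in my setup

# Llama 2 ships without a pad token, so reuse the eos id as the padding id for generation.
model.generation_config.pad_token_id = model.config.eos_token_id

# or pass it per call:
outputs = model.generate(**inputs, pad_token_id=model.config.eos_token_id)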

seanxuu commented 10 months ago

model.bfloat16() can solve this problem

This solved my issue. Changing from float to bfloat solved it.

I tried playing around with different values of temperature, do_sample=True/False, top_k, top_p, gradient_clip, and reducing the learning rate. But that didn't solve the issue.

Finally, changing the model dtype from float to bfloat solved the issue.

Same here. bfloat works well.

jan-grzybek-ampere commented 9 months ago

Hack to get bs > 1 working: modify https://github.com/huggingface/transformers/blob/772307be7649e1333a933cfaa229dc0dec2fd331/src/transformers/generation/utils.py#L2650C5-L2650C5 as follows:

probs = nn.functional.softmax(next_token_scores, dim=-1)

# Find rows whose probabilities contain NaN and replace each of them with a
# one-hot distribution on the eos token (id 2), so sampling can proceed.
nans = torch.isnan(probs)
if nans.any():
    idx = torch.argwhere(torch.sum(nans, 1))
    z = torch.zeros_like(probs[idx][0])
    z[0][2] = 1.
    probs[idx] = z

next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)

This replaces every row containing NaNs with probability 1.0 on the eos token (which has id 2).

kevinz8866 commented 9 months ago

I figured mine out. It was mainly because NaNs were introduced during fine-tuning. In my case, the quantized model output NaNs in the logits when running DPO.

zolastro commented 9 months ago

If anyone is still having this problem, removing do_sample=True fixed it for me. 🤷

Edit: Although it now generates the 0th token after a couple of valid tokens, so the underlying problem wasn't solved. This is happening while using the bitsandbytes library with Mistral 7B.

Without do_sample=True it will simply ignore the temperature, top_k and top_p parameters.

complete-dope commented 9 months ago

I figured mine out. It was mainly because NaNs were introduced during fine-tuning. In my case, the quantized model output NaNs in the logits when running DPO.

How do you solve this error then?

RanchiZhao commented 8 months ago

mark here

dragondog129 commented 8 months ago

I got this error when running Llama2-7b-chat inference with nf4_config quantization:

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

I solved this by changing torch.bfloat16 to torch.float16.
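For context, that config is typically passed to the model load roughly like this (a sketch; the model name is just an example):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=nf4_config,   # the BitsAndBytesConfig above, with bnb_4bit_compute_dtype=torch.float16
    device_map="auto",
)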

HaritzPuerto commented 7 months ago

I encountered this problem when I set num_return_sequences > 1. When it's 1, it works well

swizzcheeze commented 6 months ago

I got this issue when I was messing around with Grammar Files and forgot I had one loaded up

ANYMS-A commented 6 months ago

Only setting do_sample=False solved my issue. I don't know the exact reason for this.

swizzcheeze commented 6 months ago

Okay, thank you

Gannn12138 commented 6 months ago

I have a question: sometimes I can run the code successfully, but sometimes I can't, even after trying bfloat or adjusting the temp.

YuWei-CH commented 5 months ago

Hi. I have encountered this error when running Llama 3 8B and Llama 70B inference on a workstation with multiple GPUs (6x MI250). However, I have not experienced this problem when using a single GPU (MI250). I hope this can be of some help.

joann-alvarez commented 4 months ago

For me, removing the do_sample fixed it. However, I don't understand the root problem. In my case, I used a transformers.BitsAndBytesConfig object to quantize the LLM, which already has bnb_4bit_compute_dtype=bfloat16. I was still encountering the issue.

@AnuraktKumar It sounds like you finetuned the model with that dtype (bfloat16), but did you also load the model with that BitsAndBytesConfig?

patrick-tssn commented 3 months ago

Does anybody figure out the reason behind this phenomenon?

swizzcheeze commented 3 months ago

Increase your pagefile.

RucLee commented 3 months ago

Only setting do_sample=False solved my issue. I don't know the exact reason for this.

+1

xlnn commented 3 months ago

Here's my previous code; when it runs, this error is reported:

RuntimeError: probability tensor contains either inf, nan or element < 0

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
model = model.half().cuda()

# here is the code for batch inference
# ...

I changed model.half() to model.bfloat16() and the error was solved. I'm guessing there are some problems with Llama 2's weights at FP16...

That is good. It solved my problem! Thank you!

LuckMonkeys commented 2 months ago

I encountered this problem with an 8-bit quantized Llama2-7b-hf model; the approach quoted below fixed it for me at batch_size=4.

Unfortunately I have to report that even when using the bos_token for left-padding, the error sometimes occurs. In my case, this happened with an inference batch size of 8 and only after successfully generating 128+ responses (16+ batches). I didn't log more detailed statistics unfortunately, but I suspect that some combination of length difference between longest and shortest prompt as well as absolute length of longest prompt might be at play.

The stopgap solution in my code is to catch the error and then switch to only batching together prompts with the same token length (prompts are already sorted by token length to minimize the number of padding tokens necessary). However, that means a considerable drop in # responses/second once this error occurs.

colourfulspring commented 2 months ago

Only setting tokenizer.pad_token = tokenizer.bos_token solved my issue. I don't know the reason. Thanks everyone for discussing!


(After an hour) The error occurred again. I found that its occurrence depends on the data: it occurs with some inputs but disappears with others. I don't understand why.


(After another hour)

Only setting do_sample=False solved my issue. I don't know the exact reason for this.

Same as you. I don't know why, either.

practicingman commented 2 months ago

Because the error is raised in the sampling branch; with no sampling, there is nothing to raise:

            if do_sample:
                probs = nn.functional.softmax(next_token_scores, dim=-1)
                next_tokens = torch.multinomial(probs, num_samples=n_tokens_to_keep)
                next_token_scores = torch.gather(next_token_scores, -1, next_tokens)
                next_token_scores, _indices = torch.sort(next_token_scores, descending=True, dim=1)
                next_tokens = torch.gather(next_tokens, -1, _indices)
            else:
                next_token_scores, next_tokens = torch.topk(
                    next_token_scores, n_tokens_to_keep, dim=1, largest=True, sorted=True
                )
Davido111200 commented 2 months ago

Mark

cyber-chris commented 2 months ago

As mentioned by others, you can avoid this by simply disabling sampling with do_sample=False.

When you enable sampling, it looks at the probabilities of the tokens and does clever things to generate completions based on those probabilities (depending on your generation kwargs). You can imagine that for some values of logits, if you convert those logit scores to probabilities, you might run into nonsense values, i.e. inf, nan, etc.
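A tiny illustration of how bad logits become an invalid probability tensor (pure PyTorch, no model needed):

import torch

logits = torch.tensor([[1.0, 2.0, float("inf")],   # an overflowed logit
                       [float("nan"), 0.0, 0.0]])  # a NaN from upstream

probs = torch.softmax(logits, dim=-1)
print(probs)   # both rows now contain nan

# This is the same call that fails inside transformers' sampling loop; it raises
# a RuntimeError about an invalid probability distribution (on CUDA, the exact
# message from this issue).
torch.multinomial(probs, num_samples=1)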

Of course, if you do need sampling, e.g. maybe with PPO training where you want diverse completions, then you need to carefully look at your generation kwargs.