oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Llama-3 70b exl2 NaNs during generation. exllama_HF #5892

Closed Ph0rk0z closed 5 months ago

Ph0rk0z commented 5 months ago

Describe the bug

At first I downloaded https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2. Any kind of sampling within textgen caused a NaN error, especially more advanced samplers like min_p. Adding the pad token to the config makes it generate within the UI most of the time.
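
(For reference, a sketch of that pad-token workaround as a config.json edit. The path and the choice of 128001 — Llama-3's eos_token_id per the GENERATE_PARAMS log further down — are assumptions, not the confirmed fix:)

import json

# Hypothetical sketch of the "add the pad token to the config" workaround.
# Both the model path and the pad_token_id value are assumptions here.
path = "models/Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2/config.json"
with open(path) as f:
    config = json.load(f)
config.setdefault("pad_token_id", 128001)
with open(path, "w") as f:
    json.dump(config, f, indent=2)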

Next I downloaded https://huggingface.co/turboderp/Llama-3-70B-Instruct-exl2/tree/5.0bpw and this one works within the webui but still causes NaN over the API.

I have tried different environments and the git version of transformers, as well as torch 2.2.2 and 2.2.1, with and without streaming. The 8B model works fine, and so do other models. Turning off do_sample allows it to generate. I hope I'm not the only one with this issue.

Is there an existing issue for this?

Reproduction

Load the LoneStriker quant and try to generate within the webui. Load the turboderp quant and generate over the OpenAI API.

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/home/supermicro/ai/text-generation-webui-testing/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/ai/text-generation-webui-testing/modules/text_generation.py", line 382, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/transformers/generation/utils.py", line 2829, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 0.56 seconds (0.00 tokens/s, 0 tokens, context 701, seed 1652796974)
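
(The failing call is easy to reproduce in isolation; a minimal sketch, independent of the webui, showing that torch.multinomial rejects a NaN probability tensor:)

import torch

# Minimal reproduction of the failure mode in the traceback above:
# torch.multinomial validates the probability tensor and raises once any
# entry is inf/nan. The exact message differs between CPU and CUDA builds;
# on CUDA it is the "probability tensor contains either `inf`, `nan` or
# element < 0" seen here. Greedy decoding (do_sample=False) never reaches
# this call, which matches the report that disabling sampling works.
probs = torch.tensor([[0.5, float("nan"), 0.5]])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(e)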

System Info

Ubuntu. Both CUDA 12 and 11.8.

Yiximail commented 5 months ago

Can you post your request parameters? The first model didn't have this problem for me; I used random settings in my test.

Python 3.10.12

Name: torch Version: 2.2.1+cu121

Name: transformers Version: 4.40.0

Name: exllamav2 Version: 0.0.18+cu121


/v1/chat/completions

{
    "max_tokens": 256,
    "temperature": 1.05,
    "top_p": 0.8,
    "min_p": 0.2,
    "top_k": 10,
    "typical_p": 0.8,
    "top_a": 10,
    "repetition_penalty": 1.15,
    "repetition_penalty_range": 0,
    "min_length": 0,
    "stop": [],
    "stream": false,
    "messages": [
        {
            "role": "system",
            "content": ""
        },
        {
            "role": "user",
            "content": "Can chickens fly? answer me in short."
        }
    ],
    "continue_": false
}
{
    "id": "chatcmpl-1713628329170337280",
    "object": "chat.completions",
    "created": 1713628329,
    "model": "Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "Yes, but not very well! Chickens can lift off the ground and glide for short distances (up to 10-15 feet), but they are not capable of sustained flight like other birds."
            }
        }
    ],
    "usage": {
        "prompt_tokens": 19,
        "completion_tokens": 40,
        "total_tokens": 59
    }
}

/v1/completions

{
    "max_tokens": 256,
    "temperature": 1.05,
    "top_p": 0.8,
    "min_p": 0.2,
    "top_k": 10,
    "typical_p": 0.8,
    "top_a": 10,
    "repetition_penalty": 1.15,
    "repetition_penalty_range": 0,
    "min_length": 0,
    "stop": [],
    "stream": false,
    "prompt": "Can chickens fly? answer me in short."
}
{
    "id": "conv-1713628396868506880",
    "object": "text_completion",
    "created": 1713628396,
    "model": "Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2",
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "text": "**\n\n\n\nNo,",
            "logprobs": {
                "top_logprobs": [
                    {}
                ]
            }
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 4,
        "total_tokens": 13
    }
}
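
(For anyone reproducing this, a sketch of replaying the /v1/completions payload above with Python requests; the host and port assume the webui's default OpenAI-compatible endpoint at 127.0.0.1:5000:)

import requests

# Replay of the /v1/completions request above (payload abridged).
# http://127.0.0.1:5000 is the webui's default API address and is an
# assumption here; adjust to your --api settings.
payload = {
    "max_tokens": 256,
    "temperature": 1.05,
    "top_p": 0.8,
    "min_p": 0.2,
    "top_k": 10,
    "typical_p": 0.8,
    "top_a": 10,
    "repetition_penalty": 1.15,
    "stream": False,
    "prompt": "Can chickens fly? answer me in short.",
}
r = requests.post("http://127.0.0.1:5000/v1/completions", json=payload)
print(r.json()["choices"][0]["text"])
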
Ph0rk0z commented 5 months ago

I'm not sure how to log the actual requests on the SillyTavern side, but here is verbose mode and what textgen received:

12:53:03-320299 INFO     GENERATE_PARAMS=                                                                                                                     
{   'max_new_tokens': 500,
    'temperature': 1.0,
    'temperature_last': False,
    'dynamic_temperature': False,
    'dynatemp_low': 1,
    'dynatemp_high': 1,
    'dynatemp_exponent': 1,
    'smoothing_factor': 0.22,
    'smoothing_curve': 1.0,
    'top_p': 1.0,
    'min_p': 0.0,
    'top_k': 0,
    'repetition_penalty': 1.0,
    'presence_penalty': 0.0,
    'frequency_penalty': 0.0,
    'repetition_penalty_range': 1024,
    'typical_p': 1.0,
    'tfs': 1.0,
    'top_a': 0.0,
    'guidance_scale': 1.0,
    'penalty_alpha': 0.0,
    'mirostat_mode': 0,
    'mirostat_tau': 7.35,
    'mirostat_eta': 0.1,
    'do_sample': True,
    'encoder_repetition_penalty': 1.0,
    'no_repeat_ngram_size': 0,
    'sampler_priority': [   'typical_p',
                            'min_p',
                            'temperature',
                            'dynamic_temperature',
                            'quadratic_sampling',
                            'top_k',
                            'top_p',
                            'epsilon_cutoff',
                            'eta_cutoff',
                            'tfs',
                            'top_a',
                            'mirostat'],
    'use_cache': True,
    'eos_token_id': [128001],
    'stopping_criteria': [   <modules.callbacks._StopEverythingStoppingCriteria object at 0x7fb929073730>],
    'logits_processor': [   <LogprobProcessor(logprobs=None, token_alternatives={})>]}

12:53:03-321840 INFO     PROMPT=                                                                                                                              

<|start_header_id|>system<|end_header_id|>
prompt goes here

12:53:03-538097 INFO     WARPERS=                                                                                                                             
['QuadraticSamplingLogitsWarper']

Traceback (most recent call last):
  File "/home/supermicro/ai/text-generation-webui-testing/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/home/supermicro/ai/text-generation-webui-testing/modules/text_generation.py", line 383, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/supermicro/miniconda3/envs/nvidia/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/supermicro/miniconda3/envs/nvidia/lib/python3.10/site-packages/transformers/generation/utils.py", line 1622, in generate
    result = self._sample(
  File "/home/supermicro/miniconda3/envs/nvidia/lib/python3.10/site-packages/transformers/generation/utils.py", line 2829, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 1.86 seconds (0.00 tokens/s, 0 tokens, context 631, seed 993078561)

It also doesn't generate 100% of the time from the UI either. I was using the git version of exllamav2 but downgraded to 0.0.18 and there is no difference; neither does switching to chat completions help. I am also using torch 2.2.2.

OK, so after I use SillyTavern and it crashes once, it also begins to crash in the webui the same way. When I first load the model, it generates and lets me switch a few presets in chat, chat-instruct, notebook, etc.

I can consistently crash it using mirostat in the webui.
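
(If it helps anyone narrow this down, a debugging sketch — not part of the webui — that flags non-finite logits before sampling, to tell whether the NaNs come out of the model's forward pass or are introduced later by a warper:)

import torch
from transformers import LogitsProcessor

# Debugging aid: passed to generate() via logits_processor, this runs
# before sampling and reports the first generation step at which the
# scores go non-finite.
class NanCheck(LogitsProcessor):
    def __call__(self, input_ids, scores):
        if not torch.isfinite(scores).all():
            print(f"non-finite logits at step {input_ids.shape[-1]}")
        return scores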

Ph0rk0z commented 5 months ago

OK, I figured it out. I have three 3090s: two of them have NVLink, one does not. When the model is split between an NVLinked 3090 and the one without, for some reason it does this. Loading it onto the two linked 3090s works. Not sure if it's an issue with the riser or a software issue, or why it only shows up with this model.
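
(A sketch of that interim workaround: hide the odd GPU out before anything initializes CUDA, so the split only uses the NVLinked pair. The indices 0 and 1 are assumptions; `nvidia-smi topo -m` shows which pair is actually linked:)

import os

# Expose only the two NVLinked 3090s to the process. This must run (or be
# exported in the shell) before torch/CUDA initialize. Device indices are
# an assumption; verify the linked pair with `nvidia-smi topo -m`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"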

I was able to solve it by updating to the latest CUDA driver. It was a bug in the 545.x series.
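
(For anyone hitting the same thing, a quick way to check the installed driver version from Python, assuming the nvidia-ml-py bindings are available:)

import pynvml  # provided by the nvidia-ml-py package

# Prints the installed NVIDIA driver version; anything in the 545.x series
# is suspect per the resolution above.
pynvml.nvmlInit()
print(pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()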