Ph0rk0z closed this issue 5 months ago
Can you post your request parameters? The first model I tried didn't have this problem. I used arbitrary sampler settings in my test.
Python 3.10.12
Name: torch Version: 2.2.1+cu121
Name: transformers Version: 4.40.0
Name: exllamav2 Version: 0.0.18+cu121
{
  "max_tokens": 256,
  "temperature": 1.05,
  "top_p": 0.8,
  "min_p": 0.2,
  "top_k": 10,
  "typical_p": 0.8,
  "top_a": 10,
  "repetition_penalty": 1.15,
  "repetition_penalty_range": 0,
  "min_length": 0,
  "stop": [],
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": ""
    },
    {
      "role": "user",
      "content": "Can chickens fly? answer me in short."
    }
  ],
  "continue_": false
}
{
  "id": "chatcmpl-1713628329170337280",
  "object": "chat.completions",
  "created": 1713628329,
  "model": "Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "Yes, but not very well! Chickens can lift off the ground and glide for short distances (up to 10-15 feet), but they are not capable of sustained flight like other birds."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 40,
    "total_tokens": 59
  }
}
{
  "max_tokens": 256,
  "temperature": 1.05,
  "top_p": 0.8,
  "min_p": 0.2,
  "top_k": 10,
  "typical_p": 0.8,
  "top_a": 10,
  "repetition_penalty": 1.15,
  "repetition_penalty_range": 0,
  "min_length": 0,
  "stop": [],
  "stream": false,
  "prompt": "Can chickens fly? answer me in short."
}
{
  "id": "conv-1713628396868506880",
  "object": "text_completion",
  "created": 1713628396,
  "model": "Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "text": "**\n\n\n\nNo,",
      "logprobs": {
        "top_logprobs": [
          {}
        ]
      }
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 4,
    "total_tokens": 13
  }
}
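For anyone who wants to replay the two requests above, here is a minimal sketch using only the Python standard library. The base URL and the /v1/chat/completions and /v1/completions paths are assumptions about a local OpenAI-compatible server and are not taken from this thread, and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumed address of a local OpenAI-compatible server (not from the thread).
BASE_URL = "http://127.0.0.1:5000"

# Sampler settings copied from the two requests above.
SAMPLER_SETTINGS = {
    "max_tokens": 256,
    "temperature": 1.05,
    "top_p": 0.8,
    "min_p": 0.2,
    "top_k": 10,
    "typical_p": 0.8,
    "top_a": 10,
    "repetition_penalty": 1.15,
    "repetition_penalty_range": 0,
    "min_length": 0,
    "stop": [],
    "stream": False,
}

QUESTION = "Can chickens fly? answer me in short."


def build_chat_payload(question):
    """Body for the chat completions request (messages-based)."""
    payload = dict(SAMPLER_SETTINGS)
    payload["messages"] = [
        {"role": "system", "content": ""},
        {"role": "user", "content": question},
    ]
    payload["continue_"] = False
    return payload


def build_completion_payload(question):
    """Body for the plain text completion request (prompt-based)."""
    payload = dict(SAMPLER_SETTINGS)
    payload["prompt"] = question
    return payload


def post_json(path, payload, timeout=120):
    """POST a JSON body to the assumed server and return the decoded reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage (needs a running server at BASE_URL):
#   chat = post_json("/v1/chat/completions", build_chat_payload(QUESTION))
#   text = post_json("/v1/completions", build_completion_payload(QUESTION))
```

Keeping the sampler settings in one dict makes it easy to bisect: remove one sampler at a time and resend both requests to see which one triggers the failure.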
I'm not sure how to log the actual requests on the SillyTavern side, but here is the verbose-mode output showing what textgen received:
12:53:03-320299 INFO GENERATE_PARAMS=
{ 'max_new_tokens': 500,
'temperature': 1.0,
'temperature_last': False,
'dynamic_temperature': False,
'dynatemp_low': 1,
'dynatemp_high': 1,
'dynatemp_exponent': 1,
'smoothing_factor': 0.22,
'smoothing_curve': 1.0,
'top_p': 1.0,
'min_p': 0.0,
'top_k': 0,
'repetition_penalty': 1.0,
'presence_penalty': 0.0,
'frequency_penalty': 0.0,
'repetition_penalty_range': 1024,
'typical_p': 1.0,
'tfs': 1.0,
'top_a': 0.0,
'guidance_scale': 1.0,
'penalty_alpha': 0.0,
'mirostat_mode': 0,
'mirostat_tau': 7.35,
'mirostat_eta': 0.1,
'do_sample': True,
'encoder_repetition_penalty': 1.0,
'no_repeat_ngram_size': 0,
'sampler_priority': [ 'typical_p',
'min_p',
'temperature',
'dynamic_temperature',
'quadratic_sampling',
'top_k',
'top_p',
'epsilon_cutoff',
'eta_cutoff',
'tfs',
'top_a',
'mirostat'],
'use_cache': True,
'eos_token_id': [128001],
'stopping_criteria': [ <modules.callbacks._StopEverythingStoppingCriteria object at 0x7fb929073730>],
'logits_processor': [ <LogprobProcessor(logprobs=None, token_alternatives={})>]}
12:53:03-321840 INFO PROMPT=
<|start_header_id|>system<|end_header_id|>
prompt goes here
12:53:03-538097 INFO WARPERS=
['QuadraticSamplingLogitsWarper']
Traceback (most recent call last):
File "/home/supermicro/ai/text-generation-webui-testing/modules/callbacks.py", line 61, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "/home/supermicro/ai/text-generation-webui-testing/modules/text_generation.py", line 383, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/supermicro/miniconda3/envs/nvidia/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/supermicro/miniconda3/envs/nvidia/lib/python3.10/site-packages/transformers/generation/utils.py", line 1622, in generate
result = self._sample(
File "/home/supermicro/miniconda3/envs/nvidia/lib/python3.10/site-packages/transformers/generation/utils.py", line 2829, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 1.86 seconds (0.00 tokens/s, 0 tokens, context 631, seed 993078561)
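The RuntimeError in the traceback comes from torch.multinomial, which refuses a probability tensor containing inf, nan, or a negative entry. As a rough illustration only (plain Python, not the transformers or torch code, with hypothetical function names), a single NaN logit is enough to turn every probability into NaN after softmax. Greedy decoding never calls torch.multinomial, so it would not raise this particular error.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_valid_distribution(probs):
    """Mirror the condition in the error message: no inf, no nan, nothing below 0."""
    return all(math.isfinite(p) and p >= 0.0 for p in probs)


healthy = softmax([1.0, 2.0, 3.0])
poisoned = softmax([1.0, float("nan"), 3.0])

print(is_valid_distribution(healthy))
print(is_valid_distribution(poisoned))
```

The point of the sketch is that the sampler settings are not necessarily at fault: if the model's forward pass emits even one non-finite logit, any sampling path that ends in a multinomial draw will fail.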
It doesn't reliably generate from the UI either. I was using the git version of exllama but downgraded to 0.0.18, and there is no difference. Switching to chat completions doesn't help either. I'm also using torch 2.2.2.
OK, so after I use SillyTavern and it crashes once, it starts crashing in the webui the same way. When I first load the model, it generates and lets me switch between a few presets in chat, chat-instruct, notebook, etc.
I can consistently crash it using mirostat in the webui.
OK, I figured it out. I have three 3090s. Two of them are connected with NVLink; one is not. When the model is split between an NVLinked 3090 and the one without, for some reason it does this. Loading it onto the two linked 3090s works. I'm not sure whether it's an issue with the riser or a software issue, or why it only happens with this model.
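One way to try the "linked pair only" workaround is to hide the third card from the process with the CUDA_VISIBLE_DEVICES environment variable. The sketch below only builds the environment; the GPU indices 0 and 1 are an assumption (check `nvidia-smi topo -m` to see which cards are actually linked), and the helper name is hypothetical.

```python
import os


def env_for_gpus(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


# Hypothetical indices of the two NVLink-connected cards.
LINKED_GPUS = [0, 1]
linked_env = env_for_gpus(LINKED_GPUS)

# To launch the webui with it (server.py as the entry point is an assumption):
#   import subprocess, sys
#   subprocess.run([sys.executable, "server.py"], env=linked_env, check=True)
```

This keeps the model off the unlinked card without touching the loader's GPU split settings, which makes it a quick way to confirm whether the cross-card split is what triggers the NaNs.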
I was able to solve it by updating to the latest CUDA driver. It was a bug in 545.x.
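If you want to check whether a machine is still on the 545.x branch, something like the following might help. It is a sketch: the helper names are hypothetical, and it assumes nvidia-smi is on the PATH and reports one driver version per GPU.

```python
import subprocess


def driver_branch(version):
    """Return the major branch of a driver version string, e.g. '545.29.06' -> 545."""
    return int(version.strip().split(".")[0])


def is_affected(version, bad_branch=545):
    """True if the version belongs to the branch reported as buggy in this thread."""
    return driver_branch(version) == bad_branch


def installed_driver_versions():
    """Ask nvidia-smi for the driver version of each GPU (requires an NVIDIA driver)."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in output.splitlines() if line.strip()]


# Example usage on a machine with an NVIDIA driver installed:
#   for version in installed_driver_versions():
#       print(version, "affected" if is_affected(version) else "ok")
```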
Describe the bug
At first I downloaded https://huggingface.co/LoneStriker/Meta-Llama-3-70B-Instruct-5.0bpw-h6-exl2. Using any kind of sampling within textgen caused a NaN error, especially with more advanced samplers like min_p. Adding the pad token to the config makes it generate within the UI most of the time.
Next I downloaded https://huggingface.co/turboderp/Llama-3-70B-Instruct-exl2/tree/5.0bpw and this one works within the webui but still causes NaN over the API.
I have tried different environments and the git version of transformers, as well as torch 2.2.2 and 2.2.1, with streaming and without. The 8B model works fine, and so do other models. Turning off do_sample allows it to generate. Hope I'm not the only one with this issue.
Is there an existing issue for this?
Reproduction
Load the LoneStriker quant and try to generate within the webui. Load the turboderp quant and generate over the OpenAI API.
Screenshot
No response
Logs
System Info