Open Stargate256 opened 2 weeks ago
I am running into the same issue, also on Debian 12, on an older Intel CPU, while trying to run a Qwen2.5 exl2 model over the OpenAI-compatible API (with Cline and Aider). In my case a few requests work, then this error occurs, and after that the responses contain few or no characters. Unloading and reloading the model doesn't seem to help.
I'm running the web UI directly on physical hardware. I tried upgrading all the packages on my system, which brought in a new kernel, but nothing changed after the upgrade.
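For context, the requests were sent roughly like the sketch below. This is a minimal reproduction sketch, not taken from my actual client setup: it assumes the default OpenAI-compatible endpoint on port 5000, and the prompt and loop count are placeholders; only the model name matches my setup.

```python
# Minimal reproduction sketch (assumptions: endpoint URL, prompt, loop count).
# Each iteration sends a chat completion to the web UI's OpenAI-compatible API;
# once the OverflowError appears in the server console, the returned content
# length drops to zero or near zero.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="none")

for i in range(20):
    resp = client.chat.completions.create(
        model="bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25",
        messages=[{"role": "user", "content": "Write a short Python function that reverses a string."}],
        max_tokens=200,
    )
    content = resp.choices[0].message.content or ""
    print(f"request {i}: {len(content)} characters returned")
```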
Logs
13:27:44-948132 INFO Starting Text generation web UI
13:27:44-952671 WARNING
You are potentially exposing the web UI to the entire
internet without any access password.
You can create one with the "--gradio-auth" flag like
this:
--gradio-auth username:password
Make sure to replace username:password with your own.
13:27:44-954803 INFO Loading the extension "openai"
13:27:45-089753 INFO OpenAI-compatible API URL:
http://0.0.0.0:5000
Running on local URL: http://0.0.0.0:7860
13:27:51-237158 INFO Loading
"bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
warnings.warn(
13:27:58-621663 INFO Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
in 7.38 seconds.
13:27:58-623069 INFO LOADER: "ExLlamav2_HF"
13:27:58-624496 INFO TRUNCATION LENGTH: 8000
13:27:58-625390 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model
metadata)"
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
warnings.warn(
Output generated in 4.25 seconds (21.15 tokens/s, 90 tokens, context 896, seed 2117395925)
Output generated in 2.80 seconds (22.11 tokens/s, 62 tokens, context 1011, seed 1939351019)
Traceback (most recent call last):
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
return self._decode(
File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 2.30 seconds (21.29 tokens/s, 49 tokens, context 1274, seed 1253585262)
Traceback (most recent call last):
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
return self._decode(
File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 0.74 seconds (1.36 tokens/s, 1 tokens, context 1153, seed 8088406)
13:32:10-609612 INFO Loading
"bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
13:32:16-405005 INFO Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
in 5.79 seconds.
13:32:16-407239 INFO LOADER: "ExLlamav2_HF"
13:32:16-408041 INFO TRUNCATION LENGTH: 8000
13:32:16-408885 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model
metadata)"
Traceback (most recent call last):
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
return self._decode(
File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 1.30 seconds (0.77 tokens/s, 1 tokens, context 1176, seed 304963644)
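For what it's worth, the OverflowError above is raised inside the Hugging Face fast tokenizer when decode() is given a token id it cannot convert to an unsigned integer (for example a negative id). Below is a sketch of that failure mode, purely as an illustration; the tokenizer repo name, the bad id, and the filtering workaround are my own assumptions, not something taken from the web UI code or this log.

```python
# Illustration only: a fast (Rust-backed) tokenizer converts ids to unsigned
# integers during decode, so an out-of-range id such as -1 raises
# "OverflowError: out of range integral type conversion attempted".
from transformers import AutoTokenizer

# Assumed tokenizer repo; any fast tokenizer shows the same behaviour.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-14B-Instruct")

bad_ids = [8948, -1, 151645]  # hypothetical output ids containing an invalid id
try:
    tokenizer.decode(bad_ids, skip_special_tokens=True)
except OverflowError as exc:
    print("decode failed:", exc)

# Defensive workaround sketch: drop ids outside the tokenizer's range before
# decoding. This only masks the symptom; it does not explain where the bad id
# comes from.
safe_ids = [i for i in bad_ids if 0 <= i < len(tokenizer)]
print(tokenizer.decode(safe_ids, skip_special_tokens=True))
```

If that is what is happening here, it would point at the ExLlamav2_HF path occasionally handing an out-of-range id to the tokenizer, rather than at the tokenizer itself.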
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
CPU family: 6
Model: 58
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 9
CPU(s) scaling MHz: 42%
CPU max MHz: 3800.0000
CPU min MHz: 1600.0000
BogoMIPS: 6799.95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Vulnerable: No microcode
Tsx async abort: Not affected
free -m
total used free shared buff/cache available
Mem: 23982 1365 12767 4 10200 22617
Swap: 7999 0 7999
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 On | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 11W / 170W | 181MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 777 G /usr/lib/xorg/Xorg 167MiB |
| 0 N/A N/A 964 G /usr/bin/gnome-shell 8MiB |
+---------------------------------------------------------------------------------------+
uname -a
Linux bash-3lpc 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux
python3 --version
Python 3.11.2
conda --version
conda 23.5.2
cat /etc/debian_version
12.8
git rev-parse HEAD
cc8c7ed2093cbc747e7032420eae14b5b3c30311
Actually, it seems like the ExLlamav2 loader works. Previously I was using the auto-suggested ExLlamav2_HF loader.
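For anyone scripting this rather than using the Model tab, here is a rough sketch of reloading the model with the plain ExLlamav2 loader. The /v1/internal/model/load endpoint and the "args" payload keys are assumptions based on the project's internal API examples, not something I verified in this report; adjust them to match your version.

```python
# Hypothetical sketch (endpoint path, payload keys, and timeout are assumptions):
# reload the same model with the plain ExLlamav2 loader instead of ExLlamav2_HF.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/internal/model/load",
    json={
        "model_name": "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25",
        "args": {"loader": "ExLlamav2"},
    },
    timeout=300,
)
resp.raise_for_status()
print("reloaded with the ExLlamav2 loader")
```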
Logs (prompts were sent from Aider)
14:04:59-537587 INFO Loading "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
14:05:06-734713 INFO Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25" in 7.20 seconds.
14:05:06-736142 INFO LOADER: "ExLlamav2"
14:05:06-737239 INFO TRUNCATION LENGTH: 8000
14:05:06-738057 INFO INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"
Output generated in 6.36 seconds (23.59 tokens/s, 150 tokens, context 1223, seed 186741678)
Output generated in 6.44 seconds (28.58 tokens/s, 184 tokens, context 791, seed 1585153886)
Output generated in 8.19 seconds (28.21 tokens/s, 231 tokens, context 1100, seed 1836764177)
Output generated in 7.53 seconds (29.60 tokens/s, 223 tokens, context 1371, seed 991524308)
Output generated in 10.12 seconds (29.56 tokens/s, 299 tokens, context 991, seed 455187287)
Output generated in 7.84 seconds (14.41 tokens/s, 113 tokens, context 4991, seed 1045094786)
Output generated in 1.10 seconds (16.42 tokens/s, 18 tokens, context 5210, seed 2042193150)
Output generated in 1.02 seconds (17.63 tokens/s, 18 tokens, context 5334, seed 1359728911)
Output generated in 1.10 seconds (19.17 tokens/s, 21 tokens, context 5458, seed 1694625255)
Output generated in 1.03 seconds (17.47 tokens/s, 18 tokens, context 5584, seed 1240670815)
Output generated in 1.10 seconds (19.02 tokens/s, 21 tokens, context 5708, seed 951578707)
Output generated in 1.11 seconds (18.96 tokens/s, 21 tokens, context 5834, seed 498927830)
Output generated in 1.11 seconds (18.88 tokens/s, 21 tokens, context 5960, seed 131397278)
Output generated in 1.12 seconds (18.82 tokens/s, 21 tokens, context 6086, seed 521276101)
Output generated in 1.13 seconds (18.59 tokens/s, 21 tokens, context 6212, seed 995108441)
Output generated in 1.13 seconds (18.54 tokens/s, 21 tokens, context 6338, seed 143805776)
Output generated in 2.15 seconds (22.81 tokens/s, 49 tokens, context 6464, seed 2070214832)
Output generated in 3.90 seconds (28.23 tokens/s, 110 tokens, context 4991, seed 805553205)
Output generated in 1.10 seconds (16.41 tokens/s, 18 tokens, context 5207, seed 1120525451)
Output generated in 1.01 seconds (17.75 tokens/s, 18 tokens, context 5331, seed 693321549)
Output generated in 1.10 seconds (19.16 tokens/s, 21 tokens, context 5455, seed 763349559)
Output generated in 0.86 seconds (16.28 tokens/s, 14 tokens, context 5581, seed 1450090146)
Output generated in 1.54 seconds (12.37 tokens/s, 19 tokens, context 1130, seed 1622652563)
Output generated in 7.42 seconds (30.73 tokens/s, 228 tokens, context 1175, seed 1043527426)
Output generated in 11.83 seconds (30.60 tokens/s, 362 tokens, context 762, seed 1054832108)
Output generated in 8.22 seconds (27.37 tokens/s, 225 tokens, context 1430, seed 600097550)
Output generated in 11.96 seconds (30.26 tokens/s, 362 tokens, context 866, seed 832375840)
Output generated in 8.24 seconds (28.16 tokens/s, 232 tokens, context 1188, seed 1514631067)
Output generated in 11.94 seconds (30.33 tokens/s, 362 tokens, context 830, seed 1119770377)
Output generated in 7.61 seconds (27.85 tokens/s, 212 tokens, context 1206, seed 83295453)
Output generated in 13.99 seconds (30.59 tokens/s, 428 tokens, context 836, seed 1989837235)
Output generated in 7.92 seconds (27.92 tokens/s, 221 tokens, context 1252, seed 1324992220)
Output generated in 15.83 seconds (30.70 tokens/s, 486 tokens, context 874, seed 151775036)
Output generated in 11.13 seconds (29.03 tokens/s, 323 tokens, context 1232, seed 746128985)
Just wanted to confirm that I have the same issue.
Model: bartowski/Qwen2.5-Coder-32B-Instruct-exl2 @ 4.25
Loader: ExLlamav2_HF
Describe the bug
When running inference over the OpenAI-compatible API with Perplexica or avante.nvim, the error sometimes appears; after that happens, generation no longer works until I restart the program. (It worked fine with Open WebUI.)