Closed: yesbroc closed this issue 1 year ago
Realized it was the instruct chat template thing in characters. Wish it changed automatically; it's so annoying changing models and then forgetting to change templates
(also must use instruct)
Same model, similar kind of problem.
INFO:connection open
ERROR:connection handler failed
Traceback (most recent call last):
File "D:\LLM\oobabooga_windows\installer_files\env\lib\site-packages\websockets\legacy\server.py", line 240, in handler
await self.ws_handler(self)
File "D:\LLM\oobabooga_windows\installer_files\env\lib\site-packages\websockets\legacy\server.py", line 1186, in _ws_handler
return await cast(
File "D:\LLM\oobabooga_windows\text-generation-webui\extensions\api\streaming_api.py", line 35, in _handle_connection
for a in generator:
File "D:\LLM\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 184, in generate_reply
for reply in generate_func(question, original_question, seed, state, eos_token, stopping_strings, is_chat=is_chat):
File "D:\LLM\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 316, in generate_reply_custom
new_tokens = len(encode(original_question + reply)[0]) - original_tokens
UnboundLocalError: local variable 'reply' referenced before assignment
INFO:connection closed
It also happens when there are tons of tokens.
Same issue. Using the default character. All work done on CPU. Seems to happen regardless of character, including with no character. Doesn't matter if using instruct or not either. This is default settings across the board, using the uncensored Wizard Mega 13B model quantized to 4 bits (using llama.cpp). None of the workarounds have had any effect thus far (for me).
Traceback (most recent call last):
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/blocks.py", line 930, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/joe/anaconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, args, cancellable=cancellable,
  File "/home/joe/anaconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/home/joe/anaconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, args)
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/chat.py", line 307, in generate_chat_reply_wrapper
    for history in generate_chat_reply(text, state, regenerate, _continue):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/chat.py", line 301, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/chat.py", line 224, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, eos_token=eos_token, stopping_strings=stopping_strings, is_chat=True)):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/text_generation.py", line 184, in generate_reply
    for reply in generate_func(question, original_question, seed, state, eos_token, stopping_strings, is_chat=is_chat):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/text_generation.py", line 316, in generate_reply_custom
    new_tokens = len(encode(original_question + reply)[0]) - original_tokens
UnboundLocalError: local variable 'reply' referenced before assignment
Specs:
OS: Debian 11 bullseye
CPU: 2x Xeon E5-2660 V2 (Dell PowerEdge R720)
RAM: 128GB DDR3
GPU: none, except an integrated Matrox which does no work
Reducing context from 2048 to just 2000 solved this completely for me
I fixed it for myself by just editing the file oobabooga\text-generation-webui\modules\text_generation.py: in the method def generate_reply_custom, put reply = ''
(around line 292):
def generate_reply_custom(question, original_question, seed, state, eos_token=None, stopping_strings=None, is_chat=False):
    seed = set_manual_seed(state['seed'])
    generate_params = {'token_count': state['max_new_tokens']}
    for k in ['temperature', 'top_p', 'top_k', 'repetition_penalty']:
        generate_params[k] = state[k]

    reply = ''  # <-- this line
    t0 = time.time()
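For what it's worth, here is a minimal, self-contained sketch of the failure pattern with purely hypothetical names (not the actual webui code): the backend runs in a worker thread, and if it dies before producing a single token, the stream yields nothing, the loop body never assigns reply, and the bookkeeping line afterwards hits an unbound local. Initialising reply = '' avoids the crash, but only masks the underlying backend failure (often out of memory).

def fake_stream():
    # Pretend the backend crashed before the first token; the thread wrapper
    # swallows the error, so the caller just sees an empty stream.
    return
    yield  # unreachable, but makes this function a generator

def generate_sketch(prompt):
    # reply = ''  # <-- the workaround: uncomment to avoid the crash below
    for reply in fake_stream():
        yield prompt + reply
    # This still runs even though the loop body never executed:
    new_tokens = len(prompt + reply)  # UnboundLocalError: 'reply' referenced before assignment
    yield f"(generated {new_tokens} characters)"

list(generate_sketch("hi"))  # reproduces the error from the tracebacks in this thread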
I fixed it for myself by just editing the file oobabooga\text-generation-webui\modules\text_generation.py: in the method def generate_reply_custom, put
reply = ''
(around line 292):
Llama.generate: prefix-match hit
ggml_new_tensor_impl: not enough space in the scratch memory
Traceback (most recent call last):
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py", line 74, in generate
    for completion_chunk in completion_chunks:
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 651, in _create_completion
    for token in self.generate(
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 507, in generate
    self.eval(tokens)
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 259, in eval
    return_code = llama_cpp.llama_eval(
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 336, in llama_eval
    return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation writing 0x0000000000000050
Output generated in 0.48 seconds (0.00 tokens/s, 0 tokens, context 1692, seed 1238759014)
tried this, guess not
Reducing context from 2048 to just 2000 solved this completely for me
You think Wizard Vicuna can only handle 2000 tokens? Or is it the llama.cpp integration that's breaking?
ggml_new_tensor_impl: not enough space in the scratch memory
Looks like just an out-of-memory issue: too large a model on too weak hardware. Check the requirement tables at https://www.reddit.com/r/LocalLLaMA/wiki/models/ - you must have enough RAM/VRAM for your model. For example, it's 20GB of VRAM for a 13B model.
ggml_new_tensor_impl: not enough space in the scratch memory
Looks like just an out-of-memory issue: too large a model on too weak hardware. Check the requirement tables at https://www.reddit.com/r/LocalLLaMA/wiki/models/ - you must have enough RAM/VRAM for your model. For example, it's 20GB of VRAM for a 13B model.
Is this because of the new GGML method? P.S. I definitely had enough RAM to run 5-bit before May 12th (tried with WizardLM 7B as well, same issue).
Reducing context from 2048 to just 2000 solved this completely for me
You think Wizard Vicuna can only handle 2000 tokens? Or is it the llama.cpp integration that's breaking?
No, I bet it's the usual model, but when the context goes too high something breaks, so if I just reduce the max context slightly it doesn't break anymore. At this point I don't care what actually happens, because the 48 tokens of context I supposedly trade for flawless performance isn't something to be worried about.
"Solved" here https://github.com/oobabooga/text-generation-webui/pull/2136, but this is a symptom of another error, probably out of memory?
This even happens when running CPU models on a Threadripper 3970X (32C/64T) with 256 GB of RAM... tried it with --mlock and without, and with various models from 13B to 65B. It seems to be related to the length of the new input, as it even happens with the very first input after loading the webui. I didn't test in detail, but around 200 to 300 tokens are fine, while 400 to 500 create the error. The strange thing is that it sometimes works... and sometimes not.
The issue started for me when I updated to the newest version on May 16 or so. I did the previous update a few days before that, and everything worked fine up to May 16.
Traceback (most recent call last):
  File "H:\llm\installer_files\env\lib\site-packages\gradio\routes.py", line 414, in run_predict
    output = await app.get_blocks().process_api(
  File "H:\llm\installer_files\env\lib\site-packages\gradio\blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "H:\llm\installer_files\env\lib\site-packages\gradio\blocks.py", line 1067, in call_function
    prediction = await utils.async_iteration(iterator)
  File "H:\llm\installer_files\env\lib\site-packages\gradio\utils.py", line 339, in async_iteration
    return await iterator.__anext__()
  File "H:\llm\installer_files\env\lib\site-packages\gradio\utils.py", line 332, in __anext__
    return await anyio.to_thread.run_sync(
  File "H:\llm\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\llm\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "H:\llm\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "H:\llm\installer_files\env\lib\site-packages\gradio\utils.py", line 315, in run_sync_iterator_async
    return next(iterator)
  File "H:\llm\text-generation-webui\modules\chat.py", line 307, in generate_chat_reply_wrapper
    for history in generate_chat_reply(text, state, regenerate, _continue):
  File "H:\llm\text-generation-webui\modules\chat.py", line 301, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue):
  File "H:\llm\text-generation-webui\modules\chat.py", line 224, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, eos_token=eos_token, stopping_strings=stopping_strings, is_chat=True)):
  File "H:\llm\text-generation-webui\modules\text_generation.py", line 184, in generate_reply
    for reply in generate_func(question, original_question, seed, state, eos_token, stopping_strings, is_chat=is_chat):
  File "H:\llm\text-generation-webui\modules\text_generation.py", line 316, in generate_reply_custom
    new_tokens = len(encode(original_question + reply)[0]) - original_tokens
UnboundLocalError: local variable 'reply' referenced before assignment
That definitely seems like part of the issue. I've noticed it happens almost immediately with some of my characters with the long_term_memory extension enabled. As soon as it adds more information to the context, it fails immediately. Adding
reply = ''
to text_generation.py then gives me an output of 'out of memory' when I'm running on CPU with 128 gigs of RAM and sitting at almost alarmingly low memory usage.
It'll still fail with characters loaded and long_term_memory not enabled, but it takes longer to do so.
There seems to be a connection with the "max_new_tokens" setting in the parameters. If I set it manually to some value that's lower than 2000 minus the number of tokens in the input, it works.
An input of 630 tokens works with max_new_tokens at 1300 but not at 1500. I suppose there is a check somewhere in the code now? The output just stopped when the token limit was reached in previous versions...
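That would be consistent with the prompt and the requested new tokens having to fit into the model's 2048-token context window together. A rough sanity check, as an illustrative Python sketch (not webui code; the 2048 limit is the n_ctx value llama.cpp reports in the load logs later in this thread):

CTX_LEN = 2048  # n_ctx reported by llama.cpp for these models

def fits(input_tokens: int, max_new_tokens: int) -> bool:
    # Prompt tokens plus requested new tokens must fit in the context window.
    return input_tokens + max_new_tokens <= CTX_LEN

print(fits(630, 1300))  # True:  630 + 1300 = 1930 <= 2048 (works, per the report above)
print(fits(630, 1500))  # False: 630 + 1500 = 2130 >  2048 (reportedly fails)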
"Solved" here #2136, but this is a symptom of another error, probably out of memory?
I'll try this, thanks
"Solved" here #2136, but this is a symptom of another error, probably out of memory?
Does work, llama.cpp 0.1.51 is just spewing out code lol
Never mind, downgrading worked, I'm going to close this now
Reopened because it now magically stops generating instead of throwing the UnboundLocalError. I'm running with --triton --n-gpu-layers 10000 --cache-capacity 8000.
INFO:Cache capacity is 8000 bytes
llama.cpp: loading model from models\Wizard-Vicuna-7B-Uncensored.ggmlv2.q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 72.75 KB
llama_model_load_internal: mem required = 6612.59 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 6.87 seconds.
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Llama._create_completion: cache miss
Llama._create_completion: cache save
Llama.save_state: saving 22737964 bytes of llama state
llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 0.64 ms / 3 runs (0.21 ms per token)
llama_print_timings: prompt eval time = 3503.85 ms / 41 tokens (85.46 ms per token)
llama_print_timings: eval time = 747.73 ms / 2 runs (373.87 ms per token)
llama_print_timings: total time = 4598.33 ms
Output generated in 5.15 seconds (0.39 tokens/s, 2 tokens, context 41, seed 424028198)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 41612332 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 4.50 ms / 20 runs (0.22 ms per token)
llama_print_timings: prompt eval time = 1330.16 ms / 17 tokens (78.24 ms per token)
llama_print_timings: eval time = 6751.96 ms / 19 runs (355.37 ms per token)
llama_print_timings: total time = 10545.49 ms
Output generated in 10.97 seconds (1.73 tokens/s, 19 tokens, context 60, seed 410699725)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 39515180 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 2.39 ms / 10 runs (0.24 ms per token)
llama_print_timings: prompt eval time = 4336.88 ms / 65 tokens (66.72 ms per token)
llama_print_timings: eval time = 2875.09 ms / 9 runs (319.45 ms per token)
llama_print_timings: total time = 8348.32 ms
Output generated in 8.95 seconds (1.01 tokens/s, 9 tokens, context 66, seed 1575520995)
Output generated in 0.52 seconds (0.00 tokens/s, 0 tokens, context 70, seed 1631212757)
Output generated in 0.49 seconds (0.00 tokens/s, 0 tokens, context 71, seed 865281745)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Output generated in 9.99 seconds (1.80 tokens/s, 18 tokens, context 43, seed 428689201)
Output generated in 0.39 seconds (0.00 tokens/s, 0 tokens, context 61, seed 1255914242)
Output generated in 0.41 seconds (0.00 tokens/s, 0 tokens, context 129, seed 1473116747)
Output generated in 0.40 seconds (0.00 tokens/s, 0 tokens, context 49, seed 2096328356)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 22737964 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 0.68 ms / 3 runs (0.23 ms per token)
llama_print_timings: prompt eval time = 467.04 ms / 6 tokens (77.84 ms per token)
llama_print_timings: eval time = 587.56 ms / 2 runs (293.78 ms per token)
llama_print_timings: total time = 1388.98 ms
Output generated in 1.79 seconds (1.11 tokens/s, 2 tokens, context 41, seed 529334198)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 45806636 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 9.23 ms / 43 runs (0.21 ms per token)
llama_print_timings: prompt eval time = 824.12 ms / 10 tokens (82.41 ms per token)
llama_print_timings: eval time = 12144.30 ms / 42 runs (289.15 ms per token)
llama_print_timings: total time = 17187.52 ms
Output generated in 17.59 seconds (2.39 tokens/s, 42 tokens, context 45, seed 493354527)
Output generated in 0.44 seconds (0.00 tokens/s, 0 tokens, context 87, seed 795152686)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 27456556 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 2.16 ms / 11 runs (0.20 ms per token)
llama_print_timings: prompt eval time = 595.82 ms / 7 tokens (85.12 ms per token)
llama_print_timings: eval time = 3037.12 ms / 10 runs (303.71 ms per token)
llama_print_timings: total time = 4818.01 ms
Output generated in 5.26 seconds (1.90 tokens/s, 10 tokens, context 42, seed 364646693)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 87749676 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 26.13 ms / 124 runs (0.21 ms per token)
llama_print_timings: prompt eval time = 657.34 ms / 9 tokens (73.04 ms per token)
llama_print_timings: eval time = 35427.72 ms / 123 runs (288.03 ms per token)
llama_print_timings: total time = 48059.42 ms
Output generated in 48.44 seconds (2.54 tokens/s, 123 tokens, context 44, seed 1819441613)
Output generated in 0.39 seconds (0.00 tokens/s, 0 tokens, context 382, seed 39045789)
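As a rough cross-check of the memory figures in the load log above, here is a back-of-the-envelope sketch in Python (assumed formula for an f16 KV cache, using only the n_layer/n_ctx/n_embd values llama.cpp printed; not webui code):

# Back-of-the-envelope KV-cache size from the parameters in the load log above.
# Assumed formula: 2 tensors (K and V) x n_layer x n_ctx x n_embd x 2 bytes (f16).
n_layer, n_ctx, n_embd = 32, 2048, 4096
kv_bytes = 2 * n_layer * n_ctx * n_embd * 2
print(kv_bytes / 2**20)  # 1024.0 -- matches "kv self size = 1024.00 MB" above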
Reopened because I think this issue is related to this error. P.S. I updated to the latest version of llama.cpp; it could also be the outdated model. Just got to wait for the GGMLv3 models to finish.
GGMLv3 isn't supported by ooba 💀
Which one do you want to load? 1-6
3
INFO:Loading Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin...
INFO:llama.cpp weights detected: models\Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin
INFO:Cache capacity is 8000 bytes
llama.cpp: loading model from models\Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
llama_init_from_file: failed to load model
Traceback (most recent call last):
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\text-generation-webui\server.py", line 998, in
Done!
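For anyone hitting the same unknown (magic, version) error: below is a small, hypothetical helper (not part of the webui or llama.cpp) that reads the header of a GGML/GGJT model file. In the log above, 67676a74 is the magic "ggjt" and 00000003 is file version 3, i.e. a GGMLv3 file that the bundled llama-cpp-python build did not yet understand. The sketch assumes a ggjt-style header with a version field after the magic.

import struct
import sys

def ggml_header(path):
    # Read magic (uint32, little-endian) and version (uint32) from the file header.
    # Legacy "ggml"-magic files have no version field, so the second value is
    # only meaningful for ggmf/ggjt-style headers.
    with open(path, "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))
    return magic.to_bytes(4, "big").decode("ascii", errors="replace"), version

if __name__ == "__main__":
    magic, version = ggml_header(sys.argv[1])
    print(f"magic={magic!r} version={version}")
    # e.g. magic='ggjt' version=3 for the file that failed to load above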
Describe the bug
"reply" wasnt referenced or smth, so i guess it breaks every time i send something. i have gotten around this but thats still unreliable. triton helps a ton
Is there an existing issue for this?
Reproduction
Load GGML models with these flags: --chat --triton --xformers (not sure why) --n-gpu-layers 200 --threads 14 --extensions long_term_memory
Screenshot
No response
Logs
System Info