Closed: yesbroc closed this issue 1 year ago
Realized it was the instruct chat template thing in characters. Wish it changed automatically; it's so annoying changing models and then forgetting to change templates
(also must use instruct)
Same model, similar kind of problem.
INFO:connection open
ERROR:connection handler failed
Traceback (most recent call last):
File "D:\LLM\oobabooga_windows\installer_files\env\lib\site-packages\websockets\legacy\server.py", line 240, in handler
await self.ws_handler(self)
File "D:\LLM\oobabooga_windows\installer_files\env\lib\site-packages\websockets\legacy\server.py", line 1186, in _ws_handler
return await cast(
File "D:\LLM\oobabooga_windows\text-generation-webui\extensions\api\streaming_api.py", line 35, in _handle_connection
for a in generator:
File "D:\LLM\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 184, in generate_reply
for reply in generate_func(question, original_question, seed, state, eos_token, stopping_strings, is_chat=is_chat):
File "D:\LLM\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 316, in generate_reply_custom
new_tokens = len(encode(original_question + reply)[0]) - original_tokens
UnboundLocalError: local variable 'reply' referenced before assignment
INFO:connection closed
It also happens when there are tons of tokens.
Same issue. Using the default character. All work done on CPU. Seems to happen regardless of character, including with no character. Doesn't matter if using instruct or not either. This is default settings across the board, using the uncensored Wizard Mega 13B model quantized to 4 bits (using llama.cpp). None of the workarounds have had any effect thus far (for me).
Traceback (most recent call last):
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/routes.py", line 395, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1193, in process_api
    result = await self.call_function(
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/blocks.py", line 930, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/joe/anaconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, args, cancellable=cancellable,
  File "/home/joe/anaconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/home/joe/anaconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, args)
  File "/home/joe/anaconda3/lib/python3.10/site-packages/gradio/utils.py", line 491, in async_iteration
    return next(iterator)
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/chat.py", line 307, in generate_chat_reply_wrapper
    for history in generate_chat_reply(text, state, regenerate, _continue):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/chat.py", line 301, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/chat.py", line 224, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, eos_token=eos_token, stopping_strings=stopping_strings, is_chat=True)):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/text_generation.py", line 184, in generate_reply
    for reply in generate_func(question, original_question, seed, state, eos_token, stopping_strings, is_chat=is_chat):
  File "/home/joe/Documents/clones/oobabooga_linux/text-generation-webui/modules/text_generation.py", line 316, in generate_reply_custom
    new_tokens = len(encode(original_question + reply)[0]) - original_tokens
UnboundLocalError: local variable 'reply' referenced before assignment
Specs:
OS: Debian 11 bullseye
CPU: 2x Xeon E5-2660 V2 (Dell PowerEdge R720)
RAM: 128GB DDR3
GPU: none, except an integrated Matrox which does no work
Reducing context from 2048 to just 2000 solved this completely for me
I fixed it for myself by just editing the file oobabooga\text-generation-webui\modules\text_generation.py: in the method def generate_reply_custom, put reply = ''
(around line 292):
def generate_reply_custom(question, original_question, seed, state, eos_token=None, stopping_strings=None, is_chat=False):
    seed = set_manual_seed(state['seed'])
    generate_params = {'token_count': state['max_new_tokens']}
    for k in ['temperature', 'top_p', 'top_k', 'repetition_penalty']:
        generate_params[k] = state[k]

    reply = ''  # <-- this line
    t0 = time.time()
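For what it's worth, here is a minimal, self-contained sketch of the failure pattern with purely hypothetical names (not the actual webui code): the backend runs in a worker thread, and if it dies before producing a single token, the stream yields nothing, the loop body never assigns reply, and the bookkeeping line afterwards hits an unbound local. Initialising reply = '' avoids the crash, but only masks the underlying backend failure (often out of memory).

def fake_stream():
    # Pretend the backend crashed before the first token; the thread wrapper
    # swallows the error, so the caller just sees an empty stream.
    return
    yield  # unreachable, but makes this function a generator

def generate_sketch(prompt):
    # reply = ''  # <-- the workaround: uncomment to avoid the crash below
    for reply in fake_stream():
        yield prompt + reply
    # This still runs even though the loop body never executed:
    new_tokens = len(prompt + reply)  # UnboundLocalError: 'reply' referenced before assignment
    yield f"(generated {new_tokens} characters)"

list(generate_sketch("hi"))  # reproduces the error from the tracebacks in this thread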
I fixed it for myself by just editing the file oobabooga\text-generation-webui\modules\text_generation.py: in the method def generate_reply_custom, put
reply = ''
(around line 292):
Llama.generate: prefix-match hit
ggml_new_tensor_impl: not enough space in the scratch memory
Traceback (most recent call last):
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\llamacpp_model.py", line 74, in generate
    for completion_chunk in completion_chunks:
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 651, in _create_completion
    for token in self.generate(
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 507, in generate
    self.eval(tokens)
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama.py", line 259, in eval
    return_code = llama_cpp.llama_eval(
  File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 336, in llama_eval
    return _lib.llama_eval(ctx, tokens, n_tokens, n_past, n_threads)
OSError: exception: access violation writing 0x0000000000000050
Output generated in 0.48 seconds (0.00 tokens/s, 0 tokens, context 1692, seed 1238759014)
tried this, guess not
Reducing context from 2048 to just 2000 solved this completely for me
You think Wizard Vicuna can only handle 2000 tokens? Or is it the llama.cpp integration that's breaking?
ggml_new_tensor_impl: not enough space in the scratch memory
Looks like just an out-of-memory issue: too large a model on too weak hardware. Check the requirement tables at https://www.reddit.com/r/LocalLLaMA/wiki/models/ - you must have enough RAM/VRAM for your model. For example, it's 20GB of VRAM for a 13B model.
ggml_new_tensor_impl: not enough space in the scratch memory
Looks like just an out-of-memory issue: too large a model on too weak hardware. Check the requirement tables at https://www.reddit.com/r/LocalLLaMA/wiki/models/ - you must have enough RAM/VRAM for your model. For example, it's 20GB of VRAM for a 13B model.
Is this because of the new GGML method? P.S. I definitely had enough RAM to run 5-bit before May 12th (tried with WizardLM 7B as well, same issue).
Reducing context from 2048 to just 2000 solved this completely for me
You think Wizard Vicuna can only handle 2000 tokens? Or is it the llama.cpp integration that's breaking?
No, I bet it's the usual model, but when the context goes too high something breaks, so if I just reduce the max context slightly it doesn't break anymore. At this point I don't care what actually happens, because the 48 tokens of context I supposedly trade for flawless performance isn't something to be worried about.
"Solved" here https://github.com/oobabooga/text-generation-webui/pull/2136, but this is a symptom of another error, probably out of memory?
This even happens when running CPU models on a Threadripper 3970X (32C/64T) with 256 GB of RAM... tried it with --mlock and without, and with various models from 13B to 65B. It seems to be related to the length of the new input, as it even happens with the very first input after loading the webui. I didn't test in detail, but around 200 to 300 tokens are fine, while 400 to 500 create the error. The strange thing is that it sometimes works... and sometimes not.
The issue started for me when I updated to the newest version on May 16 or so. I did the previous update a few days before that, and everything worked fine up to May 16.
Traceback (most recent call last):
  File "H:\llm\installer_files\env\lib\site-packages\gradio\routes.py", line 414, in run_predict
    output = await app.get_blocks().process_api(
  File "H:\llm\installer_files\env\lib\site-packages\gradio\blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "H:\llm\installer_files\env\lib\site-packages\gradio\blocks.py", line 1067, in call_function
    prediction = await utils.async_iteration(iterator)
  File "H:\llm\installer_files\env\lib\site-packages\gradio\utils.py", line 339, in async_iteration
    return await iterator.__anext__()
  File "H:\llm\installer_files\env\lib\site-packages\gradio\utils.py", line 332, in __anext__
    return await anyio.to_thread.run_sync(
  File "H:\llm\installer_files\env\lib\site-packages\anyio\to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "H:\llm\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "H:\llm\installer_files\env\lib\site-packages\anyio\_backends\_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "H:\llm\installer_files\env\lib\site-packages\gradio\utils.py", line 315, in run_sync_iterator_async
    return next(iterator)
  File "H:\llm\text-generation-webui\modules\chat.py", line 307, in generate_chat_reply_wrapper
    for history in generate_chat_reply(text, state, regenerate, _continue):
  File "H:\llm\text-generation-webui\modules\chat.py", line 301, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue):
  File "H:\llm\text-generation-webui\modules\chat.py", line 224, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, eos_token=eos_token, stopping_strings=stopping_strings, is_chat=True)):
  File "H:\llm\text-generation-webui\modules\text_generation.py", line 184, in generate_reply
    for reply in generate_func(question, original_question, seed, state, eos_token, stopping_strings, is_chat=is_chat):
  File "H:\llm\text-generation-webui\modules\text_generation.py", line 316, in generate_reply_custom
    new_tokens = len(encode(original_question + reply)[0]) - original_tokens
UnboundLocalError: local variable 'reply' referenced before assignment
That definitely seems like part of the issue. I've noticed it happens almost immediately with some of my characters with the long_term_memory extension enabled. As soon as it adds more information to the context, it fails immediately. Adding
reply = ''
to text_generation.py then gives me an output of 'out of memory' when I'm running on CPU with 128 gigs of RAM and sitting at almost alarmingly low memory usage.
It'll still fail with characters loaded and long_term_memory not enabled, but it takes longer to do so.
There seems to be a connection with the "max_new_tokens" setting in the parameters. If I set it manually to some value that's lower than 2000 minus the number of tokens in the input, it works.
An input of 630 tokens works with max_new_tokens at 1300 but not at 1500. I suppose there is a check somewhere in the code now? The output just stopped when the token limit was reached in previous versions...
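That would be consistent with the prompt and the requested new tokens having to fit into the model's 2048-token context window together. A rough sanity check, as an illustrative Python sketch (not webui code; the 2048 limit is the n_ctx value llama.cpp reports in the load logs later in this thread):

CTX_LEN = 2048  # n_ctx reported by llama.cpp for these models

def fits(input_tokens: int, max_new_tokens: int) -> bool:
    # Prompt tokens plus requested new tokens must fit in the context window.
    return input_tokens + max_new_tokens <= CTX_LEN

print(fits(630, 1300))  # True:  630 + 1300 = 1930 <= 2048 (works, per the report above)
print(fits(630, 1500))  # False: 630 + 1500 = 2130 >  2048 (reportedly fails)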
"Solved" here #2136, but this is a symptom of another error, probably out of memory?
I'll try this, thanks
"Solved" here #2136, but this is a symptom of another error, probably out of memory?
Does work, llama.cpp 0.1.51 is just spewing out code lol
Never mind, downgrading worked, I'm going to close this now
Reopened because it now magically stops generating instead of throwing the UnboundLocalError. I'm running with --triton --n-gpu-layers 10000 --cache-capacity 8000.
INFO:Cache capacity is 8000 bytes
llama.cpp: loading model from models\Wizard-Vicuna-7B-Uncensored.ggmlv2.q5_1.bin
llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 72.75 KB
llama_model_load_internal: mem required = 6612.59 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 6.87 seconds.
INFO:Loading the extension "gallery"...
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Llama._create_completion: cache miss
Llama._create_completion: cache save
Llama.save_state: saving 22737964 bytes of llama state
llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 0.64 ms / 3 runs (0.21 ms per token)
llama_print_timings: prompt eval time = 3503.85 ms / 41 tokens (85.46 ms per token)
llama_print_timings: eval time = 747.73 ms / 2 runs (373.87 ms per token)
llama_print_timings: total time = 4598.33 ms
Output generated in 5.15 seconds (0.39 tokens/s, 2 tokens, context 41, seed 424028198)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 41612332 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 4.50 ms / 20 runs (0.22 ms per token)
llama_print_timings: prompt eval time = 1330.16 ms / 17 tokens (78.24 ms per token)
llama_print_timings: eval time = 6751.96 ms / 19 runs (355.37 ms per token)
llama_print_timings: total time = 10545.49 ms
Output generated in 10.97 seconds (1.73 tokens/s, 19 tokens, context 60, seed 410699725)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 39515180 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 2.39 ms / 10 runs (0.24 ms per token)
llama_print_timings: prompt eval time = 4336.88 ms / 65 tokens (66.72 ms per token)
llama_print_timings: eval time = 2875.09 ms / 9 runs (319.45 ms per token)
llama_print_timings: total time = 8348.32 ms
Output generated in 8.95 seconds (1.01 tokens/s, 9 tokens, context 66, seed 1575520995)
Output generated in 0.52 seconds (0.00 tokens/s, 0 tokens, context 70, seed 1631212757)
Output generated in 0.49 seconds (0.00 tokens/s, 0 tokens, context 71, seed 865281745)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Output generated in 9.99 seconds (1.80 tokens/s, 18 tokens, context 43, seed 428689201)
Output generated in 0.39 seconds (0.00 tokens/s, 0 tokens, context 61, seed 1255914242)
Output generated in 0.41 seconds (0.00 tokens/s, 0 tokens, context 129, seed 1473116747)
Output generated in 0.40 seconds (0.00 tokens/s, 0 tokens, context 49, seed 2096328356)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 22737964 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 0.68 ms / 3 runs (0.23 ms per token)
llama_print_timings: prompt eval time = 467.04 ms / 6 tokens (77.84 ms per token)
llama_print_timings: eval time = 587.56 ms / 2 runs (293.78 ms per token)
llama_print_timings: total time = 1388.98 ms
Output generated in 1.79 seconds (1.11 tokens/s, 2 tokens, context 41, seed 529334198)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 45806636 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 9.23 ms / 43 runs (0.21 ms per token)
llama_print_timings: prompt eval time = 824.12 ms / 10 tokens (82.41 ms per token)
llama_print_timings: eval time = 12144.30 ms / 42 runs (289.15 ms per token)
llama_print_timings: total time = 17187.52 ms
Output generated in 17.59 seconds (2.39 tokens/s, 42 tokens, context 45, seed 493354527)
Output generated in 0.44 seconds (0.00 tokens/s, 0 tokens, context 87, seed 795152686)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 27456556 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 2.16 ms / 11 runs (0.20 ms per token)
llama_print_timings: prompt eval time = 595.82 ms / 7 tokens (85.12 ms per token)
llama_print_timings: eval time = 3037.12 ms / 10 runs (303.71 ms per token)
llama_print_timings: total time = 4818.01 ms
Output generated in 5.26 seconds (1.90 tokens/s, 10 tokens, context 42, seed 364646693)
Llama._create_completion: cache miss
Llama.generate: prefix-match hit
Llama._create_completion: cache save
Llama.save_state: saving 87749676 bytes of llama state

llama_print_timings: load time = 3504.17 ms
llama_print_timings: sample time = 26.13 ms / 124 runs (0.21 ms per token)
llama_print_timings: prompt eval time = 657.34 ms / 9 tokens (73.04 ms per token)
llama_print_timings: eval time = 35427.72 ms / 123 runs (288.03 ms per token)
llama_print_timings: total time = 48059.42 ms
Output generated in 48.44 seconds (2.54 tokens/s, 123 tokens, context 44, seed 1819441613)
Output generated in 0.39 seconds (0.00 tokens/s, 0 tokens, context 382, seed 39045789)
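As a rough cross-check of the memory figures in the load log above, here is a back-of-the-envelope sketch in Python (assumed formula for an f16 KV cache, using only the n_layer/n_ctx/n_embd values llama.cpp printed; not webui code):

# Back-of-the-envelope KV-cache size from the parameters in the load log above.
# Assumed formula: 2 tensors (K and V) x n_layer x n_ctx x n_embd x 2 bytes (f16).
n_layer, n_ctx, n_embd = 32, 2048, 4096
kv_bytes = 2 * n_layer * n_ctx * n_embd * 2
print(kv_bytes / 2**20)  # 1024.0 -- matches "kv self size = 1024.00 MB" above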
Reopened because I think this issue is related to this error. P.S. I updated to the latest version of llama.cpp; it could also be the outdated model. Just got to wait for the GGMLv3 models to finish.
GGMLv3 isn't supported by ooba 💀
Which one do you want to load? 1-6
3
INFO:Loading Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin...
INFO:llama.cpp weights detected: models\Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin
INFO:Cache capacity is 8000 bytes
llama.cpp: loading model from models\Wizard-Vicuna-7B-Uncensored.ggmlv3.q5_1.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
llama_init_from_file: failed to load model
Traceback (most recent call last):
File "C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\text-generation-webui\server.py", line 998, in
Done!
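For anyone hitting the same unknown (magic, version) error: below is a small, hypothetical helper (not part of the webui or llama.cpp) that reads the header of a GGML/GGJT model file. In the log above, 67676a74 is the magic "ggjt" and 00000003 is file version 3, i.e. a GGMLv3 file that the bundled llama-cpp-python build did not yet understand. The sketch assumes a ggjt-style header with a version field after the magic.

import struct
import sys

def ggml_header(path):
    # Read magic (uint32, little-endian) and version (uint32) from the file header.
    # Legacy "ggml"-magic files have no version field, so the second value is
    # only meaningful for ggmf/ggjt-style headers.
    with open(path, "rb") as f:
        magic, version = struct.unpack("<II", f.read(8))
    return magic.to_bytes(4, "big").decode("ascii", errors="replace"), version

if __name__ == "__main__":
    magic, version = ggml_header(sys.argv[1])
    print(f"magic={magic!r} version={version}")
    # e.g. magic='ggjt' version=3 for the file that failed to load above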
Describe the bug
"reply" wasnt referenced or smth, so i guess it breaks every time i send something. i have gotten around this but thats still unreliable. triton helps a ton
Is there an existing issue for this?
Reproduction
Load GGML models with these flags: --chat --triton --xformers (not sure why) --n-gpu-layers 200 --threads 14 --extensions long_term_memory
Screenshot
No response
Logs
System Info