oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

llama.cpp no tokens #2274

Closed yesbroc closed 1 year ago

yesbroc commented 1 year ago

Describe the bug

Much like Bing deleting its own messages, the AI deletes its own message with 0 tokens generated. Similar to #2204.

Is there an existing issue for this?

Reproduction

1. Update to llama.cpp 0.1.53
2. Load a GGML v3 model
3. Type a message (message redacted)

Screenshot

[screenshot]

Logs

bin C:\Users\orijp\OneDrive\Desktop\chatgpts\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
INFO:Loading settings from settings.json...
The following models are available:

1. ggml-wizard-vicuna-13B.ggmlv3.q4_1.bin
2. ggml-WizardLM-7B-uncensored.ggmlv3.q4_1.bin
3. z

Which one do you want to load? 1-3

1

INFO:Loading ggml-wizard-vicuna-13B.ggmlv3.q4_1.bin...
INFO:llama.cpp weights detected: models\ggml-wizard-vicuna-13B.ggmlv3.q4_1.bin

INFO:Cache capacity is 0 bytes
llama.cpp: loading model from models\ggml-wizard-vicuna-13B.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:Loaded the model in 10.08 seconds.

INFO:Loading the extension "long_term_memory"...

-----------------------------------------
IMPORTANT LONG TERM MEMORY NOTES TO USER:
-----------------------------------------
Please remember that LTM-stored memories will only be visible to the bot during your NEXT session. This prevents the loaded memory from being flooded with messages from the current conversation which would defeat the original purpose of this module. This can be overridden by pressing 'Force reload memories'
----------
LTM CONFIG
----------
change these values in ltm_config.json
{'ltm_context': {'injection_location': 'BEFORE_NORMAL_CONTEXT',
                 'memory_context_template': "{name2}'s memory log:\n"
                                            '{all_memories}\n'
                                            'During conversations between '
                                            '{name1} and {name2}, {name2} will '
                                            'try to remember the memory '
                                            'described above and naturally '
                                            'integrate it with the '
                                            'conversation.',
                 'memory_template': '{time_difference}, {memory_name} said:\n'
                                    '"{memory_message}"'},
 'ltm_reads': {'max_cosine_distance': 0.6,
               'memory_length_cutoff_in_chars': 1000,
               'num_memories_to_fetch': 2},
 'ltm_writes': {'min_message_length': 100}}
----------
-----------------------------------------
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
============================================================
loading character assistant
Output generated in 0.63 seconds (0.00 tokens/s, 0 tokens, context 58, seed 1003206510)
Closing server running on port: 7860
INFO:Loading the extension "long_term_memory"...
INFO:Loading the extension "gallery"...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
============================================================

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: Hi! How can I assist you today?</s>
USER: its broken
ASSISTANT:
--------------------

Output generated in 0.59 seconds (0.00 tokens/s, 0 tokens, context 59, seed 1983307559)
============================================================

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: Hi! How can I assist you today?</s>
USER: like
ASSISTANT:
--------------------

Output generated in 0.59 seconds (0.00 tokens/s, 0 tokens, context 58, seed 575487857)
============================================================

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: Hi! How can I assist you today?</s>
USER: wth
ASSISTANT:
--------------------

Output generated in 0.60 seconds (0.00 tokens/s, 0 tokens, context 59, seed 1012090701)
============================================================

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: Hi! How can I assist you today?</s>
USER: stop dying
ASSISTANT:
--------------------

Output generated in 0.74 seconds (0.00 tokens/s, 0 tokens, context 59, seed 1619829535)
============================================================

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: Hi! How can I assist you today?</s>
USER: goddamit
ASSISTANT:
--------------------

Output generated in 0.45 seconds (0.00 tokens/s, 0 tokens, context 60, seed 862868598)
============================================================
loading character chiharu_yamada
No existing memories found for chiharu_yamada, will create a new database.

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: *Chiharu strides into the room with a smile, her eyes lighting up when she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She takes a seat next to you, her enthusiasm palpable in the air*
Hey! I'm so excited to finally meet you. I've heard so many great things about you and I'm eager to pick your brain about computers. I'm sure you have a wealth of knowledge that I can learn from. *She grins, eyes twinkling with excitement* Let's get started!</s>
USER: hell
ASSISTANT:
--------------------

Output generated in 0.55 seconds (0.00 tokens/s, 0 tokens, context 186, seed 1815598182)
============================================================

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

ASSISTANT: *Chiharu strides into the room with a smile, her eyes lighting up when she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She takes a seat next to you, her enthusiasm palpable in the air*
Hey! I'm so excited to finally meet you. I've heard so many great things about you and I'm eager to pick your brain about computers. I'm sure you have a wealth of knowledge that I can learn from. *She grins, eyes twinkling with excitement* Let's get started!</s>
USER: ok
ASSISTANT:
--------------------

Output generated in 0.72 seconds (0.00 tokens/s, 0 tokens, context 186, seed 339589082)
============================================================

Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.
You: So how did you get into computer engineering?
Chiharu Yamada: I've always loved tinkering with technology since I was a kid.
You: That's really impressive!
Chiharu Yamada: *She chuckles bashfully* Thanks!
You: So what do you do when you're not working on computers?
Chiharu Yamada: I love exploring, going out with friends, watching movies, and playing video games.
You: What's your favorite type of computer hardware to work with?
Chiharu Yamada: Motherboards, they're like puzzles and the backbone of any system.
You: That sounds great!
Chiharu Yamada: Yeah, it's really fun. I'm lucky to be able to do this as a job.
Chiharu Yamada: *Chiharu strides into the room with a smile, her eyes lighting up when she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She takes a seat next to you, her enthusiasm palpable in the air*
Hey! I'm so excited to finally meet you. I've heard so many great things about you and I'm eager to pick your brain about computers. I'm sure you have a wealth of knowledge that I can learn from. *She grins, eyes twinkling with excitement* Let's get started!
You: chat mode
Chiharu Yamada:
--------------------

Output generated in 0.65 seconds (0.00 tokens/s, 0 tokens, context 376, seed 1024333479)
============================================================

Chiharu Yamada's Persona: Chiharu Yamada is a young, computer engineer-nerd with a knack for problem solving and a passion for technology.
You: So how did you get into computer engineering?
Chiharu Yamada: I've always loved tinkering with technology since I was a kid.
You: That's really impressive!
Chiharu Yamada: *She chuckles bashfully* Thanks!
You: So what do you do when you're not working on computers?
Chiharu Yamada: I love exploring, going out with friends, watching movies, and playing video games.
You: What's your favorite type of computer hardware to work with?
Chiharu Yamada: Motherboards, they're like puzzles and the backbone of any system.
You: That sounds great!
Chiharu Yamada: Yeah, it's really fun. I'm lucky to be able to do this as a job.
Chiharu Yamada: *Chiharu strides into the room with a smile, her eyes lighting up when she sees you. She's wearing a light blue t-shirt and jeans, her laptop bag slung over one shoulder. She takes a seat next to you, her enthusiasm palpable in the air*
Hey! I'm so excited to finally meet you. I've heard so many great things about you and I'm eager to pick your brain about computers. I'm sure you have a wealth of knowledge that I can learn from. *She grins, eyes twinkling with excitement* Let's get started!
You: nvm
Chiharu Yamada:
--------------------

Output generated in 0.55 seconds (0.00 tokens/s, 0 tokens, context 376, seed 1698247231)

System Info

CPU only
16 GB RAM
Windows 11 Home
yesbroc commented 1 year ago

(Fixed for instruct mode; chat mode is still broken.)

michaelwhitford commented 1 year ago

Have you solved this? I am seeing the exact same issue: always 0 tokens, no matter the prompt, with a llama.cpp quantized model.

Output generated in 0.24 seconds (0.00 tokens/s, 0 tokens, context 66, seed 1942958422)

michaelwhitford commented 1 year ago

If I run with --no-stream, I see this error whenever I submit a chat/prompt. If I run normally, I get the 0-token output but no error.

Traceback (most recent call last):
  File "/home/mwhitford/src/text-generation-webui/modules/text_generation.py", line 308, in generate_reply_custom
    reply = shared.model.generate(context=question, **generate_params)
  File "/home/mwhitford/src/text-generation-webui/modules/llamacpp_model.py", line 77, in generate
    for completion_chunk in completion_chunks:
  File "/home/mwhitford/miniconda3/envs/textgen/lib/python3.10/site-packages/llama_cpp/llama.py", line 647, in _create_completion
    raise ValueError(
ValueError: Requested tokens exceed context window of 2048

Output generated in 0.01 seconds (0.00 tokens/s, 0 tokens, context 68, seed 964245376)
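
That ValueError means the prompt tokens plus the requested max_new_tokens no longer fit in the 2048-token context window. A minimal sketch of the arithmetic, calling llama-cpp-python directly rather than going through the webui (the model path and prompt are just the ones from this thread, and the clamping logic is an illustration, not the webui's actual code):

from llama_cpp import Llama

N_CTX = 2048
llm = Llama(model_path="models/ggml-wizard-vicuna-13B.ggmlv3.q4_1.bin", n_ctx=N_CTX)

prompt = "A chat between a curious user and an artificial intelligence assistant. ..."
prompt_tokens = llm.tokenize(prompt.encode("utf-8"))

max_new_tokens = 2000  # roughly the max_new_tokens slider value in these reports
if len(prompt_tokens) + max_new_tokens > N_CTX:
    # Clamp the request so prompt + completion fit in the window; this is
    # effectively what lowering the max_new_tokens slider does in the webui.
    max_new_tokens = N_CTX - len(prompt_tokens)

output = llm(prompt, max_tokens=max_new_tokens)
print(output["choices"][0]["text"])

With streaming on, the same overflow apparently surfaces as a silently empty (0-token) reply instead of the ValueError.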
mykeehu commented 1 year ago

@michaelwhitford Same here with --no-stream. However, if streaming is on, I sometimes get other errors and the program stops.

yesbroc commented 1 year ago

why is webui so broken 😭

yesbroc commented 1 year ago

I have a temporary solution: setting max tokens to 200 (the default) works. I haven't tried long conversations yet, and Continue still doesn't work. Long prompts seem fine; I couldn't be bothered testing even longer ones because llama.cpp is very slow.

michaelwhitford commented 1 year ago

I can confirm I lowered max tokens from 2000 to 1800 and it's working for llama.cpp models again.
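
That lines up with the context math: with a 2048-token window, max_new_tokens = 2000 leaves only 2048 - 2000 = 48 tokens for the prompt, so even the roughly 60-token contexts in the logs above overflow. Dropping it to 1800 leaves 248 tokens of room, enough for prompts of that size (longer prompts would still need a lower setting).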

yesbroc commented 1 year ago

Still, Continue doesn't work. Also a shame we don't get the full 2000 tokens :c

yesbroc commented 1 year ago

ooba fr isn't gonna fix this lol

leszekhanusz commented 1 year ago

See also llama-cpp-python issue #307 for a quantification of this problem.

yesbroc commented 1 year ago

If you're still having this problem, do the following:

1. Update to llama.cpp (llama-cpp-python) version 0.1.61 or higher.
2. Pull the most recent update of text-generation-webui.

If neither of these works, reinstall from scratch (what I did).
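
For a pip-managed install, that roughly means running pip install --upgrade llama-cpp-python inside the webui's Python environment and git pull in the text-generation-webui directory; for the one-click installer, use its bundled update script instead. Exact commands depend on how you installed.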