oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Incorrect context size with llama.cpp #2538

Closed Dampfinchen closed 1 year ago

Dampfinchen commented 1 year ago

Describe the bug

Hello,

First off, I'm using Windows with llama.cpp built with cuBLAS enabled.

I've noticed that with newer Ooba versions, the effective context size with llama.cpp is wrong: it behaves as if the context were only around 900 tokens, even though I've set the maximum for my llama-based model (n_ctx=2048). llama.cpp itself reports the correct n_ctx of 2048. But when I send a prompt larger than roughly 900 tokens, the AI outputs nothing (by the way, there should be a more elegant way to handle overly long prompts than this). When I'm chatting with the AI, it forgets things after around 900 tokens. So something here is not right, and the cause is probably a recent change, since I haven't had this issue to this degree before.

Reproduction

Load a llama.cpp model and send a large prompt (at least 1k tokens, but 1,800 tokens if you want to be on the safe side). You will see that the AI generates no output.
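
For anyone trying to reproduce this outside the webui, here is a minimal sketch that calls llama-cpp-python directly (the model path and prompt are placeholders; this only relies on the library's high-level API, not the webui's own code):

from llama_cpp import Llama

# Placeholder path; any llama-based GGML model with a 2048 context should do.
llm = Llama(model_path="models/llama-7b.ggmlv3.q4_0.bin", n_ctx=2048)

# Build a prompt that tokenizes to roughly 1,700-1,800 tokens.
prompt = "The quick brown fox jumps over the lazy dog. " * 160
print("prompt tokens:", len(llm.tokenize(prompt.encode("utf-8"))))

# The webui returns an empty response for prompts of this size; calling the
# library directly helps narrow down whether the limit comes from the webui
# or from llama-cpp-python itself.
out = llm(prompt, max_tokens=64)
print(repr(out["choices"][0]["text"]))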

Screenshot

No response

Logs

0 Tokens generated in 0.00 seconds (0 tokens/second)

System Info

RTX 2060, Core i7 9750H, Windows 11
Priestru commented 1 year ago

Similar problem, but for me it only works up to around 1600 tokens. I tested the sent prompt with the OpenAI tokenizer and it's indeed nowhere close to the expected 2k context length. For some reason it doesn't use the full available context, and I have no idea why.

Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1934, seed 1642416811)
Output generated in 0.32 seconds (0.00 tokens/s, 0 tokens, context 1865, seed 56475296)
Output generated in 0.28 seconds (0.00 tokens/s, 0 tokens, context 1801, seed 764261126)
Output generated in 0.28 seconds (0.00 tokens/s, 0 tokens, context 1801, seed 284498499)
llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
llama_tokenize: too many tokens
Llama.generate: prefix-match hit
127.0.0.1 - - [06/Jun/2023 23:34:49] "GET /api/v1/model HTTP/1.1" 200 -

llama_print_timings:        load time = 17640.66 ms
llama_print_timings:      sample time =     5.04 ms /    30 runs   (    0.17 ms per token)
llama_print_timings: prompt eval time = 43938.51 ms /  1627 tokens (   27.01 ms per token)
llama_print_timings:        eval time = 20225.53 ms /    29 runs   (  697.43 ms per token)
llama_print_timings:       total time = 64247.48 ms
Output generated in 64.54 seconds (0.45 tokens/s, 29 tokens, context 1628, seed 1705199859)

I tested this not via the API but in the webui directly. It indeed somehow dislikes long prompts.

Midhra8 commented 1 year ago

Having the same issue. I can generate up to ~1800 tokens and then it starts taking a long time to generate one sentence. (I have the feeling it regenerates the whole chat history)

Priestru commented 1 year ago

Having the same issue. I can generate up to ~1800 tokens and then it starts taking a long time to generate one sentence. (I have the feeling it regenerates the whole chat history)

That's a different issue: your state cache is failing to prefix-match, so cached evaluation can't speed things up for you. Update the webui and make sure https://github.com/abetlen/llama-cpp-python is built with BLAS support.
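
If you want to check whether the installed llama-cpp-python build actually has BLAS/cuBLAS enabled, one option is a quick sketch against the low-level bindings (the exact names may differ between versions, so treat this as an assumption):

import llama_cpp

# Prints the compile-time feature string; look for "BLAS = 1" in the output.
info = llama_cpp.llama_print_system_info()
print(info.decode("utf-8", errors="ignore"))

The same feature line is also printed to the console when a model is loaded, so checking the webui's startup log works as well.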

Dampfinchen commented 1 year ago

Similar problem, but for me it only works up to around 1600 tokens. [...] I tested this not via the API but in the webui directly. It indeed somehow dislikes long prompts.

Yup, exactly my problem. If you send a huge prompt first, you will likely encounter this even if it's well below 2048 tokens. Pretty strange.

I've had this problem before, but now it's gotten much more aggressive.

Also, even when a prompt does exceed the 2048-token context limit, there should be a better way to handle it than an empty response. Maybe the prompt could be truncated or processed in batches; I think Koboldcpp does something like that.
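
To illustrate that suggestion, here is a minimal sketch (an assumption about how it could be handled, not the webui's actual code) of truncating an over-long prompt on the client side with llama-cpp-python so the model always receives something that fits:

from llama_cpp import Llama

llm = Llama(model_path="models/llama-7b.ggmlv3.q4_0.bin", n_ctx=2048)  # placeholder path

def generate_with_truncation(prompt: str, max_new_tokens: int = 200) -> str:
    tokens = llm.tokenize(prompt.encode("utf-8"))
    budget = llm.n_ctx() - max_new_tokens        # leave room for the reply
    if len(tokens) > budget:
        tokens = tokens[-budget:]                # keep the most recent context
        prompt = llm.detokenize(tokens).decode("utf-8", errors="ignore")
    out = llm(prompt, max_tokens=max_new_tokens)
    return out["choices"][0]["text"]

Dropping the oldest tokens is the simplest policy; processing the prompt in batches, as suggested above, would be an alternative.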

Priestru commented 1 year ago

Yup, exactly my problem. If you send a huge prompt first, you will likely encounter this even if it's well below 2048 tokens. [...] Maybe the prompt could be truncated or processed in batches; I think Koboldcpp does something like that.

I guess ooba has nothing to do with the issue then; it must lie within https://github.com/abetlen/llama-cpp-python.

Priestru commented 1 year ago

I tricked it into working by increasing n_ctx to 2400.

Output generated in 59.23 seconds (0.20 tokens/s, 12 tokens, context 2049, seed 461475505)

Output generated in 111.17 seconds (1.03 tokens/s, 114 tokens, context 1925, seed 124993566)

Okay, here's what I've done:

I set n_ctx to 2400, and in SillyTavern I set Token Padding to -260. This lets me use around 1950 tokens, which is good enough and much better than 1650.

Output generated in 133.02 seconds (0.92 tokens/s, 123 tokens, context 1925, seed 514921505)
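
For reference, the same padding workaround when loading directly through llama-cpp-python would look roughly like this (placeholder model path; whether the extra headroom is still needed depends on the library version):

from llama_cpp import Llama

# Pad n_ctx above the model's nominal 2048 so prompts near the limit are not
# rejected with "llama_tokenize: too many tokens" (a workaround, not a fix).
llm = Llama(model_path="models/llama-7b.ggmlv3.q4_0.bin", n_ctx=2400)
print(llm.n_ctx())  # should report 2400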

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.