serge-chat / serge

A web interface for chatting with Alpaca through llama.cpp. Fully dockerized, with an easy-to-use API.
https://serge.chat
Apache License 2.0

šŸ› [Bug]: New install - response keeps repeating the last line #1182

Open · DeadEnded opened this issue 9 months ago

DeadEnded commented 9 months ago

Bug description

I just pulled the image and spun up a container with the default settings. I downloaded the Mistral-7B model and left everything at its defaults. I've tried a few short questions, and the answer keeps repeating its last line until I stop the container.

Steps to reproduce

1) Spin up a new container with default settings (from the repo; a typical invocation is sketched below)
2) Download Mistral-7B
3) Start a new chat and ask "what is the square root of nine"
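
For step 1, a minimal way to bring up the container, assuming the docker run invocation from the project README (volume names, paths, and the published port may differ between releases):

docker run -d \
  --name serge \
  -v weights:/usr/src/app/weights \
  -v datadb:/data/db/ \
  -p 8008:8008 \
  ghcr.io/serge-chat/serge:latest

The web UI should then be reachable at http://localhost:8008, where the model download and chat from steps 2 and 3 happen.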

Environment Information

Docker version: 25.0.3
OS: Ubuntu 22.04.4 LTS on kernel 5.15.0-97
CPU: AMD Ryzen 5 2400G
Browser: Firefox version 123.0

Screenshots

[screenshot of the repeating response]

Relevant log output

llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4165.37 MiB
...............................................................................................
llama_new_context_with_model: n_ctx      = 2153
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   269.13 MiB
llama_new_context_with_model: KV self size  =  269.12 MiB, K (f16):  134.56 MiB, V (f16):  134.56 MiB
llama_new_context_with_model:        CPU input buffer size   =    12.22 MiB
llama_new_context_with_model:        CPU compute buffer size =   174.42 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.rope.freq_base': '10000.000000', 'llama.context_length': '32768', 'general.name': 'mistralai_mistral-7b-v0.1', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
Received termination signal!
++ _term
++ echo 'Received termination signal!'
++ kill -TERM 18
++ kill -TERM 19
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...
18:signal-handler (1709671894) Received SIGTERM scheduling shutdown...


SolutionsKrezus commented 7 months ago

Hello, I have the same bug when using Mistral or Mixtral for text generation. It keeps repeating the last sentence over and over until I restart the container. I tried increasing the repeat penalty, but it has no effect.
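
For context, serge drives generation through llama-cpp-python, and the repeat penalty is one of the sampling parameters that library exposes. A minimal standalone sketch of how that parameter is typically passed, not serge's actual code, and with a hypothetical model path:

from llama_cpp import Llama

# Hypothetical path to the downloaded GGUF weights; adjust to your setup.
llm = Llama(model_path="weights/mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)

# repeat_penalty > 1.0 penalizes tokens that already appeared in the context.
out = llm(
    "Q: What is the square root of nine? A:",
    max_tokens=64,
    repeat_penalty=1.3,
    stop=["Q:"],
)
print(out["choices"][0]["text"])

If even a high penalty makes no visible difference, the looping is apparently happening regardless of sampling settings, which would be consistent with a library-level bug rather than a tuning problem.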

fishscene commented 7 months ago

I've noticed this for most, if not all, of the models I can test. This bug essentially makes serge useless. Update: Reverting to "ghcr.io/serge-chat/serge:0.8.2" appears to vastly improve or even eliminate the repeating issue. Still testing.
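
For anyone who wants to try the same workaround, pinning the image tag just means pulling the older release and recreating the container. A sketch, assuming the container name, volumes, and port from the README invocation above:

docker pull ghcr.io/serge-chat/serge:0.8.2
docker stop serge && docker rm serge
docker run -d \
  --name serge \
  -v weights:/usr/src/app/weights \
  -v datadb:/data/db/ \
  -p 8008:8008 \
  ghcr.io/serge-chat/serge:0.8.2

Since the weights live in a named volume, they should survive the container swap without being re-downloaded.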

gaby commented 7 months ago

This is probably a bug in llama-cpp-python. I will update it this week and do a new release.

Which specific model are you all using? @SolutionsKrezus @fishscene

SolutionsKrezus commented 7 months ago

I'm currently using Mistral 7B and Mixtral, @gaby. I reverted to 0.8.0 and it works like a charm.

fishscene commented 7 months ago

> This is probably a bug in llama-cpp-python. I will update it this week and do a new release.
>
> Which specific model are you all using? @SolutionsKrezus @fishscene

Apologies, I'm at work at the moment. All the models I tested were affected to some degree, some more than others.

Off the top of my head: all current Mixtral models, at least two Mistral models, Neural Chat, one of the medical ones, and definitely a few more. I did not test anything above 13B, as those are beyond my hardware.

I would see random replies marked/flagged as code snippets... and if the model started repeating itself, that was the end of anything useful, as all subsequent replies would only repeat.

In all the testing I did, getting 10 coherent replies was a major milestone, and even then it sometimes took multiple rounds of re-prompting (deleting my query and asking it slightly differently) to get to 10. A couple of models started spewing nonsense and repetitions on the very first response.

All this to say, testing should be very easy to do. When I reverted to the previous serge release, I immediately saw an improvement.

Curious, though: OP is using a Ryzen, and so am I: Ryzen 1700X, 32GB RAM, no CUDA GPU (an NVIDIA T400, I think). Using the CPU for AI.

Maybe this is isolated to Ryzen CPUs?

Another behavior to note: when asking some censored models a question, they have no reply at all, and no detectable CPU is used either. It's as if some pre-inference step said "nope" and never passed my query along to the model itself. There's a name for this pre-processing step, but it escapes me at the moment. Not sure if it's a clue either.

SolutionsKrezus commented 7 months ago

I don't think it's a Ryzen-related issue, @fishscene. I have the same problem on an Intel Xeon D-1540 with 32GB RAM and no GPU.

JuniperChris929 commented 7 months ago

Same issue here. This pretty much renders the software completely useless :(

gaby commented 1 month ago

Can you try ghcr.io/serge-chat/serge:main? Thanks!
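
For reference, switching an existing install to that tag follows the same pattern as the downgrade sketch above, just with the main tag, assuming the container is named serge:

docker pull ghcr.io/serge-chat/serge:main
# then stop and remove the old container and re-run it with ghcr.io/serge-chat/serge:main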