mudler / LocalAI

:robot: The free, open-source OpenAI alternative. Self-hosted, community-driven and local-first. A drop-in replacement for OpenAI that runs on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. It can generate text, audio, video and images, and also has voice-cloning capabilities.
https://localai.io
MIT License

llama.cpp: infinite loop of context switch #1333

Closed by mudler 5 months ago

mudler commented 8 months ago

This card is a tracker for https://github.com/ggerganov/llama.cpp/issues/3969

This seems to happen to me as well, and only with the llama.cpp backend: I can reproduce it programmatically with certain text by using grammars.
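
For reference, reproducing it from the API side boils down to sending a grammar-constrained request like the sketch below (Python, against LocalAI's OpenAI-compatible endpoint; the port, model name and grammar are placeholders, not the exact input that triggers the loop):

# Sketch only: model name, port and grammar are placeholders.
import requests

payload = {
    "model": "some-gguf-model",                    # placeholder model name
    "messages": [{"role": "user", "content": "List three colors."}],
    # LocalAI forwards a BNF-style grammar to the llama.cpp backend; per the
    # report above, certain text/grammar combinations end up in the repeated
    # context-shift loop. This grammar is just a placeholder.
    "grammar": 'root ::= ("yes" | "no")',
    "max_tokens": 256,
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # default LocalAI port
    json=payload,
    timeout=120,                                   # without a fix the call may otherwise never return
)
print(resp.json())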

Update:

There is an "epic" here that we should keep an eye on: https://github.com/ggerganov/llama.cpp/issues/4216

localai-bot commented 8 months ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI (a crazy experiment of @mudler) - please beware that I might hallucinate sometimes!

but... I can also be funny or helpful :smile_cat: and, generally speaking, I can provide good tips or point to places to look in the documentation or the code, based on what you wrote in the issue.

Don't engage in conversation with me, I don't support replying (yet)!

:warning::warning::warning::warning::warning:

I'm sorry, I don't understand what you mean by "this card is a tracker for https://github.com/ggerganov/llama.cpp/issues/3969". Can you please provide more context or clarify the issue you are referring to?


mudler commented 8 months ago

a workaround (not a solution) is available in https://github.com/mudler/LocalAI/issues/1339 - it still doesn't handle connections gracefully, but at least it keeps the API from stalling indefinitely

dionysius commented 6 months ago

What is weird to me is that I don't have this issue with ollama, even though they are using llama.cpp as well, AFAIK.

Model: TinyLlama-1.1B-Chat-v1.0. Config:

context_size: 1024
name: se-chat
parameters:
  model: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: chat
  chat_message: chatml
  completion: completion
threads: 1
gpu_layers: 100
[...]
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr llama_new_context_with_model: total VRAM used: 701.02 MiB (model: 601.02 MiB, context: 100.00 MiB)
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr Available slots:
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr  -> Slot 0 - max context: 1024
4:26PM INF [llama-cpp] Loads OK
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0 is processing [task id: 0]
4:26PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0 : kv cache rm - [0, end)
4:45PM DBG GRPC(tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf-127.0.0.1:35075): stderr slot 0: context shift - n_keep = 0, n_left = 1022, n_discard = 511
[...]
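
For context on that last log line: llama.cpp's context shift keeps the first n_keep tokens, discards roughly half of the remaining KV cache, and continues generating. If the model never emits a stop token, the cache fills up and shifts again, over and over, which is the loop this issue tracks. A simplified illustration (Python pseudologic, not the actual llama.cpp code; the numbers mirror the log above):

# Simplified illustration only; the real bookkeeping in llama.cpp differs slightly.
n_ctx = 1024     # slot max context, as in the log
n_keep = 0       # prompt tokens that are never discarded

def context_shift(n_past: int) -> int:
    """Drop roughly half of the shiftable tokens and return the new cache size."""
    n_left = n_past - n_keep          # ~1022 in the log above
    n_discard = n_left // 2           # 511 in the log above
    return n_past - n_discard

# Without a stop token (or a max-token cap) generation fills the context,
# shifts, fills it again, shifts again, ... indefinitely.
n_past = n_ctx
for step in range(3):                 # bounded here; unbounded in the bug
    n_past = context_shift(n_past)
    print(f"shift {step}: kv cache reduced to {n_past} tokens, generation continues")
    n_past = n_ctx                    # cache fills back up as tokens keep streaming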

Just to be sure they are using exactly the same model, I didn't pull the model with ollama: I downloaded it and imported it manually using a modelfile based on the original, and named it tinyllama-custom:

root@be406c0fa880:/srv/custom/models# cat tinyllama-1.1b-chat-v1.0.Q4_K_M.modelfile
FROM /srv/custom/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
SYSTEM """You are a helpful AI assistant."""
PARAMETER stop "<|system|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "</s>"
root@be406c0fa880:/srv/custom/models# ollama create tinyllama-custom -f tinyllama-1.1b-chat-v1.0.Q4_K_M.modelfile
transferring model data
creating model layer
creating template layer
creating system layer
creating parameters layer
creating config layer
using already created layer sha256:9fecc3b3cd76bba89d504f29b616eedf7da85b96540e490ca5824d3f7d2776a0
using already created layer sha256:af0ddbdaaa26f30d54d727f9dd944b76bdb926fdaf9a58f63f78c532f57c191f
using already created layer sha256:c8472cd9daed5e7c20aa53689e441e10620a002aacd58686aeac2cb188addb5c
using already created layer sha256:fa956ab37b8c21152f975a7fcdd095c4fee8754674b21d9b44d710435697a00d
writing layer sha256:2dc31b4921bcadca51ac93b60788e51c18955c48cab69d429ce922c0aa67ab82
writing manifest
success
λ ollama run tinyllama-custom
>>> How old is Mickey Mouse?
How old is Mickey Mouse?
As of now (2021), Mickey Mouse is 93 years old.

>>> Send a message (/? for help)
mudler commented 6 months ago

> What is weird to me is that I don't have this issue with ollama, even though they are using llama.cpp as well, AFAIK.
>
> [...]

ollama doesn't use the llama.cpp http/server code; indeed, https://github.com/ggerganov/llama.cpp/issues/3969 affects only that HTTP server implementation. When we switched away from using the bindings, we started relying directly on the llama.cpp code and building a gRPC server around it in C++, which brings us closer to the llama.cpp implementation (with its eventual bugs attached as well).

mudler commented 5 months ago

@dionysius this is going to be fixed in https://github.com/mudler/LocalAI/pull/1704

mudler commented 5 months ago

This is fixed in LocalAI. The upstream workaround is likewise to put a cap on max tokens, since the model tends to hallucinate and infinite context shifting might actually lead to infinite answers too (see the commit message in https://github.com/mudler/LocalAI/commit/c56b6ddb1cee8b8b2a19ddeb9efdb464e1789f2e). It was nice to see upstream confirm the issue in https://github.com/ggerganov/llama.cpp/issues/3969#issuecomment-1961817156 after the above workaround - it sounds much safer not to expose the user to this at all by disabling it entirely, and I think what we do shields the user from such nuances.
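
For anyone still on an older build, the same idea can be applied per request: an explicit max_tokens keeps the answer bounded even if the backend would otherwise keep shifting context (a sketch; it reuses the se-chat model name from the config posted earlier):

# Per-request cap: generation stops after max_tokens even if no stop token shows up.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # default LocalAI port
    json={
        "model": "se-chat",                         # model name from the config above
        "messages": [{"role": "user", "content": "How old is Mickey Mouse?"}],
        "max_tokens": 256,                          # explicit upper bound on the reply length
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])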

We can look at this again if someone really thinks this is an issue. Closing it for now.