LoganDark opened 6 months ago

kobold_debug.json

For some reason, token streaming just does not work. It's enabled, and the server's terminal output updates every token, but no messages are actually sent over the websocket to the UI, so nothing can be displayed until the response is complete. I have no idea what is going on.

I'm on the latest United commit 1e985ed51bcdde506c223a93017aa05647792063.
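A quick way to check whether the server emits any Socket.IO traffic at all is to attach a catch-all listener. This is a minimal sketch, assuming United's UI talks standard Socket.IO on localhost:5000; event names and protocol versions may differ between releases:

```python
import socketio

sio = socketio.Client()

@sio.on('*')
def catch_all(event, *args):
    # Log every event the server emits, whatever its name.
    print('event:', event, 'payload:', args)

sio.connect('http://localhost:5000')
sio.wait()  # keep listening; trigger a generation from the UI meanwhile
```

If token events show up here but not in the UI, the problem is on the client side; if nothing arrives, the server never emits them.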
What kind of model / API was it hooked up to?
I thought that would be in the debug JSON, but I've tried with both LLaMA 2 and Mixtral 8x7B in GGUF format, running on KoboldCpp (with cuBLAS and full offload to a 3090). I'm using the KoboldAI United UI (localhost:5000, not Lite).
United can't stream over the API; that's why streaming is missing.
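(For context, the API path is a single blocking call. A minimal sketch against the standard KoboldAI /api/v1/generate endpoint, with field names per the published API spec; treat the details as an assumption:

```python
import requests

resp = requests.post(
    'http://localhost:5000/api/v1/generate',
    json={'prompt': 'Once upon a time', 'max_length': 80},
)
# The whole completion arrives in one response; there is no
# token-by-token variant of this endpoint.
print(resp.json()['results'][0]['text'])
```

Since the endpoint only responds when generation is finished, there is nothing to stream.)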
What do you mean it can't stream over the API? So it can't stream at all?
It can stream when you use Hugging Face-based models in the main UI.
So I can't use my 3090 to run models? Or I can't use GGUF files?
You can't have GGUFs, United, and streaming all at the same time. Streaming works when you run KoboldCpp directly and use its own bundled KoboldAI Lite.
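(A hedged sketch of that direct route: KoboldCpp exposes its own SSE streaming endpoint, by default on port 5001. The /api/extra/generate/stream route and its token-per-event payload are taken from KoboldCpp's extra API and may vary by version:

```python
import json
import requests

with requests.post(
    'http://localhost:5001/api/extra/generate/stream',
    json={'prompt': 'Once upon a time', 'max_length': 80},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # SSE frames look like: data: {"token": "..."}
        if line.startswith(b'data: '):
            token = json.loads(line[len(b'data: '):]).get('token', '')
            print(token, end='', flush=True)
```

That streams fine because it talks to KoboldCpp itself, bypassing United entirely.)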
OK, so the solution is not to use GGUF then? The Lite UI is mostly unusable for me (it works fine, it just has an awful user experience).
Yes, the backends built into KoboldAI United should work (Hugging Face, ExLlamaV2).