Closed cuyler72 closed 1 month ago
@cuyler72 A valid concern, thanks. This application uses LLM APIs in an unusual way to send large numbers of small requests serially. I've added a note to the README.
@cuyler72 If you're interested, the latest version has switched to breadth-first search (it was depth-first before) and supports pipeline-parallel issuing of LLM requests as a bonus. Try setting the env var LLAMA_PIPELINE_REQUESTS=2; this should greatly improve performance when latency is high!
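For anyone curious how pipeline-parallel request issuing helps here, a minimal sketch (all names hypothetical, `send_request` is a stand-in for a real llama.cpp API call): keeping a small fixed number of requests in flight hides per-request network latency, which is exactly what hurts when the webui and server are on different machines.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    # Stand-in for an actual HTTP call to the llama.cpp server.
    return f"completion for: {prompt}"

def pipelined(prompts, depth=None):
    # Pipeline depth: e.g. LLAMA_PIPELINE_REQUESTS=2 keeps two
    # requests in flight at once instead of issuing them serially.
    if depth is None:
        depth = int(os.environ.get("LLAMA_PIPELINE_REQUESTS", "1"))
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # map() preserves input order while overlapping up to `depth`
        # requests, so round-trip latency is paid once per batch of
        # `depth` requests rather than once per request.
        return list(pool.map(send_request, prompts))
```

With depth 1 this degrades to the original serial behavior, so it is safe as a default; higher depths only pay off when network latency dominates per-request time.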
This isn't really an issue, just info.
I noticed that if the web app and llama.cpp run on different machines, generation speed slows down immensely. Having the webui on the same server and connecting to it directly is vastly superior to running the webui locally with the API streamed over the internet. This should probably be mentioned in the README.