the-crypt-keeper / LLooM

Experimental LLM Inference UX to aid in creative writing
MIT License

This application is highly sensitive to latency. #5

Closed cuyler72 closed 1 month ago

cuyler72 commented 1 month ago

This isn't really an issue, just info.

I noticed that if the web app and llama.cpp run on different machines, generation speed slows down immensely. Running the webui on the same server as llama.cpp and connecting to it directly is vastly superior to running the webui locally with the API streamed over the internet. This should probably be mentioned in the README.

the-crypt-keeper commented 1 month ago

@cuyler72 A valid concern, thanks. This application uses LLM APIs in an unusual way: it sends a large number of small requests serially, so every request pays a full network round trip. I've added a note to the README.
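
For illustration only (this is a minimal sketch, not LLooM's actual code), the access pattern looks roughly like this against a llama.cpp server's `/completion` endpoint; the URL, fanout, and `n_predict` values are assumptions:

```python
import requests

# Assumed local llama.cpp server endpoint (adjust to your setup).
API_URL = "http://localhost:8080/completion"

def expand(prompt: str, fanout: int = 4) -> list[str]:
    """Issue `fanout` small completion requests one after another.

    Each call pays a full network round trip, so total time is roughly
    fanout * (round_trip_latency + generation_time). On a remote link
    the latency term can easily dominate.
    """
    results = []
    for _ in range(fanout):
        resp = requests.post(API_URL, json={"prompt": prompt, "n_predict": 8})
        resp.raise_for_status()
        results.append(resp.json()["content"])
    return results
```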

the-crypt-keeper commented 1 month ago

@cuyler72 If you're interested, the latest version has switched to breadth-first search (it was depth-first before) and, as a bonus, supports pipeline-parallel issuing of LLM requests. Try setting the env var LLAMA_PIPELINE_REQUESTS=2; this should greatly improve performance when latency is high!
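
For anyone landing here later, here's a rough sketch of what pipeline-parallel issuing buys you (again an assumption-laden illustration, not the project's actual implementation; endpoint and parameters are made up). With two requests in flight, the network round trip of one overlaps with the generation of the other:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

# Assumed llama.cpp server endpoint (adjust to your setup).
API_URL = "http://localhost:8080/completion"

def complete(prompt: str) -> str:
    resp = requests.post(API_URL, json={"prompt": prompt, "n_predict": 8})
    resp.raise_for_status()
    return resp.json()["content"]

def expand_pipelined(prompts: list[str], pipeline: int = 2) -> list[str]:
    """Keep `pipeline` requests in flight instead of issuing them serially,
    hiding per-request latency behind ongoing generation."""
    with ThreadPoolExecutor(max_workers=pipeline) as pool:
        return list(pool.map(complete, prompts))
```

To try the actual feature, just export `LLAMA_PIPELINE_REQUESTS=2` in the environment before your usual launch command.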