Open · Deltrego opened this issue 3 months ago
@Deltrego thanks for the note. We know things are slower than they could be on NVIDIA due to our historical choice of CUDA version. We're going to update our toolchain soon, now that we've released 0.3.0. We'll update this issue when the new build is available.
Thank you for answering so quickly; I saw it just now. Apologies: I took some measurements and realized that the GPU offload setting had reset to 28/32 layers. With all 32/32 layers offloaded, it appears to run at a comparable speed (36.93 tok/s in LM Studio vs. 33.66 tok/s in the other frontend, ±1 tok/s). The suggestion still applies somewhat, so maybe keep the issue open at low priority?
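For reference, a minimal sketch of how a tokens-per-second comparison like the one above can be reproduced against a plain llama.cpp backend. It assumes the llama-cpp-python bindings, a placeholder model path, and an arbitrary prompt, none of which reflect LM Studio's internals, so treat it as an illustration of the full-offload measurement rather than the actual test setup:

```python
import time

from llama_cpp import Llama  # assumption: llama-cpp-python, not LM Studio's llama.dll

MODEL_PATH = "model.gguf"  # placeholder; substitute the model actually tested

# n_gpu_layers=-1 offloads every layer (the 32/32 case); 28 would reproduce
# the accidental partial offload mentioned above.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=2048, verbose=False)

prompt = "Explain what a GPU does, in one paragraph."

start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.7)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.2f} s -> {n_generated / elapsed:.2f} tok/s")
```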
That's great, good to know!
Hello, I noticed that token generation on a local GPU is accelerated, but still slower than in a similar llama.cpp-powered frontend that uses the HTTP server and streams the output. Your choice of llama.dll should avoid the HTTP overhead, so I wonder whether you are generating and decoding one token at a time instead of issuing a bulk request, and/or using a single shared thread instead of a producer-consumer approach in which decoding and output are separated from generation (so it doesn't wait for a word to be printed to the GUI before requesting the next token). Or maybe GPU latency is not being hidden by asynchronous data transfers. Without the source code I can only guess, and I am not familiar enough with llama.dll to know whether it handles multi-token requests or asynchronous streams.
Running LM Studio 0.3.0 on Windows 10.0.19045.4651 with an NVIDIA GeForce RTX 2060 (6 GB). The model fits entirely on the GPU.
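To illustrate the producer-consumer point above: a minimal, hypothetical sketch (not LM Studio's or llama.dll's actual code) in which a generation thread keeps requesting tokens while a separate thread drains a queue and handles decoding and GUI output, so generation never waits on rendering. `generate_next_token` is a made-up stand-in for whatever per-token call the backend exposes:

```python
import queue
import threading
import time

# Hypothetical stand-in for the backend's per-token generation call; a real
# implementation would call into llama.dll / llama.cpp here.
def generate_next_token(state):
    if not state["remaining"]:
        return None  # end of sequence
    time.sleep(0.02)  # pretend the GPU spends ~20 ms per token
    return state["remaining"].pop(0)

SENTINEL = object()  # marks the end of generation on the queue

def producer(state, q):
    # Generation thread: requests tokens back-to-back, never waiting on the UI.
    while True:
        tok = generate_next_token(state)
        if tok is None:
            break
        q.put(tok)
    q.put(SENTINEL)

def consumer(q, render):
    # Decoding/output thread: drains the queue and updates the GUI at its own pace.
    while True:
        tok = q.get()
        if tok is SENTINEL:
            break
        render(tok)

def stream_generate(state, render):
    q = queue.Queue(maxsize=64)  # a small buffer decouples generation from display
    t = threading.Thread(target=producer, args=(state, q), daemon=True)
    t.start()
    consumer(q, render)
    t.join()

if __name__ == "__main__":
    fake_state = {"remaining": "this is a streamed reply".split()}
    stream_generate(fake_state, render=lambda tok: print(tok, end=" ", flush=True))
    print()
```

With this split, the time spent detokenizing and updating the GUI overlaps with the GPU producing the next token; batching the sampling loop inside the backend, rather than issuing one request per token, would be a complementary change.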