ThiloteE opened this issue 3 months ago
I ran a small test with a fixed instruction prompt across the versions shown in this image, using temp 0.001 to avoid the potential temp-0 problem.
I found that the response itself changes across these three versions (2.7.5, 2.8.0, 3.1.0) when using CPU.
The following is a report on the speed variance for each version. For each one, I regenerated a single prompt a few times and took the fastest reported speed.
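For reference, the measurement can be reproduced outside the GUI with something like the following; this is a minimal sketch using the gpt4all Python bindings, so the numbers are comparable to, but not identical with, what GPT4All-Chat reports. The model filename is an assumption, and counting streamed chunks only approximates the GUI's token count.

```python
# Rough tokens/sec benchmark with the gpt4all Python bindings.
# Assumptions: the Q4_0 Llama 3 8B Instruct file name, and that counting
# streamed chunks approximates the token count the Chat GUI reports.
import time

from gpt4all import GPT4All

PROMPT = "Explain the difference between a list and a tuple in Python."
RUNS = 5  # regenerate a few times, keep the fastest run

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf", device="cpu")

best_tps = 0.0
for _ in range(RUNS):
    start = time.perf_counter()
    tokens = 0
    # streaming=True yields one string chunk per generated token
    for _chunk in model.generate(PROMPT, max_tokens=256, temp=0.001,
                                 streaming=True):
        tokens += 1
    elapsed = time.perf_counter() - start
    best_tps = max(best_tps, tokens / elapsed)

print(f"fastest of {RUNS} runs: {best_tps:.1f} tokens/sec")
```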
What model? Maybe it is model-related after all. Partial or full offloading? If it's on GPU, I usually offload 19 layers partially.
I will do more precise testing later.
I was using a known-good model, Llama 3 8B Instruct, with all layers on the GPU and an 8k context, since that is the model's limit. I don't like partial offloading; if the model is too big for the GPU, running it entirely on the CPU usually works better for me.
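For anyone wanting to reproduce that setup programmatically, this is roughly the equivalent configuration via the gpt4all Python bindings, assuming a bindings version that exposes `n_ctx` and `ngl`; the model filename is again an assumption.

```python
from gpt4all import GPT4All

# Assumed model file name; use whichever Llama 3 8B Instruct GGUF you have.
# device="gpu" selects the GPU backend in the bindings, ngl=100 offloads
# all layers, and n_ctx=8192 matches the model's 8k context limit.
model = GPT4All(
    "Meta-Llama-3-8B-Instruct.Q4_0.gguf",
    device="gpu",
    ngl=100,
    n_ctx=8192,
)
print(model.generate("Hello", max_tokens=16, temp=0.001))
```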
Bug Report
Inference on LLMs became noticeably slower: roughly 30-40% fewer tokens per second. The change affects the CPU, CUDA, and Vulkan backends. This regression has still not been fixed as of GPT4All-Chat 3.2.1.
Steps to Reproduce
Upgrade from GPT4All-Chat 3.0 to GPT4All-Chat 3.1
Expected Behavior
No slowdown.
Your Environment
Hypothesis for root cause
Here is the changelog for version 3.1: https://github.com/nomic-ai/gpt4all/releases/tag/v3.1.0
I strongly suspect that the llama.cpp update in https://github.com/nomic-ai/gpt4all/pull/2694 introduced this regression, but who knows. One would need to do a git bisect to confirm, as sketched below.
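A bisect between the two releases could be automated with `git bisect start`, `git bisect bad v3.1.0`, `git bisect good v3.0.0`, then `git bisect run python3 bisect_check.py` with a check script along these lines. The build command, the tokens/sec threshold, and the hypothetical `bench.py` (which is assumed to print a line like `tokens/sec: 31.4`) are all assumptions to adapt to your setup.

```python
#!/usr/bin/env python3
"""Check script for `git bisect run`: exit 0 if fast (good), 1 if slow (bad),
and 125 to skip commits that fail to build. Build/bench commands are assumed."""
import re
import subprocess
import sys

THRESHOLD_TPS = 25.0  # assumed cutoff between pre- and post-regression speed

# Hypothetical build of the current checkout; adjust to your configuration.
build = subprocess.run(["cmake", "--build", "build", "-j"], capture_output=True)
if build.returncode != 0:
    sys.exit(125)  # can't build this commit: tell bisect to skip it

# Hypothetical benchmark that prints e.g. "tokens/sec: 31.4".
bench = subprocess.run(["python3", "bench.py"], capture_output=True, text=True)
match = re.search(r"tokens/sec:\s*([\d.]+)", bench.stdout)
if not match:
    sys.exit(125)  # benchmark did not run: skip rather than mislabel

tps = float(match.group(1))
sys.exit(0 if tps >= THRESHOLD_TPS else 1)
```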