nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

CPU t/s regression from GPT4All 2.6.1 -> 2.6.2 #2204

Open IndustrialOne opened 5 months ago

IndustrialOne commented 5 months ago

On my old 4-core computer, I got roughly 2.1 t/s with the openorca-chat model. When I upgraded to a 6-core PC, the speed doubled, and other models got the same speed increase. Then two things happened at the same time: I had to use a crappy video card temporarily until I could replace my broken one, and I upgraded to the latest GPT4All.

My speed dropped to 1.2 t/s for openorca and below 1 t/s for other models. I assumed it was the crappy GPU, but I just got my new one, which benchmarks about 3 times faster than the old one, and I'm still getting the same slow speed. I downgraded to GPT4All 2.6.1 and my speed is back up to 4 t/s for openorca, so it seems the problem is the latest version. Why is this?
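If it helps anyone reproduce the comparison outside the GUI, here is a rough sketch of how a repeatable CPU-only tokens/sec number could be taken with the gpt4all Python binding. The model filename and prompt are just placeholders, and counting streamed pieces only approximates the token rate:

```python
import time
from gpt4all import GPT4All  # pip install gpt4all

# Placeholder filename: substitute whatever OpenOrca build sits in your models folder.
MODEL = "mistral-7b-openorca.gguf2.Q4_0.gguf"

# Force CPU so different versions compare like-for-like.
model = GPT4All(MODEL, device="cpu")

prompt = "Explain what a context window is in two sentences."
start = time.time()
pieces = list(model.generate(prompt, max_tokens=200, streaming=True))
elapsed = time.time() - start

# Each streamed piece is roughly one token, so this is only an approximate t/s.
print(f"{len(pieces)} pieces in {elapsed:.1f}s -> ~{len(pieces) / elapsed:.2f} t/s")
```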

cebtenzzre commented 4 months ago

I need some more information:

- Which operating system are you running GPT4All on?
- Which GPU do you have?
- Which device does GPT4All show next to the generation speed, CPU or the GPU?
- Do you still see the slowdown on the latest version (v2.7.3)?

There were significant changes to the GPU backend in v2.6.2.

IndustrialOne commented 4 months ago

Windows 10 in a VM. I'm using an NVIDIA GT 1030. I never noticed the device listed beside the token count before, but in both cases it only says CPU. Once again, I got 3.8 t/s on 2.6.1 and 1.6 t/s on 2.7.3, on the exact same setup.

So my GPU was never utilized, interesting.

cebtenzzre commented 4 months ago

Thanks for the reply. To help narrow down the issue further, could you try a few versions between 2.6.1 and 2.7.3 to find which specific version caused the slowdown? Here are links to the versions in between:

IndustrialOne commented 4 months ago

Good catch! 2.6.2 appears to be the culprit. 3.2 t/s on 2.6.1, 1.1 t/s on 2.6.2. The output was worse too but that's just my subjective opinion.

cebtenzzre commented 3 months ago

> Windows 10 in a VM.

By the way, GPT4All will likely not be able to see your GPU inside a virtual machine unless you are a real power user doing PCIe passthrough. It most likely does not appear as an option under Settings > Application > Device.

IndustrialOne commented 3 months ago

> > Windows 10 in a VM.
>
> By the way, GPT4All will likely not be able to see your GPU inside a virtual machine unless you are a real power user doing PCIe passthrough. It most likely does not appear as an option under Settings > Application > Device.

Hey, have you figured out the cause of the regression yet?

My Windows 10 box is not ready yet, but when I tested on Windows 10 on the same hardware (not in a VM) I got 4.9 t/s, which is not a huge improvement. However, that was while using the onboard GPU, so I have no idea what a GT 1030 would do.

IndustrialOne commented 1 month ago

@cebtenzzre I see you removed this from the roadmap; does that mean the issue was fixed?

cebtenzzre commented 1 month ago

> I see you removed this from the roadmap; does that mean the issue was fixed?

We cleaned old items out of the roadmap. I haven't been able to reproduce this issue yet, although since we know it's related to CPU-only inference in a Windows 10 VM between those specific versions, it seems like it should be possible to figure out. It is almost certainly caused by an upstream llama.cpp change somewhere in this range of commits, although it's a pretty big range and would require a git bisect: https://github.com/ggerganov/llama.cpp/compare/6b0a7420d...fbf1ddec6
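Roughly, the bisect loop could look like the sketch below; the is_fast() check is just a placeholder for rebuilding against the checked-out commit and benchmarking it, which is the part that actually takes time:

```python
import subprocess

# Endpoints of the suspect llama.cpp range linked above.
GOOD = "6b0a7420d"  # roughly what GPT4All v2.6.1 shipped
BAD = "fbf1ddec6"   # roughly what v2.6.2 shipped

def git(*args: str) -> str:
    # Assumes the current directory is a llama.cpp checkout containing both commits.
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

def is_fast() -> bool:
    # Placeholder: rebuild GPT4All (or a llama.cpp benchmark) against the
    # currently checked-out commit and return True if generation speed looks
    # like 2.6.1 rather than 2.6.2.
    raise NotImplementedError

git("bisect", "start", BAD, GOOD)  # git checks out a midpoint commit
while True:
    result = git("bisect", "good" if is_fast() else "bad")
    print(result)
    if "is the first bad commit" in result:
        break
git("bisect", "reset")
```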

IndustrialOne commented 1 month ago

Thanks, please do investigate this. I have stopped upgrading GPT4All because of it. 3-4 t/s is pretty slow but tolerable; 1 t/s is intolerable and makes the whole thing unusable. Will you re-add this to the roadmap?

IndustrialOne commented 3 weeks ago

@cebtenzzre I am willing to help narrow this down. Could you compile 2.6.1 against various llama.cpp commits, and I'll tell you where the regression happens? There are 452 commits, so maybe give me builds at commits #100, #200, #300, and #400 to help narrow it down? Thanks.
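Something like the sketch below could pick those evenly spaced commits out of the range you linked; it assumes a local llama.cpp checkout that contains both endpoints:

```python
import subprocess

GOOD = "6b0a7420d"  # last known-good llama.cpp commit
BAD = "fbf1ddec6"   # first known-bad commit

# List the commits between the endpoints, oldest first.
commits = subprocess.run(
    ["git", "rev-list", "--reverse", f"{GOOD}..{BAD}"],
    capture_output=True, text=True, check=True,
).stdout.split()

print(f"{len(commits)} commits in the range")

# Print every 100th commit as a candidate build point for a coarse manual bisect.
for i in range(100, len(commits), 100):
    print(f"commit #{i}: {commits[i - 1]}")
```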

IndustrialOne commented 3 weeks ago

By the way, I installed LM Studio and it's way faster than any version of GPT4All; I'm getting 5 t/s with Llama 3. Maybe the problem isn't with llama.cpp?