Closed RoboMWM closed 3 weeks ago
I've experimented with setting `num_thread` to 4 and 3 (on a Standard_F4s_v2, 8GB RAM, 4 vCPUs), and found that it performed either the same or consistently worse than just having it removed. When it's removed, it seems to only use 2 cores (200% CPU in `top`).
Did some searching; it seems memory throughput is the bottleneck. A few threads I found: https://www.reddit.com/r/LocalLLaMA/comments/13upwrl/cpu_only_performance/ and https://www.reddit.com/r/LocalLLaMA/comments/1csgnbh/how_to_optimize_ollama_for_cpuonly_inference/l452no8/ (both recommending no more than 4-6 threads).
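For reference, a hedged sketch of what a per-request override looks like through the options of Ollama's `/api/generate` endpoint (the model name is a placeholder, and whether 4 is the right cap is exactly what's in question):

```typescript
// Sketch only: per-request override of num_thread via Ollama's API options.
// "llama3" is a placeholder model name, not something this repo ships.
export const generateRequest = {
  model: 'llama3',
  prompt: 'Hello',
  stream: false,
  // Per the threads above, 4-6 (or omitting the option entirely)
  // may beat higher values on memory-bandwidth-bound hosts.
  options: { num_thread: 4 },
}
```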
I think the options we pass when calling the Ollama API are outdated in our codebase.
Might be worth investigating whether to just cut it or move to 4-6 like you mention.
We've done tests with both powerful hardware and crappy hardware (laptops).
We'll definitely see whether just removing it is the better option, since it's hardcoded right now.
@RoboMWM
Another thing to think about (I guess) is the drop-off in returns from using more cores.
Yes, the one we have is arbitrary and matches the cores on a laptop I have. Using only 2-3 is likely the way to go. Also, capping how much memory and CPU time it can use would prevent any possible overuse too.
Regardless: yes, less is likely more for Ollama in this case.
@RoboMWM
When testing with an Azure VM, I was wondering why `ollama run` ran significantly faster than discord-ollama. I poked around to see how discord-ollama calls Ollama, and found https://github.com/kevinthedang/discord-ollama/blob/5efc7f00f2e6f94360c63cebe52c8f16d68ecaf3/src/utils/streamHandler.ts#L14 Commenting that out made it run much faster. Is there a reason it's set to this value? Should it match the number of available cores?
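If it should track the host rather than be hardcoded, a minimal sketch could derive it from the host's core count with a cap. The helper name and the cap of 4 are my assumptions here, not existing code in the repo:

```typescript
import * as os from 'node:os'

// Hypothetical helper: derive a num_thread value from the host's cores,
// capped (default 4) per the memory-bandwidth findings discussed above.
// Leaving the option unset entirely is also worth benchmarking.
export function pickNumThread(cap = 4): number {
  const cores = os.cpus().length
  return Math.max(1, Math.min(cores, cap))
}
```

On a 4-vCPU VM like the Standard_F4s_v2 above, this would return 4; on a larger box it still stays at the cap instead of oversubscribing memory bandwidth.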