kevinthedang / discord-ollama

A Discord bot that uses Ollama to interact with any large language model, letting users chat with it and host/create their own models.

Removing `num_thread` option greatly speeds up generation and reduces cpu usage #66

Closed · RoboMWM closed this issue 3 weeks ago

RoboMWM commented 3 weeks ago

When testing on an Azure VM, I was wondering why `ollama run` ran significantly faster than discord-ollama. I poked around to see how discord-ollama calls Ollama and found https://github.com/kevinthedang/discord-ollama/blob/5efc7f00f2e6f94360c63cebe52c8f16d68ecaf3/src/utils/streamHandler.ts#L14. Commenting that line out made generation much faster. Is there a reason `num_thread` is set to this value? Should it match the number of available cores?
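
For context, the call boils down to something like this (a minimal sketch using the `ollama` npm client; the model name, prompt, and the hardcoded thread count are illustrative, not the exact source):

```typescript
import ollama from 'ollama'

// Minimal reproduction of the streaming call (illustrative values).
const stream = await ollama.chat({
    model: 'llama3',
    messages: [{ role: 'user', content: 'Hello there!' }],
    stream: true,
    options: {
        // num_thread: 8, // hardcoded in streamHandler.ts; commenting it out
        //                // lets Ollama pick its own thread count
    }
})

for await (const part of stream) {
    process.stdout.write(part.message.content)
}
```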

RoboMWM commented 3 weeks ago

I've experimented with setting `num_thread` to 4 and to 3 (on a Standard_F4s_v2: 8 GB RAM, 4 vCPUs), and found that it performed either the same as or consistently worse than having the option removed. When it's removed, it seems to use only 2 cores (200% CPU in `top`).

RoboMWM commented 3 weeks ago

Did some searching; it seems memory throughput is the bottleneck. https://www.reddit.com/r/LocalLLaMA/comments/13upwrl/cpu_only_performance/ and https://www.reddit.com/r/LocalLLaMA/comments/1csgnbh/how_to_optimize_ollama_for_cpuonly_inference/l452no8/ (which recommends no more than 4-6 threads) are a couple of the threads I found.
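
If a value is kept at all, one option is to derive it from the machine rather than hardcoding it. A heuristic sketch (the 4-thread cap is an assumption taken from those Reddit threads, not a measured optimum; `os.availableParallelism()` needs Node 18.14+):

```typescript
import os from 'node:os'

// Cap threads at the ~4-6 range suggested for memory-bandwidth-bound
// inference, but never above the host's logical core count.
const MAX_USEFUL_THREADS = 4 // assumption from the linked discussions
const numThread = Math.min(MAX_USEFUL_THREADS, os.availableParallelism())

const options = { num_thread: numThread }
console.log(options) // e.g. { num_thread: 4 } on a 4-vCPU VM
```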

kevinthedang commented 3 weeks ago

I think the options we pass to the Ollama API are outdated in our codebase.

Might be worth investigating whether to just cut it or move toward 4-6 like you mention.

We've done tests with both powerful hardware and crappy hardware (laptops).

We'll definitely see if just removing it works, since it is hardcoded too; see the sketch below for a configurable alternative.

@RoboMWM
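
One way to keep the knob without hardcoding it (a hypothetical sketch; `NUM_THREAD` is an invented environment variable, not something discord-ollama currently reads):

```typescript
// Hypothetical: read the thread count from the environment; when unset,
// omit num_thread entirely so Ollama chooses its own default.
const numThread = process.env.NUM_THREAD
    ? Number(process.env.NUM_THREAD)
    : undefined

const options = {
    // only include num_thread when an explicit override was provided
    ...(numThread !== undefined ? { num_thread: numThread } : {})
}
```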

kevinthedang commented 3 weeks ago

Another thing to think about is the drop-off in returns from using more cores.

Yes, the value we have is arbitrary; it just matches the core count on a laptop I have. Going down to maybe only 2-3 is likely the way to go. Capping how much memory and CPU time it can use would also prevent any possible overuse.

Regardless, yes: less is likely more for Ollama in this case.

@RoboMWM