Running a 24-core CPU, I see the load-model screen defaults to 18 threads, though I can override it by typing in whatever number I want. When the LLM is running, I see CPU utilization at about 37.5%, which is 18/48 (24 cores, 48 threads), so only 18 threads are actually working. If I put in a number lower than 18, total utilization drops accordingly, so 18 threads appears to be the effective maximum.
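For what it's worth, the utilization math lines up. A quick sketch (the core counts are from my machine, and `expected_utilization` is just an illustrative helper, not anything from the app):

```python
# Sanity-check the figures from the post:
# 24 physical cores with SMT -> 48 logical threads.
physical_cores = 24
logical_threads = physical_cores * 2  # hyperthreading doubles the logical count

def expected_utilization(active_threads, total_threads=logical_threads):
    """Fraction of total CPU busy if `active_threads` each run flat out."""
    return active_threads / total_threads

print(f"{expected_utilization(18):.1%}")  # 18/48 = 37.5%, matching what Task Manager shows
```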
I can see that it's 'preferring' physical cores over hyperthreaded ones, which is fine - I'm just curious whether this limit will be raised over time or if it's a hard limitation. The whole concept of CUDA is parallelizing workloads, so for large models that need to spill over onto the CPU, it would be nice to have it perform better.