Open Narsil opened 1 year ago
This formula seems pretty cryptic. Is there some reasoning behind it?
let n_threads = std::cmp::max(1, std::cmp::min(max_threads, (total_work - threading_threshold + 1) / threading_threshold));
threading_threshold is what you had before to choose between num_threads=1 and num_threads=all.
(total_work - threading_threshold + 1) / threading_threshold
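Is roughly total_work / threading_threshold. Strictly, with integer division it equals floor((total_work + 1) / threading_threshold) - 1 rather than a true ceil, which would be (total_work + threading_threshold - 1) / threading_threshold.
It is a heuristic for how many threads this amount of work looks OK to share across.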
min(max_threads, X)
is to not use more threads than requested
max(1, X)
is to use at least 1.
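To make the three pieces concrete, here is a minimal sketch of the same heuristic as a standalone function. The function name `n_threads` and the `usize` types are assumptions for illustration, not the code in the PR. It also assumes `threading_threshold > 0`, and it uses `saturating_sub` because, with unsigned integers, `total_work - threading_threshold` would underflow whenever `total_work` is smaller than the threshold.

```rust
/// Hypothetical helper mirroring the one-liner discussed above.
/// Assumes all quantities are `usize` and `threading_threshold > 0`.
fn n_threads(total_work: usize, threading_threshold: usize, max_threads: usize) -> usize {
    // Same value as (total_work - threading_threshold + 1) / threading_threshold
    // when total_work >= threading_threshold, but clamps to 0 instead of
    // underflowing when the work is smaller than the threshold.
    let estimate = (total_work + 1).saturating_sub(threading_threshold) / threading_threshold;

    // Never use more threads than requested, and always use at least one.
    estimate.min(max_threads).max(1)
}
```

With a threshold of 100 and 8 threads available, this gives 1 thread for 50, 100, or 199 units of work, 2 threads for 300, and all 8 threads for 10,000.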
I'm not sure that this change is optimal by any means.
But it does yield a significant improvement when running relatively small matmuls on a 48-core machine.
Before:
After:
At least we're not slowing down drastically (but this is not an improvement either)