Currently the MSM scheduler is minimizing the global number of additions and doublings.
However, to benefit from maximum parallelism, it might be worthwhile to minimize the per-thread number of additions and doublings, even if that slightly increases the global count.
Motivating example:
256 inputs require a window size of c = 7 bits.
After endomorphism acceleration we have coefficients of 128 bits, hence 128/7 = 18.29, i.e. 19 mini-MSMs once rounded up.
On a 16-thread machine, you would wait for 2 rounds of mini-MSMs, with 13 out of 16 threads idle during the second round, since only 3 mini-MSMs remain.
This can be mitigated with latency hiding, but you can only do so much if the imbalance is that large.
Here, moving to c = 8 for an exact 16-way parallelization (128/8 = 16 mini-MSMs), or to c = 4 for 32 mini-MSMs (128/4 = 32, two full rounds), would better utilize the cores.
Note that if cores are not homogeneous, for example with one core 3x faster than the others, an exact division of work no longer balances the load.
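
The arithmetic in the motivating example can be checked with a small sketch. It assumes a simplified model in which each window is one mini-MSM of equal cost, every thread takes one mini-MSM per round, and all cores are homogeneous; the function name `window_schedule` and its parameters are hypothetical and do not come from the scheduler's code.

```python
import math

def window_schedule(scalar_bits: int, c: int, threads: int):
    """Return (windows, rounds, idle_in_last_round) for a window size of c bits.

    Simplified model (assumption): one mini-MSM per window, one mini-MSM per
    thread per round, all mini-MSMs cost the same, all cores are identical.
    """
    windows = math.ceil(scalar_bits / c)        # number of mini-MSMs
    rounds = math.ceil(windows / threads)       # rounds needed to process them all
    busy_last = windows - (rounds - 1) * threads
    idle_last = threads - busy_last             # threads with nothing to do in the last round
    return windows, rounds, idle_last

if __name__ == "__main__":
    # 128-bit coefficients (after endomorphism acceleration) on 16 threads.
    for c in (7, 8, 4):
        windows, rounds, idle = window_schedule(128, c, 16)
        print(f"c={c}: {windows} mini-MSMs, {rounds} round(s), "
              f"{idle} idle thread(s) in the last round")
```

Under this model, c = 7 leaves 13 threads idle in the second round, while c = 8 and c = 4 leave none. The model ignores the extra bucket additions and doublings that a different c introduces globally, so it only illustrates the load-balance side of the trade-off.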