Parallel MSM load balancing: minimize work per-thread

Currently the MSM scheduler minimizes the global number of additions and doublings.

However, to benefit from maximum parallelism, it might be worthwhile to minimize the per-thread number of additions and doublings, even if that means slightly more of them globally.

Motivating example:

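256 inputs require c = 7 (the window size in bits).
After endomorphism acceleration the coefficients are 128 bits, hence 128/7 = 18.29, i.e. 19 mini-MSMs once the partial last window is counted.
On a 16-thread machine, you would wait for 2 rounds of mini-MSMs, with 13 of the 16 threads idle during the second round (16 mini-MSMs in the first round, 3 in the second).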
This can be mitigated with latency hiding, but you can only do so much when the imbalance is that large.
Here, moving to c = 8 for an exact 16-way parallelization, or to c = 4 for 32-way, would better utilize the cores.
Note that if cores are not homogeneous, with some 3x faster than others, we're at a loss with exact work division.
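
The arithmetic of the example can be checked with a small sketch. This is a simplified model, not Constantine's scheduler: the function name `window_schedule` and its fields are hypothetical, and it assumes one thread per mini-MSM, equal cost per mini-MSM, and a window count rounded up to include the partial last window. It deliberately ignores that the per-window cost itself changes with c.

```python
import math

def window_schedule(scalar_bits: int, c: int, threads: int) -> dict:
    """Toy model: count mini-MSMs and scheduling rounds for window size c.

    Assumption: each c-bit window is one mini-MSM handled by one thread,
    and all mini-MSMs cost the same.
    """
    windows = math.ceil(scalar_bits / c)        # partial last window counts as one
    rounds = math.ceil(windows / threads)       # threads process windows in waves
    busy_last = windows - (rounds - 1) * threads
    return {
        "c": c,
        "windows": windows,
        "rounds": rounds,
        "idle_last_round": threads - busy_last,
        "utilization": windows / (rounds * threads),
    }

if __name__ == "__main__":
    # 128-bit coefficients (after endomorphism acceleration), 16 threads
    for c in (4, 7, 8):
        print(window_schedule(128, c, 16))
```

Under these assumptions, c = 7 needs two rounds with a mostly idle second round, while c = 8 fits in one round and c = 4 fills two rounds exactly. A real choice of c would also have to weigh the extra additions and doublings that a different window size implies.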

mratsim / constantine

Parallel MSM load balancing: minimize work per-thread #451