Currently we run one bench process alongside one search process using n-1 threads. This doesn't account for the RAM bandwidth limitations discussed, so the measured nps is far higher than the real nps.
Instead, this PR runs a bench process for each active core and takes the average NPS.
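A minimal sketch of the idea, not the code in this PR: start one bench per active core at the same time and average the reported figures. The names `parse_nps`, `run_bench` and `average_concurrent_nps` are hypothetical, and the sketch assumes the engine's `bench` command prints a `Nodes/second : <number>` line.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def parse_nps(output: str) -> float:
    """Extract the nps figure from bench output (assumed 'Nodes/second : N' line)."""
    match = re.search(r"Nodes/second\s*:\s*(\d+)", output)
    if match is None:
        raise ValueError("no 'Nodes/second' line found in bench output")
    return float(match.group(1))


def run_bench(engine_path: str) -> float:
    """Run a single bench process and return its nps."""
    result = subprocess.run(
        [engine_path, "bench"], capture_output=True, text=True, check=True
    )
    # Search both streams so the sketch does not depend on where the summary goes.
    return parse_nps(result.stdout + result.stderr)


def average_concurrent_nps(engine_path: str, active_cores: int, bench=run_bench) -> float:
    """Run one bench per active core concurrently and return the mean nps."""
    if active_cores < 1:
        raise ValueError("active_cores must be at least 1")
    # One thread per core; each thread just blocks on its own engine process,
    # so all benches compete for memory bandwidth at the same time.
    with ThreadPoolExecutor(max_workers=active_cores) as pool:
        results = list(pool.map(bench, [engine_path] * active_cores))
    return mean(results)
```

The `bench` parameter exists only so the averaging can be exercised without a real engine, e.g. `average_concurrent_nps("./stockfish", 4, bench=lambda path: 1000.0)`.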
I think that's a reasonable direction. Things to consider (I don't know the right answer):
- This will penalise SMP tests, where the actual nps will be higher than the one measured this way. This could be solved by doing a separate SMP measurement for SMP tests.
- I have observed that on workers with very many cores the 1 second test might not be such a good measurement, since the system is still spawning engines and the measurement only becomes stable once everything is running.
- This effectively changes the TC for the progression test, so it will have some effect there. Maybe it's something to consider merging shortly after a release (i.e. when we usually update the reference nps?).
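To illustrate the last point, here is a rough sketch of why a lower measured nps lengthens the effective TC. It assumes the TC is scaled by the ratio of a reference nps to the measured nps; the function name `scale_tc` and the numbers are hypothetical, not values taken from fishtest.

```python
def scale_tc(base_tc_seconds: float, reference_nps: float, measured_nps: float) -> float:
    """Scale a base time control so slower machines get proportionally more time."""
    if measured_nps <= 0:
        raise ValueError("measured_nps must be positive")
    # Assumed scaling rule: a worker measured at half the reference nps
    # would get twice the base time.
    return base_tc_seconds * reference_nps / measured_nps


# Hypothetical numbers: the same worker measured two ways.
old_tc = scale_tc(10.0, 1_000_000, 1_000_000)  # single bench process
new_tc = scale_tc(10.0, 1_000_000, 800_000)    # average over concurrent benches
```

With these made-up figures, a 20% drop in measured nps turns a 10 second base TC into 12.5 seconds, which is the kind of shift the progression test would see.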
This PR modifies the NPS measurement for TC scaling to more closely resemble actual testing conditions. In particular, it addresses the point raised in https://github.com/official-stockfish/fishtest/issues/2077