maximmasiutin opened this issue 1 year ago
There may be cases where simple benchmarking would not give a result, e.g. in multithreading
I disagree with this. Just benchmark with multiple threads.
Why do benchmarks with multiple threads show no benefits while cutechess-cli with the same number of threads shows a benefit?
Or should I run more games?
Probably yes, because you need an extremely low error margin, which is usually very, very time-consuming to reach locally.
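As a rough illustration of why the error margin drives the cost: fishtest itself stops tests with SPRT rather than a fixed number of games, but even a simplified normal-approximation sketch shows how slowly the 95% Elo margin shrinks with the game count (the ~60% draw ratio below is an assumed figure, not a measurement):

```python
# Rough sketch, not fishtest's SPRT/pentanomial model: normal-approximation
# 95% confidence margin, in Elo, for a fixed number of games near a 50% score.
import math

def elo_margin_95(games, draw_ratio=0.6):
    # Per-game variance of the score (win=1, draw=0.5, loss=0) at a 50% score.
    var = 0.25 * (1.0 - draw_ratio)
    se_score = math.sqrt(var / games)              # standard error of the mean score
    elo_per_score = 400.0 / (math.log(10) * 0.25)  # d(Elo)/d(score) at score = 0.5
    return 1.96 * se_score * elo_per_score         # two-sided 95% margin in Elo

for n in (10_000, 100_000, 800_000):
    print(f"{n:>8,} games -> +/- {elo_margin_95(n):.2f} Elo")
```

Under these assumptions the margin is still about +/-1.4 Elo after 100,000 games, which is why patches worth only a fraction of an Elo need game counts in the hundreds of thousands.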
a few hundred thousand, yea
I agree 👍
OK, I will run 800000 games, TC 60s+0.6s, 16 threads: https://fishtest.masiutin.net/tests/view/643e92515af72ac7dd0a68d8?show_task=0
What are you doing?! That’s not even the fishtest limit for LTC SMP tests! The test you just started will go on for weeks.
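For scale, a back-of-envelope estimate of the wall-clock cost of 800,000 games at 60+0.6; the average game length and the concurrency below are assumptions, not measurements:

```python
# Back-of-envelope only; move count and concurrency are guesses.
games        = 800_000
tc_base_s    = 60.0    # 60 s base time per side
tc_inc_s     = 0.6     # 0.6 s increment per move
moves_side   = 70      # assumed average number of moves per side per game
concurrency  = 10      # assumed number of games running at once across all workers

secs_per_game = 2 * (tc_base_s + tc_inc_s * moves_side)   # rough total clock time for both sides
total_hours   = games * secs_per_game / 3600
wall_days     = total_hours / concurrency / 24
print(f"~{secs_per_game / 60:.1f} min/game, ~{total_hours:,.0f} engine-hours total, "
      f"~{wall_days:,.0f} days at concurrency {concurrency}")
```

Even with generous concurrency this lands in the range of months rather than days, which is the point being made above.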
@Disservin wrote:
That’s not even the fishtest limit for LTC SMP tests!
The test will probably finish faster, yielding results (positive or negative) sooner than that limit, I hope.
The test you just started will go on for weeks
Maybe I will manage to find a few more workers.
A bit OT: What is the shortest reasonable TC on fishtest? I guess 3+0.03s should be OK. Is this correct?
I'm using cutechess-cli with the -tb parameter to give cutechess the Syzygy tablebases for adjudication; however, the engines are not using the tablebases. They just help cutechess-cli decide the winner faster without losing time on additional moves. TCEC uses this approach, where the tablebases are used by cutechess-cli (their custom version) but not by the engines.
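A minimal sketch of that setup, assuming a recent cutechess-cli build; binary names, the tablebase path and the game counts are placeholders. The Syzygy files are passed only via -tb for adjudication, and no SyzygyPath option is set for the engines, so they never probe the tables themselves:

```python
# Launch a small cutechess-cli match with tablebase adjudication only.
import subprocess

cmd = [
    "cutechess-cli",
    "-engine", "cmd=./stockfish-patched", "name=patched",
    "-engine", "cmd=./stockfish-master",  "name=master",
    "-each", "proto=uci", "tc=60+0.6", "option.Threads=16",
    "-tb", "/path/to/syzygy",        # adjudication only; the engines get no SyzygyPath
    "-games", "2", "-rounds", "400", "-repeat",
    "-concurrency", "4",
    "-pgnout", "games.pgn",
]
subprocess.run(cmd, check=True)
```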
@maximmasiutin I don’t get why you chose different time control/engine settings at all. If you are on fishtest and select LTC SMP you get 8 threads and 20+0.2, so I don’t get why you chose 16 threads and 60+0.6?!
yea 10s 1 thread much better choice
well, his point was:
Why do benchmarks with multiple threads show no benefits while cutechess-cli with the same number of threads shows a benefit?
which, reading it again, does really make sense imo; 4 vs 4 threads is still multiple threads and the same number of threads?
@maximmasiutin I don’t get why you chose different time control/engine settings at all. If you are on fishtest and select LTC SMP you get 8 threads and 20+0.2, so I don’t get why you chose 16 threads and 60+0.6?!
Suppose a CPU has 16 threads in total. In such a case, what would be the most realistic use scenario for Stockfish? I guess Threads=16 and Ponder=off. That's what I configured. I could also use 20+0.2, but I saw people using 60+0.6 on the main instance.
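For reference, that configuration corresponds to two plain UCI options; a minimal sketch, assuming a local engine binary at ./stockfish (a placeholder path):

```python
# Send the described configuration (Threads=16, Ponder off) over plain UCI.
import subprocess

engine = subprocess.Popen(["./stockfish"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

send("uci")
send("setoption name Threads value 16")
send("setoption name Ponder value false")
send("isready")
for line in engine.stdout:
    if line.strip() == "readyok":   # the options have been applied
        break
send("quit")
engine.wait()
```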
fishtest operates under very different conditions than those in which a typical user would run stockfish
but not for smp... it's also not about the most realistic setup and not about the hardware specs of the worker... the max is 8 and the min is also 8 threads for stockfish smp tests
Are we developing for the user or for fishtest?
you want us to start testing at 32 threads and 1 hour per move?
Patches that get merged into stockfish follow fishtest rules (mostly); there are very few exceptions, and this is not really one imo.
you want us to start testing at 32 threads and 1 hour per move?
You know better, but my feeling is that for testing algorithm logic, the current defaults of the fishtest main instance are the best. But since the fishtest main instance cannot at the moment be used to test on particular architectures, we should not take the defaults meant for testing algorithm logic and apply them to testing vnni512 specifics.
By fishtest rules I am referring to the default settings for the tests.
SMP being the preset for multithreaded tests.
They should be used for all kinds of tests, and your vnni tests should also use them, since the results stay comparable and a longer TC or an even higher thread count is not important.
The present issue (wish) https://github.com/glinscott/fishtest/issues/1611 is to allow testing on particular architectures, like only on workers with vnni512. We cannot do that at the moment. I don't dispute the correctness of the current default values for tests that alter the logic of the algorithms. But if we had an opportunity to test on particular architectures, we could test code modifications that do not alter the logic of the algorithms but just implement them differently: the inputs and the outputs are the same, but the underlying CPU instructions are different. That's what I propose to implement in Fishtest: an option to limit a test to particular ARCHs.
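A hypothetical sketch of what such an option could look like on the worker side; the field name allowed_archs and both helper functions are invented for illustration and do not exist in fishtest, and the CPU-flag mapping is only an approximation:

```python
# Hypothetical sketch: a test declares the ARCHs it may run on, and a worker
# only accepts the task if its CPU advertises the required flags.

def worker_cpu_flags():
    # Linux-only sketch: collect the CPU flag list from /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def worker_can_run(test_spec, flags):
    # Illustrative mapping from fishtest-style ARCH names to required CPU flags.
    arch_requirements = {
        "vnni512": {"avx512f", "avx512_vnni"},
        "vnni256": {"avx512f", "avx512_vnni"},
        "avx512":  {"avx512f", "avx512bw"},
        "avxvnni": {"avx_vnni"},
        "bmi2":    {"bmi2"},
    }
    allowed = test_spec.get("allowed_archs")
    if not allowed:                 # no restriction: behave exactly as today
        return True
    return any(arch in arch_requirements and arch_requirements[arch] <= flags
               for arch in allowed)

test = {"allowed_archs": ["vnni512"]}
print(worker_can_run(test, worker_cpu_flags()))
```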
Yes, that wish might be totally valid. I was just trying to say that the test you are running on your instance right now will a) take forever and b) does not use the settings that we currently use to accept a test, which would be 20+0.2 and 8 threads for multithreaded tests.
There is currently no reason to think that bench is somehow insufficient for measuring speedups.
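For context, the usual way to measure such a speedup is to compare the nodes-per-second figure printed by the engines' built-in bench; a minimal sketch with placeholder binary paths (a careful measurement would also pin cores and interleave the runs):

```python
# Run each binary's built-in "bench" a few times and compare median Nodes/second.
import re
import statistics
import subprocess

def bench_nps(binary, runs=5):
    samples = []
    for _ in range(runs):
        out = subprocess.run([binary, "bench"], capture_output=True, text=True)
        m = re.search(r"Nodes/second\s*:\s*(\d+)", out.stdout + out.stderr)
        samples.append(int(m.group(1)))
    return statistics.median(samples)

base  = bench_nps("./stockfish-master")
patch = bench_nps("./stockfish-patched")
print(f"nps speedup: {patch / base:.4f}")
```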
@Disservin wrote:
A good way to prove various speedup hypotheses is to have an option for Fishtest to limit a test to certain architectures, like avxvnni, avx512, vnni512 and vnni256. There may be cases where simple benchmarking would not give a result, e.g. in multithreading the execution ports are shared, but threads that do different workloads may coexist, e.g. one thread does vpdpbusd while the other does integer arithmetic and conditional branching.
In this case, the code that forces bmi2 for avxvnni- and vnni512-capable machines should be removed.