maximmasiutin opened this issue 1 year ago
There may be cases where simple benchmarking would not give a result, e.g. in multithreading
I disagree with this. Just benchmark with multiple threads.
Why do benchmarks with multiple threads show no benefits while cutechess-cli with the same number of threads shows a benefit?
Or should I run more games?
Probably yes, because you need an extremely low error margin, which is usually very, very time-consuming to reach locally.
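As a rough illustration of why the error margin drives the cost: fishtest itself stops tests with SPRT rather than a fixed number of games, but even a simplified normal-approximation sketch shows how slowly the 95% Elo margin shrinks with the game count (the ~60% draw ratio below is an assumed figure, not a measurement):

```python
# Rough sketch, not fishtest's SPRT/pentanomial model: normal-approximation
# 95% confidence margin, in Elo, for a fixed number of games near a 50% score.
import math

def elo_margin_95(games, draw_ratio=0.6):
    # Per-game variance of the score (win=1, draw=0.5, loss=0) at a 50% score.
    var = 0.25 * (1.0 - draw_ratio)
    se_score = math.sqrt(var / games)              # standard error of the mean score
    elo_per_score = 400.0 / (math.log(10) * 0.25)  # d(Elo)/d(score) at score = 0.5
    return 1.96 * se_score * elo_per_score         # two-sided 95% margin in Elo

for n in (10_000, 100_000, 800_000):
    print(f"{n:>8,} games -> +/- {elo_margin_95(n):.2f} Elo")
```

Under these assumptions the margin is still about +/-1.4 Elo after 100,000 games, which is why patches worth only a fraction of an Elo need game counts in the hundreds of thousands.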
a few hundred thousand, yea
I agree 👍
OK, I will run 800000 games, TC 60s+0.6s, 16 threads: https://fishtest.masiutin.net/tests/view/643e92515af72ac7dd0a68d8?show_task=0
What are you doing?! That’s not even the fishtest limit for LTC SMP tests! The test you just started will go on for weeks.
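For scale, a back-of-envelope estimate of the wall-clock cost of 800,000 games at 60+0.6; the average game length and the concurrency below are assumptions, not measurements:

```python
# Back-of-envelope only; move count and concurrency are guesses.
games        = 800_000
tc_base_s    = 60.0    # 60 s base time per side
tc_inc_s     = 0.6     # 0.6 s increment per move
moves_side   = 70      # assumed average number of moves per side per game
concurrency  = 10      # assumed number of games running at once across all workers

secs_per_game = 2 * (tc_base_s + tc_inc_s * moves_side)   # rough total clock time for both sides
total_hours   = games * secs_per_game / 3600
wall_days     = total_hours / concurrency / 24
print(f"~{secs_per_game / 60:.1f} min/game, ~{total_hours:,.0f} engine-hours total, "
      f"~{wall_days:,.0f} days at concurrency {concurrency}")
```

Even with generous concurrency this lands in the range of months rather than days, which is the point being made above.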
@Disservin wrote:
That’s not even the fishtest limit for LTC SMP tests!
The test will probably finish faster, yielding results (positive or negative) sooner than that limit, I hope.
The test you just started will go on for weeks
Maybe I will manage to find a few more workers.
A bit OT: What is the shortest reasonable TC on fishtest? I guess 3+0.03s should be OK. Is this correct?
I'm using cutechess-cli with the -tb parameter to give cutechess the Syzygy tablebases for adjudication; however, the engines are not using the tablebases. They just help cutechess-cli decide the winner faster without losing time on additional moves. TCEC uses this approach, where the tablebases are used by cutechess-cli (their custom version) but not by the engines.
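A minimal sketch of that setup, assuming a recent cutechess-cli build; binary names, the tablebase path and the game counts are placeholders. The Syzygy files are passed only via -tb for adjudication, and no SyzygyPath option is set for the engines, so they never probe the tables themselves:

```python
# Launch a small cutechess-cli match with tablebase adjudication only.
import subprocess

cmd = [
    "cutechess-cli",
    "-engine", "cmd=./stockfish-patched", "name=patched",
    "-engine", "cmd=./stockfish-master",  "name=master",
    "-each", "proto=uci", "tc=60+0.6", "option.Threads=16",
    "-tb", "/path/to/syzygy",        # adjudication only; the engines get no SyzygyPath
    "-games", "2", "-rounds", "400", "-repeat",
    "-concurrency", "4",
    "-pgnout", "games.pgn",
]
subprocess.run(cmd, check=True)
```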
@maximmasiutin I don’t get why you chose different time control/engine settings at all. If you are on fishtest and select LTC SMP you get 8 threads and 20+0.2, so I don’t get why you chose 16 threads and 60+0.6?!
yea 10s 1 thread much better choice
well, his point was:
Why do benchmarks with multiple threads show no benefits while cutechess-cli with the same number of threads shows a benefit?
which, reading it again, does really make sense imo; 4 vs 4 threads is still multiple threads and the same number of threads?
@maximmasiutin I don’t get why you chose different time control/engine settings at all. If you are on fishtest and select LTC SMP you get 8 threads and 20+0.2, so I don’t get why you chose 16 threads and 60+0.6?!
Suppose a CPU has 16 threads in total. In such a case, what would be the most realistic use scenario for Stockfish? I guess Threads=16 and Ponder=off. That's what I configured. I could also use 20+0.2, but I saw people using 60+0.6 on the main instance.
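For reference, that configuration corresponds to two plain UCI options; a minimal sketch, assuming a local engine binary at ./stockfish (a placeholder path):

```python
# Send the described configuration (Threads=16, Ponder off) over plain UCI.
import subprocess

engine = subprocess.Popen(["./stockfish"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd):
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

send("uci")
send("setoption name Threads value 16")
send("setoption name Ponder value false")
send("isready")
for line in engine.stdout:
    if line.strip() == "readyok":   # the options have been applied
        break
send("quit")
engine.wait()
```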
fishtest operates under very different conditions than those in which a typical user would run stockfish
but not for smp... it's also not about the most realistic setup and not about the hardware specs of the worker... the max is 8 and the min is also 8 threads for stockfish smp tests
Are we developing for the user or for fishtest?
you want us to start testing at 32 threads and 1 hour per move?
Patches that get merged into stockfish follow fishtest rules (mostly); there are very few exceptions, and this is not really one imo.
you want us to start testing at 32 threads and 1 hour per move?
You know better, but my feeling is that for testing algorithm logic, the current defaults of the fishtest main instance are the best. But since the fishtest main instance cannot at the moment be used to test on particular architectures, we should not take the defaults meant for testing algorithm logic and apply them to testing vnni512 specifics.
By fishtest rules I am referring to the default settings for the tests.
SMP being the preset for multithreaded tests.
They should be used for all kinds of tests, and your vnni tests should also use them, since the results stay comparable and a longer TC or an even higher thread count is not important.
The present issue (wish) https://github.com/glinscott/fishtest/issues/1611 is to allow testing on particular architectures, like only on workers with vnni512. We cannot do that at the moment. I don't dispute the correctness of the current default values for tests that alter the logic of the algorithms. But if we had an opportunity to test on particular architectures, we could test code modifications that do not alter the logic of the algorithms but just implement them differently: the inputs and the outputs are the same, but the underlying CPU instructions are different. That's what I propose to implement in Fishtest: an option to limit a test to particular ARCHs.
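A hypothetical sketch of what such an option could look like on the worker side; the field name allowed_archs and both helper functions are invented for illustration and do not exist in fishtest, and the CPU-flag mapping is only an approximation:

```python
# Hypothetical sketch: a test declares the ARCHs it may run on, and a worker
# only accepts the task if its CPU advertises the required flags.

def worker_cpu_flags():
    # Linux-only sketch: collect the CPU flag list from /proc/cpuinfo.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

def worker_can_run(test_spec, flags):
    # Illustrative mapping from fishtest-style ARCH names to required CPU flags.
    arch_requirements = {
        "vnni512": {"avx512f", "avx512_vnni"},
        "vnni256": {"avx512f", "avx512_vnni"},
        "avx512":  {"avx512f", "avx512bw"},
        "avxvnni": {"avx_vnni"},
        "bmi2":    {"bmi2"},
    }
    allowed = test_spec.get("allowed_archs")
    if not allowed:                 # no restriction: behave exactly as today
        return True
    return any(arch in arch_requirements and arch_requirements[arch] <= flags
               for arch in allowed)

test = {"allowed_archs": ["vnni512"]}
print(worker_can_run(test, worker_cpu_flags()))
```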
Yes, that wish might be totally valid. I was just trying to say that the test you are running on your instance right now will a) take forever and b) does not use the settings that we currently use to accept a test, which would be 20+0.2 and 8 threads for multithreaded tests.
There is currently no reason to think that bench is somehow insufficient for measuring speedups.
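For context, the usual way to measure such a speedup is to compare the nodes-per-second figure printed by the engines' built-in bench; a minimal sketch with placeholder binary paths (a careful measurement would also pin cores and interleave the runs):

```python
# Run each binary's built-in "bench" a few times and compare median Nodes/second.
import re
import statistics
import subprocess

def bench_nps(binary, runs=5):
    samples = []
    for _ in range(runs):
        out = subprocess.run([binary, "bench"], capture_output=True, text=True)
        m = re.search(r"Nodes/second\s*:\s*(\d+)", out.stdout + out.stderr)
        samples.append(int(m.group(1)))
    return statistics.median(samples)

base  = bench_nps("./stockfish-master")
patch = bench_nps("./stockfish-patched")
print(f"nps speedup: {patch / base:.4f}")
```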
@Disservin wrote:
A good way to prove various speedup hypotheses is to have an option for Fishtest to limit a test to certain architectures, like avxvnni, avx512, vnni512 and vnni256. There may be cases where simple benchmarking would not give a result, e.g. in multithreading the execution ports are shared, but threads that do different workloads may coexist, e.g. one thread does vpdpbusd while the other does integer arithmetic and conditional branching.
In this case, the code that forces bmi2 for avxvnni- and vnni512-capable machines should be removed.