official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

Suggestion: Automatically run tests (where possible) using 2 threads instead of 1 #699

Closed: xoto10 closed this issue 4 years ago

xoto10 commented 4 years ago

I've created an issue on the stockfish github about this. https://github.com/official-stockfish/Stockfish/issues/2721

Input from people here would be welcome.

vdbergh commented 4 years ago

IMHO this should be backed by a sensitivity test. Intuitively one would think that this would just add more noise...

xoto10 commented 4 years ago

Ok, I'll look into that once the submission deadline for the TCEC superfinal has passed (2 days away unless it changes again).

My intuition is that this adds useful noise, but with noob_3moves (150,000 lines) the gain may be negligible. I still think this is the right move: the main reason we can't use small opening books is that they are too deterministic, so we don't get enough variety to get a useful result (e.g. protonspring's recent tests from startpos). This seems to me like a no-lose, easy-win way to get more variety from smaller opening books.

Edit: more variety from smaller or existing opening books

xoto10 commented 4 years ago

So how should I test this? Use a single passed patch or two, or do a regression like master vs sf11? I was going to ask whether to use a search patch or an eval patch (or one of each), but I guess search patches might impact on scaling across threads, so perhaps that's not a good idea. A regression test seemed like a good idea, but again, the search elements might affect the results.

vdbergh commented 4 years ago

I guess it would be best if the maintainers commented on this. @vondele ... ?

However, the "statistical answer" would be as follows. Run two fixed-length tests (e.g. 60000 games) under identical conditions, except that one test uses 1 thread and the other uses 2 threads and half the time (to consume the same amount of resources). Then compare the sensitivity (given, with a confidence interval, on the raw statistics page of each test).

Note that sensitivity is difficult to measure accurately, so it should be measured with engines that are quite far apart in Elo, e.g. SF11 against SF10.
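A rough sketch of why a wide Elo gap helps, assuming that "sensitivity" behaves like a normalized Elo whose standard error shrinks as 1/sqrt(games) regardless of the size of the effect. The function name and the example values are hypothetical and not taken from fishtest:

```python
import math

# Assumption: normalized Elo = per-game t-value * 800 / ln(10).
NELO_PER_T = 800 / math.log(10)

def relative_ci_halfwidth(nelo: float, games: int, z: float = 1.96) -> float:
    """Approximate 95% CI half-width of nElo, as a fraction of nElo itself.

    Large-sample approximation: the standard error of the t-value is
    about 1/sqrt(games), independent of the size of the effect.
    """
    halfwidth = z * NELO_PER_T / math.sqrt(games)
    return halfwidth / abs(nelo)

# Hypothetical values: a small patch (~3 nElo) vs. a release-to-release gap (~70 nElo).
for nelo in (3.0, 70.0):
    rel = relative_ci_halfwidth(nelo, 60000)
    print(f"nElo {nelo:5.1f}: +-{rel:.0%} of the measured value")
```

Under these assumptions, 60000 games pin down the large gap to within a few percent of its value, while for the small gap the interval is nearly as large as the value itself.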

vondele commented 4 years ago

I just added a comment on the stockfish issue. Right now, I think there are reasons not to change things. However, I would be rather interested in seeing the sensitivity test done.

Note that 2 threads vs 1 thread at twice the TC will be roughly 0.95 efficiency (from the typical scaling data: https://github.com/glinscott/fishtest/wiki/UsefulData#elo-from-threading).

xoto10 commented 4 years ago

Ok. So as a suggestion, I could run 60k games sf11 vs sf10, using the noob_3moves book for more sensitivity. (Can I just enter the tags sf_10 and sf_11 in fishtest?) Should I use STC or LTC, or some intermediate value? And add 5% extra time for the th 2 test to account for the efficiency drop?

My guess is that the 2-thread case would be better with a restricted book, but with noob_3moves the difference is probably negligible since the opening book is large. It will be interesting to see what happens.

vdbergh commented 4 years ago

Again expressing my personal opinion...

vondele commented 4 years ago

Yes, STC is good to start. This is not completely unrelated to our normal and threaded regression tests, but in that case the TC ratio is not quite the same.

Note, hash should be kept the same in this test.

xoto10 commented 4 years ago

10+0.1 th 1 and 5+0.05 th 2 tests submitted. Obviously let me know if you spot any mistakes.

xoto10 commented 4 years ago

@vondele I forgot to ask, on a vaguely related matter, would it be possible to create this book in Stockfish/books :

12openings.epd

rnbqkbnr/pppppppp/8/8/2P5/8/PP1PPPPP/RNBQKBNR b KQkq - 0 1
rnbqkbnr/ppp1pppp/8/3p4/2PP4/8/PP2PPPP/RNBQKBNR b KQkq - 0 2
rnbqkb1r/pppppppp/5n2/8/3P4/5N2/PPP1PPPP/RNBQKB1R b KQkq - 2 2
rnbqkb1r/pppp1ppp/4pn2/8/2PP4/8/PP2PPPP/RNBQKBNR w KQkq - 0 3
rnbqkb1r/pppppp1p/5np1/8/2PP4/2N5/PP2PPPP/R1BQKBNR b KQkq - 1 3
rnbqkbnr/ppp2ppp/4p3/3p4/3PP3/8/PPP2PPP/RNBQKBNR w KQkq - 0 3
rnbqkbnr/pp2pppp/3p4/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 0 3
r1bqkbnr/pp1ppppp/2n5/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3
rnbqkbnr/pp1p1ppp/4p3/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 0 3
r1bqkbnr/1ppp1ppp/p1n5/1B2p3/4P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 0 4
rnbqkbnr/pp1ppppp/2p5/8/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2
rnbqkb1r/pppppppp/5n2/8/8/5N2/PPPPPPPP/RNBQKB1R w KQkq - 2 2

(I'm assuming fishtest gets the book from the Stockfish/books repo?)

This is a bit like startpos, but forces the first ply or first few plies to vary across the most played options. This was used by DeepMind in one of their papers. I don't know if I'll get to doing any tests with this book, but at least it's an option if it's up there.
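Before submitting a book like this, a quick structural check of each line can catch copy-paste damage. A minimal sketch in plain Python follows; the helper name is hypothetical, and the check covers only field count, board layout and king count, not move legality or castling rights:

```python
VALID_BOARD_CHARS = set("pnbrqkPNBRQK12345678")

def is_plausible_fen(line: str) -> bool:
    """Cheap structural check of a FEN line: 6 fields, 8 ranks of 8 squares."""
    fields = line.split()
    if len(fields) != 6:
        return False
    board, side = fields[0], fields[1]
    ranks = board.split("/")
    if len(ranks) != 8 or side not in ("w", "b"):
        return False
    for rank in ranks:
        if any(c not in VALID_BOARD_CHARS for c in rank):
            return False
        # Digits count empty squares; letters are pieces occupying one square.
        squares = sum(int(c) if c.isdigit() else 1 for c in rank)
        if squares != 8:
            return False
    # Exactly one king per side.
    return board.count("K") == 1 and board.count("k") == 1

# Three of the twelve positions proposed above, copied verbatim.
sample = [
    "rnbqkbnr/pppppppp/8/8/2P5/8/PP1PPPPP/RNBQKBNR b KQkq - 0 1",
    "rnbqkb1r/pppp1ppp/4pn2/8/2PP4/8/PP2PPPP/RNBQKBNR w KQkq - 0 3",
    "r1bqkbnr/1ppp1ppp/p1n5/1B2p3/4P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 0 4",
]
print(all(is_plausible_fen(fen) for fen in sample))
```

Running the same check over every line of the file would flag a rank with a missing or extra square before the book reaches the repo.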

xoto10 commented 4 years ago

One thing that is surprising to me. I was assuming that 10+0.1 on 1 thread would be roughly the same strength as 5+0.05 on 2 threads, maybe 1 or 2 Elo stronger because of the 5% inefficiency on 2 threads, or maybe 5 Elo at STC if the effect was exaggerated at short tc. Doing some tests at home, it looks like 2 threads is ~40 Elo weaker than 1! That's quite a drop??

vondele commented 4 years ago

so yes, I can confirm the ~40 (50) Elo difference in strength. That's actually 80% efficiency. (i.e. similar loss as playing on 8+0.08).
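The arithmetic behind the 80% figure can be sketched as follows. The helper names are hypothetical, and the Elo-per-doubling relation assumes a log-linear model of Elo against thinking time, which the thread does not state:

```python
import math

def equivalent_tc(base_s: float, inc_s: float, efficiency: float) -> tuple[float, float]:
    """Scale a 'base+increment' time control by a threading efficiency."""
    return base_s * efficiency, inc_s * efficiency

def implied_elo_per_doubling(elo_loss: float, efficiency: float) -> float:
    """Elo per doubling of time implied by an Elo loss at a given efficiency.

    Assumes Elo grows linearly in log2(time); this is only a local approximation.
    """
    return elo_loss / math.log2(1 / efficiency)

# 2 threads at 5+0.05 playing like 1 thread with 80% of the 10+0.1 budget.
base, inc = equivalent_tc(10.0, 0.1, 0.8)
print(f"{base:g}+{inc:g}")                        # prints 8+0.08
print(round(implied_elo_per_doubling(40, 0.8)))   # prints 124
```

Under that model, a 40 Elo loss at 80% efficiency would correspond to roughly 124 Elo per doubling of time at this TC; this is an illustration, not a measured value.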

Concerning the book, the 'formal' way would be a PR to the books repo, and I could merge it. I'm a bit reluctant to add books, since we have no real criterion for which books to add and which not, and fishtest would spend as much time on testing books as on testing contempt ;-)

xoto10 commented 4 years ago

Ok, these sensitivity tests finished.

10+0.1 th 1 :
ELO: 45.94 +-1.8 (95%) LOS: 100.0%
Total: 60000 W: 17083 L: 9196 D: 33721
Ptnml(0-2): 663, 5068, 12350, 9557, 2362 | 60000
https://tests.stockfishchess.org/tests/view/5eeb8c1cabad5865ae9c790e

5+0.05 th 2 :
ELO: 47.66 +-1.8 (95%) LOS: 100.0%
Total: 60000 W: 17651 L: 9472 D: 32877
Ptnml(0-2): 695, 5131, 12022, 9604, 2548 | 60000
https://tests.stockfishchess.org/tests/view/5eeb8c67abad5865ae9c7910

Maybe not strong enough to say 2 threads are definitely better than 1, but most likely no worse with a reasonable chance of being slightly better. I take this as another indication that using 2 threads for all tests is feasible at some point in the future, even if we don't implement it right now.
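One way to put numbers on that reading is to recompute Elo and a normalized Elo from the two Ptnml lines. This is a sketch under stated assumptions: it takes "sensitivity" to mean normalized Elo, defined as (score - 0.5) / per-game standard deviation * 800 / ln(10), and uses a large-sample approximation for its confidence interval. The function name is hypothetical and this is not fishtest's own code:

```python
import math

def pentanomial_stats(ptnml: list[int]) -> dict[str, float]:
    """Elo, its 95% half-width, and normalized Elo from game-pair counts.

    ptnml[i] is the number of game pairs scoring i/2 points out of 2.
    """
    pairs = sum(ptnml)
    games = 2 * pairs
    scores = [i / 4 for i in range(5)]  # pair score rescaled to [0, 1]
    mu = sum(n * s for n, s in zip(ptnml, scores)) / pairs
    var_pair = sum(n * (s - mu) ** 2 for n, s in zip(ptnml, scores)) / pairs
    sigma_game = math.sqrt(2 * var_pair)  # per-game standard deviation
    elo = -400 * math.log10(1 / mu - 1)
    # Delta method: propagate the standard error of mu through the Elo formula.
    se_mu = sigma_game / math.sqrt(games)
    elo_ci = 1.96 * se_mu * 400 / (math.log(10) * mu * (1 - mu))
    # Assumption: normalized Elo as defined in the lead-in above.
    nelo = (mu - 0.5) / sigma_game * 800 / math.log(10)
    # Approximation: standard error of the t-value is about 1/sqrt(games).
    nelo_ci = 1.96 * 800 / math.log(10) / math.sqrt(games)
    return {"elo": elo, "elo_ci": elo_ci, "nelo": nelo, "nelo_ci": nelo_ci}

th1 = pentanomial_stats([663, 5068, 12350, 9557, 2362])  # 10+0.1, 1 thread
th2 = pentanomial_stats([695, 5131, 12022, 9604, 2548])  # 5+0.05, 2 threads
for name, r in (("th 1", th1), ("th 2", th2)):
    print(f"{name}: Elo {r['elo']:.2f} +-{r['elo_ci']:.1f}, "
          f"nElo {r['nelo']:.2f} +-{r['nelo_ci']:.1f}")
```

Under these assumptions the sketch reproduces the reported Elo figures and gives a normalized Elo of roughly 71 for 1 thread and 73 for 2 threads. The gap between them is smaller than either approximate interval, which matches the cautious conclusion above.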

The noob_2moves book has only 7314 lines (vs 151k in noob_3moves). Is there interest in doing the same test with that book?