xoto10 closed 4 years ago
IMHO this should be backed by a sensitivity test. Intuitively one would think that this would just add more noise...
Ok, I'll look into that once the submission deadline for the TCEC superfinal has passed (2 days away unless it changes again).
My intuition is this adds useful noise, but using noob_3book (150,000 lines) the gain may be negligible. I still think this is the right move, though: the main reason we can't use small opening books is that they are too deterministic, so we don't get enough variety to get a useful result (e.g. protonspring's recent tests from startpos). This seems to me like a no-lose / easy-win way to get more variety from smaller opening books.
Edit: more variety from smaller or existing opening books
So how should I test this? Use a single passed patch or two, or do a regression like master vs sf11? I was going to ask whether to use a search patch or an eval patch (or one of each), but I guess search patches might impact on scaling across threads, so perhaps that's not a good idea. A regression test seemed like a good idea, but again, the search elements might affect the results.
I guess it would be best if the maintainers commented on this. @vondele ... ?
However the "statistical answer" would be as follows. Run two fixed-length tests (e.g. 60000 games) under identical conditions, except that one test uses 1 thread and the other uses 2 threads and half the time (to consume the same amount of resources). Then compare the sensitivity (given, with confidence interval, on the raw statistics page of each test).
Note that sensitivity is difficult to measure accurately, so it should be measured with engines which are quite wide apart in Elo. E.g. SF11 against SF10.
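To make the "same resources" condition concrete, here is a minimal sketch of how the time control for the multi-threaded test could be derived from the single-threaded one. The helper name `split_time_control` is made up for illustration and is not part of fishtest; it simply divides both the base time and the increment by the thread count.

```python
def split_time_control(tc: str, threads: int) -> str:
    """Divide a 'base+increment' time control evenly across threads.

    Hypothetical helper for illustration only: it keeps
    threads * time constant, with no correction for SMP inefficiency.
    """
    base, increment = (float(part) for part in tc.split("+"))
    return f"{base / threads:g}+{increment / threads:g}"


# Single-threaded reference test and its 2-thread counterpart.
reference_tc = "10+0.1"
print(split_time_control(reference_tc, 1))
print(split_time_control(reference_tc, 2))
```

Everything else (book, hash, number of games, the two engines) would stay identical between the two runs, so that any difference in sensitivity can be attributed to the thread count.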
I just added a comment on the stockfish issue. Right now, I think there are reasons not to change things. However, I would be rather interested in seeing the sensitivity test done.
Note that 2 threads vs 1 thread at twice the TC will be roughly 0.95 efficiency (from the typical scaling data https://github.com/glinscott/fishtest/wiki/UsefulData#elo-from-threading).
Ok. So as a suggestion, I could run 60k games sf11 vs sf10, using the noob_3moves book for more sensitivity. (Can I just enter the tags sf_10 and sf_11 in fishtest?) Should I use STC or LTC, or some intermediate value? And add 5% extra time for the th 2 test to account for the efficiency drop?
My guess is that the 2 thread case would be better with a restricted book, but with noob_3moves the difference is probably negligible since the opening book is large. It will be interesting to see what happens.
Again expressing my personal opinion...
Yes you can just use the tags sf_10 and sf_11.
One should not add extra time. If concurrency loses some efficiency then this should be taken into account.
As the intention would be to do better than the current testing procedure, the test should be done with noob_3moves. However, it would indeed be an interesting fact if the outcome depended on the book. This could also be tested at some point.
Personally I would do the test at STC. This is the most important case for Fishtest.
yes, STC is good to start. This is not completely unrelated to our normal and threaded regression test, but in that case the TC ratio is not quite the same.
Note, hash should be kept the same in this test.
10+0.1 th 1 and 5+0.05 th 2 tests submitted. Obviously let me know if you spot any mistakes.
@vondele I forgot to ask, on a vaguely related matter, would it be possible to create this book in Stockfish/books :
12openings.epd
rnbqkbnr/pppppppp/8/8/2P5/8/PP1PPPPP/RNBQKBNR b KQkq - 0 1
rnbqkbnr/ppp1pppp/8/3p4/2PP4/8/PP2PPPP/RNBQKBNR b KQkq - 0 2
rnbqkb1r/pppppppp/5n2/8/3P4/5N2/PPP1PPPP/RNBQKB1R b KQkq - 2 2
rnbqkb1r/pppp1ppp/4pn2/8/2PP4/8/PP2PPPP/RNBQKBNR w KQkq - 0 3
rnbqkb1r/pppppp1p/5np1/8/2PP4/2N5/PP2PPPP/R1BQKBNR b KQkq - 1 3
rnbqkbnr/ppp2ppp/4p3/3p4/3PP3/8/PPP2PPP/RNBQKBNR w KQkq - 0 3
rnbqkbnr/pp2pppp/3p4/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 0 3
r1bqkbnr/pp1ppppp/2n5/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3
rnbqkbnr/pp1p1ppp/4p3/2p5/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 0 3
r1bqkbnr/1ppp1ppp/p1n5/1B2p3/4P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 0 4
rnbqkbnr/pp1ppppp/2p5/8/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2
rnbqkb1r/pppppppp/5n2/8/8/5N2/PPPPPPPP/RNBQKB1R w KQkq - 2 2
(I'm assuming fishtest gets the book from the Stockfish/books repo?)
This is a bit like startpos, but forces the first ply or first few plies to vary across the most played options. This was used by DeepMind in one of their papers. I don't know if I'll get to doing any tests with this book, but at least it's an option if it's up there.
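Before proposing a small hand-made book like this, a quick structural check of each line can catch copy-paste slips. The sketch below is only a shallow sanity check written for illustration (the function name `fen_is_well_formed` is made up, and this is not how fishtest or cutechess validates books): it checks the field count, that the board has 8 ranks of 8 squares, and that the side to move and move counters look plausible. It does not check legality of the position. Only three of the twelve positions are shown to keep it short.

```python
def fen_is_well_formed(fen: str) -> bool:
    """Shallow structural check of a FEN string (not a legality check)."""
    fields = fen.split()
    if len(fields) != 6:
        return False
    board, side, _castling, _en_passant, halfmove, fullmove = fields

    ranks = board.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch.isdigit():
                squares += int(ch)      # run of empty squares
            elif ch in "pnbrqkPNBRQK":
                squares += 1            # one piece
            else:
                return False
        if squares != 8:
            return False

    return side in ("w", "b") and halfmove.isdigit() and fullmove.isdigit()


# Three of the proposed book positions, copied from the list above.
sample_book = [
    "rnbqkbnr/pppppppp/8/8/2P5/8/PP1PPPPP/RNBQKBNR b KQkq - 0 1",
    "rnbqkb1r/pppppp1p/5np1/8/2PP4/2N5/PP2PPPP/R1BQKBNR b KQkq - 1 3",
    "r1bqkbnr/1ppp1ppp/p1n5/1B2p3/4P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 0 4",
]

for line in sample_book:
    print(fen_is_well_formed(line), line)
```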
One thing that is surprising to me. I was assuming that 10+0.1 on 1 thread would be roughly the same strength as 5+0.05 on 2 threads, maybe 1 or 2 Elo stronger because of the 5% inefficiency on 2 threads, or maybe 5 Elo at STC if the effect was exaggerated at short tc. Doing some tests at home, it looks like 2 threads is ~40 Elo weaker than 1! That's quite a drop??
so yes, I can confirm the ~40 (50) Elo difference in strength. That's actually 80% efficiency. (i.e. similar loss as playing on 8+0.08).
Concerning the book, the 'formal' way would be a PR to the books repo, and I could merge it. I'm a bit reluctant to add books, since we have no real criterion for which books to add and which not, and fishtest would spend as much time on testing books as on testing contempt ;-)
Ok, these sensitivity tests finished.
10+0.1 th 1 : ELO: 45.94 +-1.8 (95%) LOS: 100.0% Total: 60000 W: 17083 L: 9196 D: 33721 Ptnml(0-2): 663, 5068, 12350, 9557, 2362 | 60000 https://tests.stockfishchess.org/tests/view/5eeb8c1cabad5865ae9c790e
5+0.05 th 2 : ELO: 47.66 +-1.8 (95%) LOS: 100.0% Total: 60000 W: 17651 L: 9472 D: 32877 Ptnml(0-2): 695, 5131, 12022, 9604, 2548 | 60000 https://tests.stockfishchess.org/tests/view/5eeb8c67abad5865ae9c7910
Maybe not strong enough to say 2 threads are definitely better than 1, but most likely no worse with a reasonable chance of being slightly better. I take this as another indication that using 2 threads for all tests is feasible at some point in the future, even if we don't implement it right now.
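For anyone who wants to check the numbers, here is a rough sketch that recomputes the Elo estimates from the W/L/D counts above and derives a simple per-game-pair signal-to-noise ratio from the pentanomial counts. The signal-to-noise ratio is only an illustrative proxy for sensitivity and is an assumption on my part; it is not claimed to be the exact quantity shown on the fishtest raw statistics page. The last step combines the two reported +-1.8 error bars in quadrature, which assumes the two tests are independent.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def trinomial_elo(wins: int, losses: int, draws: int) -> float:
    games = wins + losses + draws
    return elo_from_score((wins + 0.5 * draws) / games)


def pair_signal_to_noise(ptnml: list) -> float:
    """(mean - 0.5) / std of the per-pair score, from pentanomial counts.

    ptnml holds the number of game pairs scoring 0, 0.5, 1, 1.5 and 2
    points; the scores are rescaled to [0, 1]. Illustrative proxy only.
    """
    outcomes = [0.0, 0.25, 0.5, 0.75, 1.0]
    pairs = sum(ptnml)
    mean = sum(x * n for x, n in zip(outcomes, ptnml)) / pairs
    var = sum(n * (x - mean) ** 2 for x, n in zip(outcomes, ptnml)) / pairs
    return (mean - 0.5) / math.sqrt(var)


# Counts copied from the two finished tests above.
th1 = {"wld": (17083, 9196, 33721), "ptnml": [663, 5068, 12350, 9557, 2362]}
th2 = {"wld": (17651, 9472, 32877), "ptnml": [695, 5131, 12022, 9604, 2548]}

for name, test in (("10+0.1 th 1", th1), ("5+0.05 th 2", th2)):
    elo = trinomial_elo(*test["wld"])
    snr = pair_signal_to_noise(test["ptnml"])
    print(f"{name}: Elo {elo:.2f}, pair signal/noise {snr:.4f}")

# Difference of the two Elo estimates versus the combined 95% error bar,
# using the reported +-1.8 for each test and assuming independence.
elo_gap = trinomial_elo(*th2["wld"]) - trinomial_elo(*th1["wld"])
combined_error = math.sqrt(1.8 ** 2 + 1.8 ** 2)
print(f"Elo gap {elo_gap:.2f} vs combined error {combined_error:.2f}")
```

If the gap is smaller than the combined error bar, the two results are statistically compatible, which is in line with reading them as "most likely no worse" rather than "definitely better".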
The noob_2moves book has only 7314 lines (vs 151k in noob_3moves), is there interest in doing the same test with that book?
I've created an issue on the stockfish github about this. https://github.com/official-stockfish/Stockfish/issues/2721
Input from people here would be welcome.