Closed (vondele closed this 4 years ago)
@vondele I am rather doubtful that such a procedure is theoretically correct. Even with two engines that are equal in strength, if we keep adding games, we will eventually get a significant result (which would be devoid of meaning).
What you are looking for is a sequential procedure for ranking engines. In the past I have thought about this but I could not find anything in the literature that seemed applicable. But this probably means that I did not know what to look for.
I have given up on ranking N engines. But I can decide for each of the N (N - 1) / 2 pairs which engine is stronger (except for the corner case of engines identical in strength, which I will ignore). What I would like is a process that makes sure each of the N (N - 1) / 2 questions (is i stronger than j?) can be answered with the same confidence... Doesn't the above come close?
@vondele To be honest I am not sure. One could try to simulate the procedure and see if it gives meaningful results.
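A minimal sketch of such a simulation, for the simplest case of one pair of equal-strength engines. The assumptions here are mine, not from this thread: draws are ignored, LOS is computed with the usual normal approximation on decisive games, and play stops as soon as LOS leaves the band [2.5%, 97.5%]. The function names and parameters are hypothetical.

```python
import math
import random


def los(wins, losses):
    """Likelihood of superiority from decisive games (normal approximation)."""
    n = wins + losses
    if n == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * n)))


def false_positive_rate(trials=2000, max_games=500, threshold=0.975, seed=1):
    """Fraction of runs between two EQUAL engines that stop with a 'significant' LOS."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        wins = losses = 0
        for _ in range(max_games):
            # Equal strength: each decisive game is a fair coin flip.
            if rng.random() < 0.5:
                wins += 1
            else:
                losses += 1
            p = los(wins, losses)
            # Check after every game and stop at the first 'significant' value.
            if p > threshold or p < 1.0 - threshold:
                hits += 1
                break
    return hits / trials


if __name__ == "__main__":
    print(f"false-positive rate: {false_positive_rate():.3f}")
```

Under these assumptions a 4-0 start already pushes LOS past 97.5%, so roughly one run in eight stops within the first four games, well above the nominal 5%. That is the concern raised above about adding games until the result looks significant.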
I won't implement this. Producing a LOS matrix (using a sound model like BayesElo) is well beyond the scope of this project. I want to keep c-chess-cli simple and focused.
If it can help, I can produce different output, for example CSV results or whatever can be useful, that another program can use as input to produce advanced statistics.
I think what @vondele is looking for is a generalization of the SPRT to n>1 opponents, where each data point is a gauntlet (1 round, possibly 2 games per encounter with colors reversed). That would be theoretically sound and achieve the desired result.
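For reference, here is a sketch of the ordinary two-engine SPRT that such a generalization would start from. It is not the n>1 version, which nobody in this thread has worked out. As a simplifying assumption, decisive games are treated as Bernoulli trials and draws are ignored; `p0`, `p1`, `alpha` and `beta` are illustrative defaults, and the stopping bounds are Wald's usual approximations.

```python
import math


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's approximate stopping bounds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_llr(wins, losses, p0=0.5, p1=0.55):
    """Log-likelihood ratio of H1 (win prob p1) against H0 (win prob p0)."""
    return wins * math.log(p1 / p0) + losses * math.log((1.0 - p1) / (1.0 - p0))


def sprt_decision(wins, losses, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' for the current score."""
    lower, upper = sprt_bounds(alpha, beta)
    llr = sprt_llr(wins, losses, p0, p1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    print(sprt_decision(200, 100))
```

Unlike checking LOS after every game, the bounds here are chosen so that the error rates stay near `alpha` and `beta` even though the score is inspected continuously.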
OK, closing
This is a suggestion for a feature, an idea triggered by the conversation in https://github.com/lucasart/c-chess-cli/issues/30
What I'm doing quite often now (with NNUE) is running tournaments between multiple different versions of engines, and I'd like to work out a ranking, or more properly a matrix of LOS coefficients. Having that matrix with every entry at either 0% or 100% helps in picking the stronger versions. Neither gauntlets nor round-robins are efficient at getting the matrix to its final state: it matters to play more games when the engines are similar in strength. So, with multiple rounds in the tournament, each new round could pick the engine pair whose LOS is closest to 50% (i.e. least converged to 0% or 100%). This would naturally lead to few games between engines far apart in strength and many games between engines that are close.
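To make the proposal concrete, a rough sketch of the pair selection could look like this. It assumes LOS is computed per pair from decisive games only, using the common normal approximation; that is my assumption, not something c-chess-cli provides. The `results` layout and the function names are hypothetical.

```python
import math
from itertools import combinations


def los(wins, losses):
    """Likelihood of superiority from decisive games; draws are ignored."""
    n = wins + losses
    if n == 0:
        return 0.5  # no information yet: maximally undecided
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * n)))


def next_pair(results, engines):
    """Pick the pair whose LOS is closest to 50%.

    results[(a, b)] = (wins of a over b, wins of b over a), keyed in the
    same order that combinations(engines, 2) produces.
    """
    def distance(pair):
        wins, losses = results.get(pair, (0, 0))
        return abs(los(wins, losses) - 0.5)

    return min(combinations(engines, 2), key=distance)


if __name__ == "__main__":
    engines = ["A", "B", "C"]
    results = {("A", "B"): (30, 10), ("A", "C"): (12, 10), ("B", "C"): (5, 15)}
    print(next_pair(results, engines))  # ('A', 'C')
```

Pairs that have not played yet have LOS 50% and are therefore scheduled first. Note that this only shows the scheduling rule; it does not address whether stopping on LOS is statistically sound.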