> Google did this in their original AGZ paper, so I don't think it can be too bad.
Being Google is not an excuse. Copying Google is not an excuse.
For the progression chart, only the general shape matters. The actual Elo number is incomparable anyway. Calibrated Elo results will be provided by the volunteers running gauntlets.
The problem with additional matches for estimating Elo is that they divert computing power away from self-play games, which are what matter most for improving the strength of LCZero.
> The actual Elo number is incomparable anyway.
But the differences should be comparable. Now they're not.
The self-play rating is inaccurate regardless of this. This approach is a compromise between the AGZ gating (55% over 400 games) and the AZ approach (no gating), so that networks are promoted unless they underperform. For real-world Elo ratings, people are already running gauntlets and the like.
> The self-play rating is inaccurate
I reject this claim. It's accurate for Stockfish; LC0 doesn't play like a monkey anymore, so why shouldn't it be accurate?
Because of the self-bias. It has been readily observed for AGZ and LZ, for example.
For AGZ, isn't it the very factor I described in the first comment, rather than any "self-bias"?
@sf-x If you can think of a way of accurately estimating the self-play Elo while keeping the number of matches at a minimum, we are happy to accept pull requests.
Edit: The move to SPRT-based gating is currently being worked on in #174. This should further reduce the number of matches.
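For reference, here is a minimal sketch of what SPRT-based gating looks like, using a plain Wald SPRT on decisive games only (draws ignored). The hypotheses, thresholds, and function are illustrative assumptions on my part, not the actual scheme in #174:

```python
import math

def sprt(wins, losses, p0=0.50, p1=0.55, alpha=0.05, beta=0.05):
    """Wald SPRT on decisive games only (draws ignored).

    H0: the new net scores p0 against the current best; H1: it scores p1.
    The thresholds here are illustrative, not the ones used in #174.
    """
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    if llr >= upper:
        return "accept"     # promote the new network
    if llr <= lower:
        return "reject"     # keep the old network
    return "continue"       # play more games

# e.g. 120 wins vs 95 losses so far: still inconclusive, keep playing
print(sprt(120, 95))
```

The point is that the test stops as soon as the evidence is decisive either way, so lopsided candidates need far fewer than 400 games.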
Even with cherry-picking, the error in the accumulated rating is still expected to satisfy a Chernoff bound, for example. Newer nets are, by their nature, more or less specialized in defeating previous versions.
That about "biased match results for progression chart" was clear from the start (for me and I guess for many). It's just not that relevant in this project (and Leela Zero too) because promotion of bad network doesn't hurt and maybe even helps (nobody seems to know exactly) and watching Elo grow is more for fun than for any statistical use. If one wants to keep up with precise measurement, then in case of a successful promotion the match should be replayed with a fixed number of games. But the resources are better spent for training games.
> watching Elo grow is more for fun
It should be renamed from Elo to bullshit then. "The rating of LC0 is 3705 bullshits", hmmm...
Compared to more precise rating measurements, the progression chart fits roughly 0.85x + 1266 Elo. User Uriopass posted a graph in the Discord channel once. That's enough information for the purpose the graph serves.
Assume that the progress in strength had stopped and each new network were exactly as strong as the last one. When we match them, there is a 50% chance that the new net beats the previous one. The measured Elo delta is drawn from a normal distribution: if the delta is negative the graph is not updated, and if it is positive we update the graph. You would see steady progress in the graph until the end of time. It might be fun to watch, but it does not tell much about the actual progress in strength, and it makes it hard to see when the progress stops and it is time to move to a bigger net.
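A quick Monte Carlo sketch of this thought experiment, assuming 400-game matches and an 8% draw rate (illustrative numbers, not project code):

```python
import math
import random

random.seed(0)

def measured_elo_delta(games=400, draw_rate=0.08):
    """Elo delta measured from one match between two equally strong nets."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            score += 0.5                         # draw
        elif r < draw_rate + (1 - draw_rate) / 2:
            score += 1.0                         # win (true probability 0.46)
    s = score / games
    return -400 * math.log10(1 / s - 1)          # logistic Elo from match score

chart, promotions = 0.0, 0
for _ in range(1000):
    delta = measured_elo_delta()
    if delta > 0:                                # 50% gating: losers are discarded
        chart += delta
        promotions += 1
print(f"{promotions} promotions, chart climbed {chart:.0f} Elo "
      f"with zero real progress")
```

Roughly half the matches "win" by chance, and only those move the chart, so it ratchets upward forever.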
Could a correction factor be applied to account for this bias? Is there any formula for that?
Also, would we not have the same problem with gating at 55%, as in AGZ? Or does SPRT solve that?
In principle, the same effect would be there with 55% gating as well, but at least the majority of the false jumps would be filtered out.
Assuming each new net is exactly as strong as the last, each match is 400 games, and the draw probability is 8%, then in the long run with 50% gating we would see the graph climb towards infinity at an average rate of +6.6 Elo/match.
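As a hedged analytic check of that figure (and the closest thing to a correction formula I can offer): the match score is approximately normal around 0.5, and with 50% gating only the positive half of the Elo-delta distribution is kept, so the average climb per match is E[max(X, 0)] = sigma / sqrt(2*pi):

```python
import math

games, draw_rate = 400, 0.08
p_win = (1 - draw_rate) / 2                     # 0.46 win, 0.46 loss, 0.08 draw

# Variance of a single game's score (values 1, 0.5, 0 with the probs above)
var_per_game = p_win * 1.0 + draw_rate * 0.25 - 0.5 ** 2
sigma_score = math.sqrt(var_per_game / games)   # std of the 400-game match score

# Slope of the logistic Elo curve -400*log10(1/s - 1) at s = 0.5
elo_per_score = 400 / (math.log(10) * 0.25)
sigma_elo = elo_per_score * sigma_score

print(f"sigma of one match: {sigma_elo:.1f} Elo")                    # ~16.7
print(f"average climb/match: {sigma_elo / math.sqrt(2 * math.pi):.1f} Elo")  # ~6.6
```

Subtracting that expected per-match inflation from the chart would be a crude correction factor under these assumptions.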
Based on LZ, I can tell you the self-play rating was 2-3x inflated: it started at something like -2400 instead of 0, and new networks, despite scoring 55% or better against their predecessor, sometimes had a lower Elo in other test runs against other AIs.
However, the general shape of the self-play graph did align nicely with the general shape of the Elo progress recorded elsewhere. It's true that self-play against just the most recent net is a quick and dirty way to estimate Elo, but then 400 match games in any tournament are still going to give you ±30 Elo or so anyway, and this inaccuracy will stack for every new net.
To get a more accurate Elo graph you need to match the network against an established opponent, and none of the LCZ networks are established. The easiest way I can think of is to establish the Elo of a limited number of LCZ nets by running a large number of tournament games against other known bots or players, and to interpolate the Elo of the remaining nets from the self-play ratings they have between two established nets; see the sketch below.
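A minimal sketch of that anchoring idea; the net IDs and calibrated Elo values are invented for illustration:

```python
def rescale(self_play, anchors):
    """Map self-play Elo to calibrated Elo using two anchor nets.

    self_play: {net_id: self_play_elo}
    anchors:   {net_id: calibrated_elo} for exactly two nets in self_play
    """
    a, b = sorted(anchors, key=lambda n: self_play[n])
    slope = (anchors[b] - anchors[a]) / (self_play[b] - self_play[a])
    return {n: anchors[a] + slope * (e - self_play[a])
            for n, e in self_play.items()}

# Hypothetical example: nets 100 and 200 were calibrated via gauntlets.
self_play = {100: 1500, 150: 1900, 200: 2300}
anchors = {100: 1300, 200: 1980}     # invented calibrated values
print(rescale(self_play, anchors))   # net 150 lands halfway: 1640
```

This only corrects the scale between anchors; any self-bias in the ordering of nets would remain.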
The point of the progress graph is not to establish accurate Elo ratings for comparison to humans or outside engines. The point is simply to provide a rough way of measuring progress, giving feedback when the network is stalling or regressing. There's absolutely nothing wrong with how things are handled right now in my opinion. When it comes to the point that there is an actual stall, we have plenty of ways to detect that and react to it by changing training parameters or bootstrapping a bigger net. Getting more accurate strength measurement at the expense of using that compute to actually make the engine stronger with self-play is a bad exchange in my estimation, so we have to be somewhat economical with matches.
I fully agree with this statement by @Dorus. In fact we've seen that very narrow self-play Elo improvements showed much better improvements in tournament-based Elo ratings, and vice versa.
From this, I would conclude that we should not perform gating the way we do now. We're throwing away valuable nets, along with the good information they contain and the positional variety they would have brought to our training window had self-play games been generated with them. In contrast, we're running more simulations on our current "best" net, which I think is sub-optimal for that reason.
Therefore I would like to suggest we only employ gating against absurd drops introduced by bugs (I don't think absurd drops will occur otherwise), e.g. whenever a net drops more than x Elo from our maximum rating, we ignore it. ~~Or we change our matching algorithm to incorporate a real tournament using a good set of chess engines with carefully calibrated Elo ratings.~~
Edit: What @luigio said below.
> Or we change our matching algorithm to incorporate a real tournament using a good set of chess engines with carefully calibrated Elo ratings.
That would indirectly introduce "non-zero-like" information in the nets, since bots with human knowledge would decide which nets are valid.
Just chiming in to say that I'm also in favor of promoting all nets except the obviously bugged ones (which ideally we shouldn't be seeing much of anyway). It's more likely to help than harm things, anyway.
+1 vote for promoting everything
There are very good arguments for always promoting; it also solves the problem with the graph. If the actual progress stalls, then the graph stalls, which is not the case with gating.
+1 vote for always promoting. Endgame performance graphs show substantial improvements even for networks that "failed", so they can't be that bad. We've had selective promoting for a while now; let's test always promoting for some weeks and compare later.
Promoting everything should be fine, but I think "promoting everything that's not bugged, with high likelihood" would probably be better in practice. I'd set a reasonable minimum bar for match performance, maybe a 30% winrate against the best network? Note that this would change how promotion works: the new net would then be used for self-play, but it would not become the new standard for evaluating further networks against. Nets should always be evaluated against the strongest net measured so far; a sketch of this policy follows below.
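A minimal sketch of that policy, with all thresholds and field names invented for illustration:

```python
MIN_WINRATE = 0.30   # sanity bar against the reference net, per the comment

def update(state, candidate_id, winrate_vs_reference, measured_elo):
    """state = {'self_play_net': id, 'reference_net': id, 'reference_elo': float}"""
    if winrate_vs_reference < MIN_WINRATE:
        return state                                   # likely bugged: ignore it
    state = dict(state, self_play_net=candidate_id)    # always promote for self-play
    if measured_elo > state["reference_elo"]:          # new strongest so far
        state["reference_net"] = candidate_id          # becomes the yardstick
        state["reference_elo"] = measured_elo
    return state
```

Keeping the yardstick separate from the self-play net means a lucky candidate can't lower the bar for everything that comes after it.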
+1 vote for always promoting, also for the sake of the scientific experiment.
Wat r u doing @glinscott! Stahp! "Pick the good results only" is the textbook example of how you shouldn't do statistics. The result is biased! Even in the absence of progress, some parameter sets will win by chance, giving an illusion of progress.