glinscott / leela-chess

**MOVED TO https://github.com/LeelaChessZero/leela-chess** A chess adaptation of GCP's Leela Zero
http://lczero.org
GNU General Public License v3.0

Gating match results shouldn't be used for the strength progression graph #184

Closed. sf-x closed this issue 6 years ago.

sf-x commented 6 years ago

Wat r u doing @glinscott! Stahp! "Pick only the good results" is a textbook example of how not to do statistics. The result is biased! Even in the absence of progress, some parameter sets will win by chance, giving an illusion of progress.

killerducky commented 6 years ago

Google did this in their original AGZ paper, so I don't think it can be too bad.

sf-x commented 6 years ago

Being Google is not an excuse. Copying Google is not an excuse.

kiudee commented 6 years ago

For the progression chart, only the general shape matters. The actual Elo number is incomparable anyway. Calibrated Elo results will be provided by the volunteers running gauntlets.

The problem with playing additional matches to estimate Elo is that it diverts computing power away from self-play games, which are what matters most for improving the strength of LCZero.

sf-x commented 6 years ago

> The actual Elo number is incomparable anyway.

But the differences should be comparable. Now they're not.

isty2e commented 6 years ago

The self-play rating is inaccurate regardless of this. This approach is a compromise between AGZ gating (55% at 400 games) and the AZ approach (no gating), so that networks are promoted unless they underperform. For real-world Elo ratings, people are already running gauntlets and the like.

sf-x commented 6 years ago

> The self-play rating is inaccurate

I reject this claim. It's accurate for Stockfish; LC0 doesn't play like a monkey anymore, so why shouldn't it be accurate?

isty2e commented 6 years ago

Because of self-bias. It has been readily observed for AGZ and LZ, for example.

sf-x commented 6 years ago

For AGZ, isn't it the very factor I described in the first comment, rather than any "self-bias"?

kiudee commented 6 years ago

@sf-x If you can think of a way to accurately estimate the self-play Elo while keeping the number of matches at a minimum, we are happy to accept pull requests.

Edit: The move to SPRT-based gating is currently being worked on in #174. This should further reduce the number of matches.
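
For those unfamiliar with SPRT, here is a minimal sketch of what such gating usually looks like, following the common Fishtest-style trinomial approximation; the hypothesis bounds (elo0, elo1) here are illustrative, and the exact implementation in #174 may well differ.

```python
import math

def expected_score(elo_diff):
    # Logistic model: Elo advantage -> expected score per game.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Log-likelihood ratio of H1 (true gain = elo1) vs. H0 (true gain = elo0),
    using the trinomial approximation popularized by Fishtest."""
    if wins == 0 or losses == 0:
        return 0.0
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n              # mean score per game
    var = (wins + 0.25 * draws) / n - score ** 2  # per-game score variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * score - s0 - s1) * n / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=35.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept' (promote), 'reject', or 'continue'."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= math.log((1.0 - beta) / alpha):   # ~ +2.94
        return "accept"
    if llr <= math.log(beta / (1.0 - alpha)):   # ~ -2.94
        return "reject"
    return "continue"
```

The point is that the test stops as soon as the evidence is strong either way, so clear passes and clear failures need far fewer than 400 games.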

isty2e commented 6 years ago

Even if you are cherry-picking, the error in the accumulated rating is still expected to satisfy a Chernoff bound, for example. Newer nets are, by their nature, more or less specialized in defeating previous versions.

zz4032 commented 6 years ago

That the match results for the progression chart are biased was clear from the start (to me, and I guess to many). It's just not that relevant in this project (or in Leela Zero), because promoting a bad network doesn't hurt and maybe even helps (nobody seems to know exactly), and watching Elo grow is more for fun than for any statistical use. If one wants precise measurement, then after a successful promotion the match should be replayed with a fixed number of games. But the resources are better spent on training games.

sf-x commented 6 years ago

> watching Elo grow is more for fun

It should be renamed from Elo to bullshit then. "The rating of LC0 is 3705 bullshits", hmmm...

zz4032 commented 6 years ago

Compared to more precise rating measurements, the progression chart fits a line of roughly 0.85x + 1266 Elo (calibrated Elo ≈ 0.85 × self-play Elo + 1266). User Uriopass once posted such a graph in the Discord channel. That's enough information for the purpose the graph serves.

jkormu commented 6 years ago

Assume that progress in strength had stopped and each new network were exactly as strong as the last one. When we match them, there is a 50% chance that the new net beats the previous one. The Elo delta is drawn from a normal distribution centered on zero. If the delta is negative the graph is not updated; if it is positive, we update the graph. You would see steady progress in the graph until the end of time. It might be fun to watch, but it does not tell much about the actual progress in strength. And it makes it hard to see when progress stops and it is time to move to a bigger net.
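
To make this concrete, here is a minimal simulation of exactly that scenario. The parameters are illustrative: 400-game matches and an 8% draw rate between equal nets.

```python
import math
import random

GAMES = 400    # games per gating match
DRAW_P = 0.08  # assumed draw rate between two exactly equal nets
MATCHES = 500  # number of candidate nets to test

def match_score():
    """Mean score of one match between two equally strong nets."""
    pts = 0.0
    for _ in range(GAMES):
        r = random.random()
        if r < DRAW_P:
            pts += 0.5                           # draw
        elif r < DRAW_P + (1.0 - DRAW_P) / 2.0:
            pts += 1.0                           # win (by pure chance)
    return pts / GAMES

graph_elo = 0.0
for _ in range(MATCHES):
    s = match_score()
    if s > 0.5:  # gating: record the result only when the new net "wins"
        graph_elo += 400.0 * math.log10(s / (1.0 - s))

print(f"apparent progress after {MATCHES} matches: +{graph_elo:.0f} Elo")
# Typically prints around +3300 Elo of pure noise (~ +6.6 Elo per match).
```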

luigio commented 6 years ago

Could a correction factor be applied to account for this bias? Is there any formula for that?

Also, would we not have the same problem with gating at 55%, as in AGZero? Or does SPRT solve that?

jkormu commented 6 years ago

In principle, the same effect would be there with 55% gating as well, but at least the majority of the false jumps would be filtered out.

jkormu commented 6 years ago

Assuming each new net is exactly as strong as the last, each match is 400 games, and the draw probability is 8%, then in the long run with 50% gating we would see the graph climb towards infinity at an average rate of +6.6 Elo per match.
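
For reference, the arithmetic behind that figure, using a trinomial score model per game:

```latex
% P(win) = P(loss) = 0.46, P(draw) = 0.08
% E[X] = 0.5, \quad E[X^2] = 0.46 + 0.08 \cdot 0.25 = 0.48, \quad \mathrm{Var}[X] = 0.23
\sigma_{\mathrm{score}} = \sqrt{0.23 / 400} \approx 0.024
% Elo sensitivity near a 50% score:
\left.\frac{d\,\mathrm{Elo}}{ds}\right|_{s=0.5} = \frac{1600}{\ln 10} \approx 695
\quad\Rightarrow\quad \sigma_{\mathrm{Elo}} \approx 695 \cdot 0.024 \approx 16.7
% Expected recorded gain per match under 50% gating (mean of a half-normal):
E[\max(\Delta, 0)] = \frac{\sigma_{\mathrm{Elo}}}{\sqrt{2\pi}} \approx 6.6~\mathrm{Elo}
```

By the same numbers, a 55% threshold corresponds to 400 log10(0.55/0.45) ≈ +35 Elo, i.e. about 2.1σ, so only roughly 2% of equal-strength nets would pass by chance, which is why 55% gating filters out most of the false jumps.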

Dorus commented 6 years ago

Based on LZ, I can tell you the self-play rating was 2-3x inflated (it started at around -2400 instead of 0), and new networks, despite scoring 55% or better against their predecessor, sometimes had lower Elo in other test runs against other AIs.

However, the general shape of the self-play graph did align nicely with the general shape of the Elo progress recorded elsewhere. It's true that self-play against just the most recent net is a quick and dirty way to estimate Elo, but 400 match games in any tournament are still going to give you ±30 Elo or so anyway, and this inaccuracy stacks with every new net.

To get a more accurate Elo graph you need to match the network against an established opponent, and none of the LCZ networks are established. The easiest way I can think of is to establish the Elo of a limited number of LCZ nets by running a large number of tournament games against other known bots or players, and to derive the Elo of the remaining nets from the self-play ratings they have between two established nets.
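
A sketch of how that anchoring could work; the anchor values below are made up for illustration (chosen to match the rough 0.85x + 1266 fit mentioned above):

```python
def calibrate(selfplay_elo, anchors):
    """Estimate a net's 'real' Elo by linear interpolation between the two
    nearest anchor nets; values outside the anchor range fall back to a
    line through the first and last anchors.

    anchors: sorted list of (selfplay_elo, gauntlet_elo) pairs for the few
    nets whose strength was established with large gauntlets.
    """
    (x0, y0), (x1, y1) = anchors[0], anchors[-1]
    for (ax, ay), (bx, by) in zip(anchors, anchors[1:]):
        if ax <= selfplay_elo <= bx:
            x0, y0, x1, y1 = ax, ay, bx, by
            break
    t = (selfplay_elo - x0) / (x1 - x0)
    return y0 + t * (y1 - y0)

# Hypothetical anchors: (self-play Elo, gauntlet-calibrated Elo)
anchors = [(0, 1266), (2000, 2966), (4000, 4666)]
print(calibrate(3000, anchors))  # -> 3816.0
```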

jkiliani commented 6 years ago

The point of the progress graph is not to establish accurate Elo ratings for comparison to humans or outside engines. The point is simply to provide a rough way of measuring progress, giving feedback when the network is stalling or regressing. There's absolutely nothing wrong with how things are handled right now in my opinion. When it comes to the point that there is an actual stall, we have plenty of ways to detect that and react to it by changing training parameters or bootstrapping a bigger net. Getting more accurate strength measurement at the expense of using that compute to actually make the engine stronger with self-play is a bad exchange in my estimation, so we have to be somewhat economical with matches.

Error323 commented 6 years ago

I fully agree with this statement by @Dorus. In fact, we've seen cases where very narrow self-play Elo improvements showed much better improvements in tournament-based Elo ratings, and vice versa.

From this, I would conclude that we should not perform gating the way we do now. We're throwing away valuable nets that contain good information, along with the positional variety they would have brought to our training window had self-play games been generated with them. In contrast, we're running more simulations on our current "best" net, which is, I think, sub-optimal for that reason.

Therefore I would like to suggest we only employ gating against absurd drops introduced by bugs (I don't think absurd drops will occur otherwise). E.g. whenever we drop more than x Elo from our max rating, we ignore the net. ~Or we change our matching algorithm to incorporate a real tournament using a good set of chess engines with carefully calibrated Elo ratings.~

Edit: What @luigio said below.

luigio commented 6 years ago

> Or we change our matching algorithm to incorporate a real tournament using a good set of chess engines with carefully calibrated Elo ratings.

That would indirectly introduce "non-zero-like" information into the nets, since bots with human knowledge would decide which nets are valid.

asymptomatic-tomato commented 6 years ago

Just chiming in to say that I'm also in favor of promoting all nets except the obviously bugged ones (which ideally we shouldn't be seeing much of anyway). It's more likely to help than harm things, anyway.

CMCanavessi commented 6 years ago

+1 vote for promoting everything

jkormu commented 6 years ago

There are very good arguments for always promoting, and it also solves the problem with the graph. If the actual progress stalls, then the graph stalls, which is not the case with gating.

zz4032 commented 6 years ago

+1 vote for always promoting. Endgame performance graphs show substantial improvements even for networks that "failed", so they can't be that bad. We have had selective promoting for a while now; let's test always promoting for some weeks and compare later.

jkiliani commented 6 years ago

Promoting everything should be fine, but I think "promoting everything that's not bugged with high likelihood" would probably be better in practice. I'd set a reasonable minimum bar for match performance, maybe a 30% winrate against the best network? Note that this would change how promotion works: the new net would then be used for self-play, but would not automatically become the new standard to evaluate further networks against. Nets should always be evaluated against the strongest net measured so far.
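
A minimal sketch of that rule; the function and state names are hypothetical, and the 30% bar is just the number suggested above:

```python
MIN_SCORE = 0.30  # minimum match score against the reference net

def promote(candidate, score, state):
    """Two separate roles, updated independently:
      - 'selfplay_net': generates training games (any non-bugged net)
      - 'reference_net': strongest net so far, the fixed gating opponent
    `score` is the candidate's match score against the reference net."""
    if score < MIN_SCORE:
        return state                   # likely bugged: reject outright
    state = dict(state, selfplay_net=candidate)
    if score > 0.5:                    # stronger than the reference
        state["reference_net"] = candidate
    return state

state = {"selfplay_net": "net-001", "reference_net": "net-001"}
state = promote("net-002", 0.47, state)  # self-play net updated, reference kept
state = promote("net-003", 0.58, state)  # becomes self-play and reference net
```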

MathAndreas commented 6 years ago

+1 vote for always promoting, also for the sake of the scientific experiment.