
Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

Are there any ways to calculate the rating difference between AlphaGo Zero and Leela Zero? #2576

Open y-ich opened 4 years ago

y-ich commented 4 years ago

Hi.

First of all, thanks to the team for this great project!

I am interested in whether Leela Zero has exceeded AlphaGo Zero yet. I wonder if it might be possible to find out by analyzing the AlphaGo Zero vs. AlphaGo Master games with Leela Zero.

What do you guys think?

Thanks.

nemja commented 4 years ago

Few metrics exist to compare them.

l1t1 commented 4 years ago

ask deepmind for a binary file... :)

lcwd2 commented 4 years ago

The following is an estimate based on the AGZ elo curve. It might not be accurate.

- AGZ Elo max (5185) - min (approx. -2244) = 7429
- AGZ has 269 upgrades
- The first long AGZ upgrade (taking over 600K games) occurred at AGZ 263
- There is an Elo increase of 206 between AGZ 263 and AGZ 269
- 5.6M games were generated between AGZ 263 and AGZ 269

- LZ has 262 upgrades
- LZ Elo if we compare networks 10 generations apart = 6897
- LZ Elo if we compare networks that are 65% to 95% apart = 8903
- The first long LZ upgrade (taking over 600K games) occurred at LZ 254
- There is no net Elo increase between LZ 254 and LZ 262
- 1.4M games were generated between LZ 254 and LZ 262

I tend to believe that LZ is still 200 to 500 Elo below AGZ. LZ seems to be more efficient in terms of the number of games, probably due to SWA (stochastic weight averaging).
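For anyone who wants to redo this arithmetic, here is a minimal Python sketch using only the figures listed above. It is a rough illustration of the kind of per-upgrade and per-game comparison being made, not a conclusion about relative strength:

```python
# Back-of-the-envelope arithmetic using only the figures quoted above.
agz_range, agz_upgrades = 5185 - (-2244), 269   # ~7429 Elo over 269 networks
lz_range_10gen, lz_upgrades = 6897, 262         # LZ Elo with 10-generation pairing

print(agz_range / agz_upgrades)        # ~27.6 Elo per AGZ upgrade
print(lz_range_10gen / lz_upgrades)    # ~26.3 Elo per LZ upgrade

# Tail of each run:
print(206 / 5.6)   # ~36.8 Elo per million games, AGZ 263 -> AGZ 269
print(0 / 1.4)     # 0.0 Elo per million games, LZ 254 -> LZ 262 (plateau)
```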

langit7 commented 4 years ago

Lc0 (chess) tried to reproduce the AlphaZero vs. Stockfish 8 match; Lc0 is roughly on par with AlphaZero at chess.

nemja commented 4 years ago

I think directly comparing Elo graphs is almost useless, mostly because the measured difference between two nets depends on the visit count used in the match (see earlier issues). IIRC AGZ used quite different visit (playout) limits than LZ, so even for the same strength difference its Elo graph will show different Elo gains.

yssaya commented 4 years ago

AlphaGo Zero's raw network is similar in strength to AlphaGo Fan, per Fig. 6b of the AGZ paper. From CGOS BayesElo, the LZ raw network (1 playout, LZ_258_32e8_r1_p1) is 3080 BayesElo, which is maybe around 7d on KGS. I think AlphaGo Fan is around 8d or 9d, so I think LZ is still 100 or 200 Elo behind.

Network             CGOS BayesElo  Notes
Raw Network of AGZ  4002?          (my old guess)
LZ_258_32e8_r1_p1   3080
Aya798c_F32cn15_5k  2912           KGS 6d?
Aya790e_510_ro_1k   2542           KGS 3d

"Raw Network of AGZ" is my old guess. http://computer-go.org/pipermail/computer-go/2018-January/010682.html CGOS ratings are not very reliable due to many self-play results.

lcwd2 commented 4 years ago

The following are test results on the effect of the number of visits on win% in match games: 400 games per test, and each test was run twice.

lz260 vs lz250

visits    run 1    run 2
-v 1600   50.0%    55.1%
-v 3200   54.7%    53.1%
-v 4800   51.0%    50.6%
-v 6400   49.0%    52.6%

lz160 vs lz150

visits    run 1    run 2
-v 1600   56.3%    60.8%
-v 3200   54.7%    58.1%
-v 4800   56.7%    51.9%
-v 6400   56.9%    59.0%
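As context for how much of the run-to-run spread in these tables is just sampling noise, here is a minimal sketch assuming independent games and ignoring any draw handling:

```python
import math

def margin_95(p, n=400):
    """Approximate 95% margin of error for a win rate p measured over n games."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

print(margin_95(0.53))  # ~0.049, i.e. roughly +/- 5 percentage points per 400-game run
```

With roughly plus or minus 5 percentage points of noise per run, most of the differences between visit settings above are within measurement error.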

In the AGZ paper, Fig. 6a (0.4 s per move = 1600 visits) gives a maximum AGZ Elo of 5062, and Fig. 6b (5 s per move on 4 TPUs, ~80,000 visits) gives 5185, a difference of 123 Elo between the two settings.

The bigger uncertainty is in the pairing of players. Some suggest that matches between players 750 Elo apart (a 98.7% expected score) should not be included, since the result will not be accurate. In the KataGo paper, models 35 generations apart are paired, which led to a highly compressed Elo scale (by over 1000 Elo) in the lower portion of the Elo graph.
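For reference, the 750 Elo to 98.7% correspondence follows from the standard logistic Elo model (assuming that is the model being referenced here):

```python
def expected_score(elo_diff):
    """Expected score of the stronger player under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

print(expected_score(750))  # ~0.987, i.e. the 98.7% quoted above
```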

GnuGo's Elo is 431 according to the AG paper. The AZ paper states that 4 TPUs can do 16K visits per second. The AG team used 0.4 s / 1600 visits (1 TPU) for the Elo progress graph and 5 s on 4 TPUs (est. 80,000 visits) for the final Elo matches.

GnuGo wins 41.8% against lz30 (80,000 visits, 1000 games) and 53.7% against lz30 (1600 visits, 1000 games). From the above, the Elo of lz30 (1600 visits) is estimated to be 348.
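The exact model behind the 348 estimate isn't spelled out here; a generic way to anchor a rating to GnuGo from a head-to-head score is the inverse of the same logistic formula. The sketch below uses only the numbers quoted above and will not necessarily reproduce that estimate (BayesElo priors, draw handling and the like differ):

```python
import math

GNUGO_ELO = 431  # anchor from the AG paper, as quoted above

def elo_gap(p):
    """Elo difference implied by a head-to-head score p (logistic model)."""
    return 400.0 * math.log10(p / (1.0 - p))

# GnuGo scored 53.7% vs lz30 at 1600 visits and 41.8% vs lz30 at 80000 visits.
print(GNUGO_ELO - elo_gap(0.537))  # lz30 @ 1600 visits, roughly 405
print(GNUGO_ELO - elo_gap(0.418))  # lz30 @ 80000 visits, roughly 488
```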

(attached: LZ Elo graph)

AlphaGo Zero (and perhaps AlphaZero) also used 1600 visits for plotting the Elo graph. However, this graph cannot be used for gauging the relative performance of the algorithms because of differences in the self-play games. Leela began with smaller networks, so the initial games were faster. AGZ self-play used 1600 visits and was augmented with 8 symmetries (although the ratio of states used in training may not have been 100%). On the other hand, AZ self-play used only 800 visits without symmetry, and therefore a total of 150m games were needed. In Leela Zero, each training server picks up its own set of random symmetries. The symmetries are exploited collectively by the group of training servers, but individual servers do not necessarily pick up all the symmetries.
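To make the "8 symmetries" concrete, here is a minimal NumPy sketch of the board symmetry group used for augmentation; it is only an illustration, not leela-zero's actual training code:

```python
import numpy as np

def eight_symmetries(board):
    """The 8 symmetries of a square board: 4 rotations, each with an optional flip."""
    syms = []
    for k in range(4):
        rotated = np.rot90(board, k)
        syms.append(rotated)
        syms.append(np.fliplr(rotated))
    return syms

board = np.arange(19 * 19).reshape(19, 19)
print(len(eight_symmetries(board)))  # 8 distinct transforms
```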

The algorithms of LZ and AGZ are similar, so the self-play high-Elo bias should be similar. Nevertheless, we don't have a second high-Elo anchor, so the higher end of the Elo scale is still uncertain.

y-ich commented 4 years ago

Thank you for your comments!

Still, I think a second direct method to gauge strength is to analyze the target's game records. Of course, the first direct method is to play against the target.

(attached screenshot of the analysis) This is the result of analyzing one of the game records between AlphaGo Zero and AlphaGo Master using Leela Zero weights (261) at tens of playouts per move. Look at the bar chart at the top of the screen. It shows, for each move actually played, the difference between the estimated winrate before the move and the estimated winrate after it. Tens of playouts obviously makes for a relatively weak analyzer, and indeed many of the actual moves increased the winrate after they were played. I think this means that both players are stronger than the analyzer.
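To make that bar chart's statistic concrete, here is a minimal sketch of the per-move winrate delta it plots. The input format (a list of Black winrates for the position before each move) is hypothetical, e.g. collected from leelaz analysis output:

```python
def move_deltas(black_winrates, black_moves_first=True):
    """Winrate change caused by each played move, from the mover's perspective.

    black_winrates[i] is the analyzer's Black winrate before move i is played.
    """
    deltas = []
    for i in range(len(black_winrates) - 1):
        black_to_move = (i % 2 == 0) == black_moves_first
        d = black_winrates[i + 1] - black_winrates[i]
        deltas.append(d if black_to_move else -d)
    return deltas

# Many large positive deltas mean the analyzer keeps being positively surprised
# by the actual moves, i.e. both players look stronger than the analyzer.
```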

So I wonder whether we could tell if Leela Zero is stronger than AlphaGo Zero by analyzing the game records between Zero and Master with Leela Zero at a sufficient number of playouts.