pytorch / ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation

Could the ineffectiveness of black's rollouts be an artifact? #139

Open alreadydone opened 5 years ago

alreadydone commented 5 years ago

From the paper:

Intuitively, increasing the number of MCTS iterations (rollout count) improves the AI’s strength by exploring more of the game tree. Toward better understanding the rollout count’s impact on strength, we perform a selfplay analysis with the final model, in which one player uses twice as many MCTS rollouts as the other. We perform this analysis across a wide range of rollout counts (800-25600). From the results shown in Fig. 9, we find that ELF OpenGo consistently enjoys an 80-90% win rate (≈ 250-400 additional ELO) from doubling the number of rollouts as the white player. On the other hand, as the black player, ELF OpenGo only enjoys a 55-75% win rate (≈ 35-200 additional ELO). Moreover, the incremental benefit for black of doubling rollouts shrinks to nearly 50% as the rollout count is increased, suggesting that our model has a skill ceiling with respect to rollout count as the black player. That this skill ceiling is not present as the white player suggests that a 7.5 komi (white score bonus) can be quite significant for black.

I wonder whether this behavior could be related to the way training data is sampled:

Fig. 3(b) shows the progression of the policy and value losses. Note the initial dip in the value loss. This is due to the model overestimating the white win rate, causing the black player to resign prematurely, which reduces the diversity of the games in the replay buffer. This could result in a negative feedback loop and overfitting. ELF OpenGo automatically corrects for this by evenly sampling black-win and white-win games. With this diverse (qualitatively, “healthy”) set of replays, the value loss recovers and stays constant thoroughout the training, showing there is always new information to learn from the replay buffer. (BTW: typo thoroughout)

Theoretical analysis of the effect of ELF's sampling scheme

Let Q = P( black wins | position ), the actual black winrate at a particular position, and let E be the trained black winrate at that position under ELF's sampling scheme. Then E can be calculated as follows.

Let S be the sample size, so there are S/2 samples won by black and S/2 won by white. Among the samples containing that position, S/2 x P( position | black wins ) are won by black and S/2 x P( position | white wins ) are won by white; the total number of samples with that position is the sum. Denoting the overall black winrate P(black wins) by pB, we have

P( position | black wins ) = P( black wins | position ) x P(position) / P(black wins) = P(pos) x Q / pB
P( position | white wins ) = P( white wins | position ) x P(position) / P(white wins) = P(pos) x (1-Q) / (1-pB)

We immediately see that the fraction of black-win samples among samples with that position is

E = (Q/pB) / (Q/pB + (1-Q)/(1-pB)) = Q / (Q + r(1-Q)) = 1 / (1 + r(1/Q - 1)),

where r = pB/pW = pB/(1-pB) = 1/pW - 1. This can be written in the more symmetric form 1/E - 1 = r(1/Q - 1), or (1/pB - 1)(1/E - 1) = 1/Q - 1.
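As a quick sanity check, here is a minimal Python sketch of the resulting Q -> E transform (the function name and the example value pB = 0.4 are mine, purely for illustration):

```python
def q_to_e(q, pb):
    """Winrate the net is trained toward under even black-win/white-win sampling.

    q  : true P(black wins | position), in [0, 1]
    pb : overall black winrate P(black wins) over the self-play games
    """
    r = pb / (1.0 - pb)              # r = pB / pW
    return q / (q + r * (1.0 - q))

# Illustration with an arbitrary pB = 0.4: a position whose true winrate equals
# the overall winrate (Q = pB) is reported as E = 0.5, and the symmetric form
# (1/pB - 1)(1/E - 1) = 1/Q - 1 holds.
pb = 0.4
for q in (0.1, 0.4, 0.9):
    e = q_to_e(q, pb)
    lhs = (1.0 / pb - 1.0) * (1.0 / e - 1.0)
    rhs = 1.0 / q - 1.0
    print(f"Q = {q:.2f} -> E = {e:.3f}   (1/pB-1)(1/E-1) = {lhs:.3f} = 1/Q-1 = {rhs:.3f}")
```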

So the effect of this sampling scheme is to transform Q into E by this formula. Note that in the tree search we always pick the move with the largest Q + U, which is now replaced by E + U, so the increment Δ(Q+U) is replaced by Δ(E+U) = ΔE + ΔU = (ΔE/ΔQ)ΔQ + ΔU. It's common to face a decision between two moves of close winrates, so we can assume ΔQ is small and approximate ΔE/ΔQ by dE/dQ, giving ΔE + ΔU = (dE/dQ)(ΔQ + ΔU/(dE/dQ)). We see that picking the move based on E + U is the same as picking based on Q + U/(dE/dQ), i.e. replacing Q by E has the same effect as dividing c = c_puct = 1.5 by dE/dQ; I'll call the result the effective c_puct.

It is easy to compute dE/dQ = r / (Q + r(1-Q))^2, which is monotonically decreasing in Q (since r < 1). At Q = 0 (white wins) we have dE/dQ = 1/r > 1; at Q = 1 (black wins) we have dE/dQ = r < 1; and at the initial position (Q = pB) we have dE/dQ = 1/(4 x pB x pW).
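To get a feel for the size of this distortion, a small sketch (again my own illustration; pB is arbitrary and c = 1.5 as above):

```python
def de_dq(q, pb):
    """Derivative dE/dQ of the sampling transform E = Q / (Q + r(1-Q))."""
    r = pb / (1.0 - pb)
    return r / (q + r * (1.0 - q)) ** 2

def effective_cpuct(q, pb, c=1.5):
    """Selecting moves on E + U behaves like selecting on Q + U with this c_puct."""
    return c / de_dq(q, pb)

pb = 0.4  # arbitrary overall black winrate for illustration
for q, label in [(0.0, "white wins"), (pb, "initial position"), (1.0, "black wins")]:
    print(f"Q = {q:.2f} ({label}): dE/dQ = {de_dq(q, pb):.3f}, "
          f"effective c_puct = {effective_cpuct(q, pb):.3f}")
```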

For black to win, its winrate needs to march from Q = pB to Q = 1, and the effective c_puct would then increase from the initial value to c/r. On the other hand, if white is to win, the effective c_puct would decrease from the initial value to cr. As is well known, c_puct balances exploration and exploitation and is crucial to effective scaling of strength with rollouts. Therefore, if a higher c_puct is detrimental and a lower c_puct is beneficial to rollout-strength scaling, then black would be at a disadvantage.

As an example, the released v2 model, #1290, produces 41.4% of self-play games won by black, so pB = 0.414 and r = 0.706. Between Q = 0 and Q = 1, the effective c_puct values differ by a factor of 1/r^2 ≈ 2.00, which is quite significant. The penultimate model #1499 is even more extreme at pB = 0.329.
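For what it's worth, a two-line check of these numbers (my own snippet):

```python
for pb in (0.414, 0.329):
    r = pb / (1.0 - pb)
    print(f"pB = {pb}: r = {r:.3f}, effective c_puct ratio 1/r^2 = {1.0 / r**2:.2f}")
```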

Suggestions

Thank you for the release of the datasets, models, and the wonderful paper, and I just want to make it even better!

l1t1 commented 5 years ago

thoughtful

yuandong-tian commented 5 years ago

Very good analysis! The intuition is that a forced even distribution of black/white win implicitly uses a posterior value for the value estimate. Note that the value function in OpenGo is [-1, 1] so you might need to make some changes in your analysis.

This might also explain the auto-tuning/zig-zag effects: whenever P(blackwin) < 0.5, the posterior P(position | black_win) is larger than it should be, which could cause white to resign on more games and reverse this trend (i.e., negative feedback).

An evaluation test can be done by using player-sensitive correction (which is a multiplication of the current Q value by a constant). This should be easy to do. On the other hand, if you convert it to p_uct, it would be quite nonlinear.

(I feel that I was reading a thermodynamics paper :D)

yuandong-tian commented 5 years ago

On the other hand, I am not sure whether this can affect black rollout efficiency. It might be the case that since black's Q is biased, more rollouts do not give the correct estimation of V (essentially V^pi) and thus more rollouts do not help.

alreadydone commented 5 years ago

I'm glad you find this analysis interesting! The choice of letters is a bit funny in hindsight :)

The negative feedback through resignation is a good point and it likely has a huge impact; I never thought of that.

I don't think the correction needs to be player-sensitive; it's as easy as inverting the one-to-one transformation Q -> E to get Q = rE / (rE + 1 - E). You can't quite do the correction by multiplying by a constant, since dQ/dE varies with E (or Q). (With Q and E in [-1, 1], the formula becomes Q = (r(1+E) - (1-E)) / (r(1+E) + (1-E)).)
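A minimal sketch of this inverse correction under the formulas above (function names are mine; it assumes only r = pB/(1-pB)):

```python
def e_to_q(e, pb):
    """Invert the sampling transform for winrates in [0, 1]."""
    r = pb / (1.0 - pb)
    return r * e / (r * e + 1.0 - e)

def e_to_q_signed(e, pb):
    """The same inversion for values in [-1, 1], the range of OpenGo's value head."""
    r = pb / (1.0 - pb)
    return (r * (1.0 + e) - (1.0 - e)) / (r * (1.0 + e) + (1.0 - e))

# Round-trip check against the forward transform E = Q / (Q + r(1-Q)).
pb = 0.414
r = pb / (1.0 - pb)
for q in (0.2, 0.5, 0.8):
    e = q / (q + r * (1.0 - q))
    assert abs(e_to_q(e, pb) - q) < 1e-9
    assert abs(e_to_q_signed(2 * e - 1, pb) - (2 * q - 1)) < 1e-9
print("Both inversions round-trip correctly.")
```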

I think the incorrectness of the winrate alone cannot explain the black rollout inefficiency. Black's and white's winrates are affected equally by the sampling scheme: when Q is replaced by E, 1-Q is also replaced by 1-E. I chose to use black's winrate only for convenience; the winrate is as incorrect for black as it is for white. The interval [0,1] is simply transformed one-to-one onto [0,1], extended (dE/dQ > 1) near 0 and compressed (dE/dQ < 1) near 1, which can roughly equivalently be thought of as making c_puct a function of Q.

The only disparity between black and white is that when black is leading the winrate gets compressed (or c_puct gets bigger), and when white is leading the winrate gets extended (or c_puct gets smaller). The higher c_puct could be unfavorable for rollout efficiency (which is only my hypothesis) and lower the strength gap between black with higher rollouts and white with lower rollouts, leading to black failing to convert a favorable position into a win. The ratio of the maximum extension to the maximum compression is 1/r^2, which is independent of whether I take Q to lie in [0,1] or [-1,1].

yuandong-tian commented 5 years ago

ah, ic. Yes, the final trained value function is actually not a proper posterior. So the transformation is nonlinear and thus is Q (or E) dependent. Applying the transformation back would yield the correct answer (rather than using the effective p_uct, which stems from a local linear approximation).

yuandong-tian commented 5 years ago

The analysis on black rollout efficiency would be dependent on where the optimal p_uct is. It is hard to draw a conclusion here. We might need to correct the value function first in the evaluation and then check whether this still exists.

alreadydone commented 5 years ago

Today the results of an independent test of rollout efficiency were published on the LZ forum https://github.com/leela-zero/leela-zero/issues/2236. From the diagrams/graphs, ELF v0 and v1 (but not v2) were included in the tests, presumably using the LZ engine, with each model (as black) at varied rollout counts challenging v0 (as white). It seems that v0 and v1 also enjoyed a ~250 Elo gain as black per rollout doubling. The difference between the LZ and ELF engines should just be the FPU reduction (which makes the search a bit narrower) and a higher c_puct (0.8, equivalent to 1.6 in ELF). Not sure what can be concluded yet.

alreadydone commented 5 years ago

I submitted applications to three positions (SWE or AI) at Facebook 3 weeks ago (with the email xu56@indiana.edu) through Facebook Careers and still haven't received any response. Would be nice if someone can ping a recruiter to give me a response. Sorry for the OT.