The direct estimation of the expected winrate π(s) from the board state s is more accurate than the two-stage indirect mapping from s to E[score] and from E[score] to π(s). Therefore such an approach is expected to be weaker than the current direct approach. One might consider using the probability distribution of scores instead of its expectation value, but then it would be more demanding to generate the training data and to train the NN. If you are interested in an approximate score estimate for a board state, the dynamic komi approach with a sufficiently monotonic NN might be helpful.
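To illustrate the information loss with a toy example (purely illustrative, nothing to do with LZ's actual code): two positions with the same E[score] but a different score spread have very different winrates, so no fixed map from E[score] to π(s) can get both right.

```python
# Toy illustration: same expected score, very different winrates.
import numpy as np

rng = np.random.default_rng(0)

# Position A: quiet endgame, Black ahead by ~2 points with little variance.
scores_a = rng.normal(loc=2.0, scale=1.0, size=100_000)
# Position B: a huge group hangs in the balance, same mean but +/-30 point swings.
scores_b = np.where(rng.random(100_000) < 0.5,
                    rng.normal(32.0, 1.0, 100_000),
                    rng.normal(-28.0, 1.0, 100_000))

for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name, "E[score] =", round(s.mean(), 2),
          "P(Black wins) =", round((s > 0).mean(), 3))
# A: E[score] ~ 2.0, winrate ~ 0.98;  B: E[score] ~ 2.0, winrate ~ 0.50
```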
If someone ever did set up a neural network to predict probability distributions, that would allow some neat things to be done during the tree search.
I suggest some sort of test be done, like the 400-game SPRT tests (between the same network, winrate vs. score). https://github.com/alreadydone has already written some code that does something similar by moving the komi goalposts depending on the current score. It has some problems, like not being able to use a komi less than 0. I talked about maximizing score instead of winning here (maximizing score would win whenever possible under perfect play): https://github.com/gcp/leela-zero/issues/1515
I am very strongly in favour of making the network estimate probability distributions of scores. The current "value head" is doing this for a special case, something like Pr(black is >= 7.5 ahead of white).
It's not clear just what form this should take, though.
The network could estimate, for some range of scores, Pr(B ends up ahead by >= s points). But then the estimates might turn out non-monotonic in s, which is of course impossible for a true distribution, and it's not clear how such output should be interpreted.
The network could estimate, for every possible score, Pr(B ends up ahead by exactly s points). But that feels like too "narrow" a thing to be estimating and I would expect various pathologies.
The network could estimate the parameters in some lower-dimensional set of "typical" probability distributions. E.g., mean and standard deviation. But most likely there will be many situations where whatever family of probability distributions is chosen won't fit well. (E.g., if it predicts just mean and standard deviation, it will have trouble representing situations where there's a big group whose fate hangs in the balance and the result will either be near +30 or near -30.) Just predicting the score is (kinda) the special case where we don't even bother with the standard deviation.
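To make the second option a bit more concrete, here is a rough sketch of a head that predicts Pr(B ends up ahead by exactly s points) over a discretised score range; reading Pr(ahead by >= s) off its cumulative sum sidesteps the non-monotonicity problem by construction. The layer names and the ±60 point range are invented for illustration, not anything that exists in LZ.

```python
# Hypothetical score-distribution head; not LZ code.
import torch
import torch.nn as nn

class ScoreDistributionHead(nn.Module):
    def __init__(self, in_features: int, min_score: int = -60, max_score: int = 60):
        super().__init__()
        self.scores = torch.arange(min_score, max_score + 1, dtype=torch.float32)
        self.fc = nn.Linear(in_features, len(self.scores))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Probability mass over each possible final margin for Black.
        return torch.softmax(self.fc(features), dim=-1)

    def prob_ahead_by_at_least(self, features: torch.Tensor, s: float) -> torch.Tensor:
        pmf = self.forward(features)
        mask = (self.scores >= s).float()   # indicator of margins >= s
        return (pmf * mask).sum(dim=-1)     # monotone in s by construction
```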
Another thing to consider, which isn't quite the same, is for the network to estimate Pr(B wins | komi = x) for various values of x. This is different from Pr(B ends up ahead by > x) because the best way to play depends on the komi value. The resulting values would no longer (quite) form a meaningful probability distribution. But it might avoid needing komi as an input. (I don't think it would, though, because the policy distribution should depend on komi.)
In view of the "dynamic komi" experiments of @alreadydone, perhaps we shouldn't actually call the thing "komi"; call it something like "winning threshold". (Because the idea of "dynamic komi" is to tell LZ that what it's trying to do is to get the score above some threshold. "Normally" this equals the komi, or minus the komi, but sometimes a different value is better to stop it playing stupid moves.)
(Er, it occurs to me that the foregoing may give the impression that I am, or think I am, some sort of computer-go expert; I am purely a dilettante and what I am strongly in favour of neither has nor should have all that much effect on anyone :-).)
The score distribution can be modelled by a Dirichlet mixture of Beta distributions, for example. The real problem is what to feed the NN with, i.e. what the training data would be. It is desirable that a network is trained with actual game results instead of tree search results, but I reckon that it will be very cumbersome and costly to generate such data for every board state.
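As a rough sketch of that mixture idea (my own parameterisation, not LZ code, and assuming we did have a final game result to train against for each position, which is exactly the costly part): rescale the final margin into (0, 1), let the net emit K mixture weights and K (alpha, beta) pairs, and train by maximum likelihood on the observed result.

```python
# Hypothetical mixture-of-Betas score model; purely illustrative.
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def score_to_unit(margin: torch.Tensor, board_points: float = 361.0) -> torch.Tensor:
    # Map a final margin in [-361, 361] into (0, 1) for the Beta support.
    return (margin / board_points + 1.0) / 2.0

def beta_mixture_nll(logits_w, log_alpha, log_beta, final_margin):
    """Negative log-likelihood of the observed final margin under the mixture."""
    w = F.log_softmax(logits_w, dim=-1)            # (batch, K) log mixture weights
    alpha = F.softplus(log_alpha) + 1e-3           # keep shape parameters positive
    beta = F.softplus(log_beta) + 1e-3
    x = score_to_unit(final_margin).clamp(1e-4, 1 - 1e-4).unsqueeze(-1)
    comp_logp = Beta(alpha, beta).log_prob(x)      # (batch, K) per-component log-density
    return -torch.logsumexp(w + comp_logp, dim=-1).mean()
```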
> I am very strongly in favour of making the network estimate probability distributions of scores. The current "value head" is doing this for a special case, something like Pr(black is >= 7.5 ahead of white).
The following paper describes an architecture where a value network outputs a distribution of win rates over different komi values, which is to say over different scores.
@gjm11: does it correspond to what you have in mind?
At least, it seems the closest reference I know of for taking komi into account on the output side. Unlike @alreadydone's solution, a very nice trick on the input side, it needs dedicated training. But it is also reported to increase the network's strength, probably by allowing more information to flow backwards into the value network (it needs to predict the correct outcome not only for a komi of 7.5, but also for 6.5, 5.5, etc., for 41 different komi values).
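For concreteness, a rough sketch of what such a 41-output value head could look like; the specific komi values, shapes and loss below are my own guesses, not taken from the paper. The point about extra information flowing backwards shows up directly: the same final board margin yields a different win/loss label under each komi, so every game provides 41 training signals instead of one.

```python
# Hypothetical multi-komi value head; an illustration, not a faithful copy of the paper.
import torch
import torch.nn as nn

KOMIS = torch.arange(-12.5, 28.5, 1.0)   # 41 assumed komi values

class MultiKomiValueHead(nn.Module):
    def __init__(self, in_features: int):
        super().__init__()
        self.fc = nn.Linear(in_features, len(KOMIS))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One win probability for Black per komi value.
        return torch.sigmoid(self.fc(features))

def multi_komi_loss(pred: torch.Tensor, black_board_margin: torch.Tensor) -> torch.Tensor:
    # Recompute the win/loss label for the same game under every komi value.
    labels = (black_board_margin.unsqueeze(-1) > KOMIS).float()
    return nn.functional.binary_cross_entropy(pred, labels)
```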
Yes, that is very much the sort of thing I have in mind.
If modifications are made to make LZ greedy for more points, I think it would be nice to make keeping the game as short as possible a secondary goal. Even now, the unmodified LZ sometimes starts unnecessary ko fights or makes ko fights longer than needed. So when it always tries to get more points, you can easily imagine a situation where LZ starts a 50-move sequence with multiple kos just to win by 152 points instead of 150. This would be quite annoying for human opponents and spectators.
I support @betterworld's idea to make games shorter. I remember complaining about Leela Zero making games unnecessarily long, and was shot down because dumbpass mode is somehow important (for reasons I do not understand).
My modified idea is to weight the training data: say a game completed in M moves; then the game could have a weight of max(0, 1 - M/300). Thus, over time, the network will learn to prefer shorter games.
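A minimal sketch of that weighting, taking the 300-move scale from the suggestion above; the rest is my own illustration, not existing training code.

```python
# Illustrative game-length weighting for training samples.
def game_weight(num_moves: int, scale: int = 300) -> float:
    # Shorter games get a weight close to 1, games longer than `scale` get 0.
    return max(0.0, 1.0 - num_moves / scale)

# e.g. a 150-move game contributes with weight 0.5, a 320-move game with 0.0;
# the per-sample loss would simply be multiplied by this weight.
print(game_weight(150), game_weight(320))   # 0.5 0.0
```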
The winning percentage is used to win the game. But using points would make it unstable, because the engine would become more belligerent in trying to get more points.
Keeping the games short conflicts with LZ's primary goal of maximizing its winrate. There's currently no way to predict game length, so you're basically asking to re-train LZ with an additional prediction as output. Then you would sometimes compromise winrate in favor of keeping games short. How exactly?
The goal of this project from the start has been to make LZ as strong as possible. Compromising that goal would make a lot of people unhappy.
I don't even understand the suggestion.
If the game is close, LZ could need these kind of sequences to win. If the game is not close at all, the human can resign.
If you just want it to play more aggressively, I think the dynamic komi Leela can fill that purpose, if I'm not mistaken.
Come on, we must've covered this dozens of times:
a) MCTS is significantly stronger if its target is the goal of the game (which is TO WIN). Points don't matter and trying to optimize them (instead of winning) leads to suboptimal play, which is never what you want.
b) If for some aesthetic or analysis reason you want to pretend points matter, this can be done by shifting the goalposts (dynamic komi) and trying to WIN the new game.
> The goal of this project from the start has been to make LZ as strong as possible.
It's strong enough now that making it more useful for analysis is good. But using score in MCTS does not achieve those goals. It just makes it worse.
> It's strong enough now that making it more useful for analysis is good. But using score in MCTS does not achieve those goals. It just makes it worse.
I don't agree, but you seem so determined I don't want to argue.
> I don't agree, but you seem so determined I don't want to argue.
You didn't really explain your point. Why would maximising the score be better? A higher score from a move doesn't necessarily mean a better chance to win the game, and winning is what matters. If you don't want better winrate analysis but a more aggressive review, dynamic komi seems to do the trick.
Advantages
- Can play handicap well, since she would just try to lose by as few points as possible. When the opponent makes a mistake she would punish it as hard as she can, but no trickplays/overplays.
- Also reverse komi is no problem; you wouldn't even have to define the komi. She can just try to lose by as few points as possible. Finally, LZ's answer to the eternal question (which komi is best) will be known (by just looking at Black's lead on the empty board).
- She would always play the endgame at her best. This makes for a better playing experience, and since the endgame is good in self-play, the policy network wouldn't keep learning that "endgame doesn't matter".
- When analyzing my game I won't have to try to understand an arbitrary percentage value; instead I will learn things like "this opening lost 0.2 points, not a problem", "this endgame move was exactly 2 points worse than this one... yeah, I see why".
- We would get kind of the same insights as from Environmental Go https://senseis.xmp.net/?EnvironmentalGo but more precise (for example increases in the value of sente can be observed).
I think I explained my point quite a bit; which part would you like to know more about?
> I don't agree
You don't agree with what?
You can shift komi around and then try to win the new game that results. I am not clear what advantages your method would have over this, but I know the disadvantage is that it leads to weaker play.
On top of that, some of the advantages you list I fundamentally disagree with:
> She would always play the endgame at her best.
This isn't true. She would try to maximize the score, potentially at the cost of allowing the opponent counter-play and maybe losing. That's not "best". Fan Hui explains this in the AlphaGo movie.
If it were best, then we wouldn't have concluded that winrate leads to stronger programs many years ago.
> When analyzing my game I won't have to try to understand an arbitrary percentage value; instead I will learn things like "this opening lost 0.2 points, not a problem"
I am not sure if you can expect such an exact or meaningful mapping to come out.
> Can play handicap well, since she would just try to lose by as few points as possible... but no trickplays/overplays.
I have no opinion here but people have argued both ways that this is the wrong way to play handicap.
Dynamic komi is more flexible since you can shift the program goal by shifting the komi. So, again, what advantages would playing for territory give?
> When analyzing my game I won't have to try to understand an arbitrary percentage value; instead I will learn things like "this opening lost 0.2 points, not a problem", "this endgame move was exactly 2 points worse than this one... yeah, I see why".

This point shows that you didn't listen to GCP: you think you will have a better analysis with score, but it's not true, since GCP explained that the net would play noticeably weaker. You won't have a better analysis of the position; you will have an analysis that you *think* is better, but it will not be. If there were endgame moves which maximized score and were better than what Leela plays now, she would discover them during training, and she would play them. But they are NOT better moves. They are just moves that might give more points WHEN the game ends in a win, but these moves will also result in winning fewer games overall.
Edit - GCP was faster to respond
Let me explain it this way: we try with komi set to 7.5 and get a 45% winrate for white. We set komi to 8.5 and now get a 55% winrate for white. This means white must be about 1 point behind (including the original 7.5 komi).
This means that if we can get winrates and can vary komi, we can get the score information.
But I'm not sure we can go the other way around? If the network gave the position a "white up 0.5 stones" evaluation, we couldn't translate that into a winning chance, could we? The meaning in terms of winning percentage would be pretty different in deep yose versus on move 2.
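For what it's worth, the first direction (winrates at varying komi → score) can be sketched as a simple bisection, assuming the winrate is monotone in komi; the `winrate_for_komi` function below is hypothetical, standing in for a net/search query.

```python
# Illustrative komi bisection to find the 50% crossover, i.e. Black's board lead.
def estimate_black_lead(winrate_for_komi, lo=-60.0, hi=60.0, iters=20):
    """Return the komi at which Black's winrate crosses 0.5."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if winrate_for_komi(mid) > 0.5:   # Black still favoured: raise the komi
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy check with a fake smooth winrate curve centred on a true lead of 8 points:
import math
fake = lambda komi: 1.0 / (1.0 + math.exp(komi - 8.0))
print(round(estimate_black_lead(fake), 2))   # ~8.0
```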
> I am very strongly in favour of making the network estimate probability distributions of scores.
This is a better approach because it avoids some of the problems I mentioned above. Some academic teams are working on this (based on Leela Zero).
And note it's similar to the current komi planes work. By sliding the komi setting there, you exactly get such a distribution.
If she is strong enough (I think so), she knows what is an overplay in the endgame and what is not. And if a play has a 50% chance to win 3 points and a 50% chance to lose 30, she still wouldn't do it (on average a loss).
I don't think a strong bot can even learn to do trickplays, since a trickplay is as much a mistake to it as any other mistake; it doesn't know what tricks the opponent. Therefore the only consistent behaviour is for it not to play tricks as its strength increases. Current Leela thinks the first line is the best trickplay :)
> This point shows that you didn't listen to GCP: you think you will have a better analysis with score, but it's not true, since GCP explained that the net would play noticeably weaker.

I try my very best to listen! And LZ can't be noticeably weaker, not to me; I couldn't even tell whether Leela, LZ, AlphaGo or Go Seigen is stronger from their games.
If dynamic komi is so precise that you can set komi to 8.5 or any other value, then I'm sorry, this is almost as good and should just be implemented in Lizzie.
> And if a play has a 50% chance to win 3 points and a 50% chance to lose 30, she still wouldn't do it (on average a loss).
The problem is that if it's a 2% chance to win by 100 points and a 98% chance to lose by 1 point, she would try to win by 100 points. Apparently that happens often enough that people who did MCTS research easily noticed the strength difference.
ok I get it.
> The problem is that if it's a 2% chance to win by 100 points and a 98% chance to lose by 1 point, she would try to win by 100 points.

But wait, the current LZ would also do that! She would shoot for the small chance to win by a lot. But I get it, you just meant the other way around.
(I still think my idea is better but since people don't seem to agree we can forget it)
As I explained earlier, to estimate (and therefore to maximize) winrate, it is always more accurate to estimate the winrate directly. The two-step mapping, i.e. board state → E[score] → winrate, is trivially less accurate due to the loss of information.
The point wasn't to make winrate more accurate; it was to also get information about the score.
With a probability distribution over win rates, some interesting options open up during the tree search. Can the dynamic komi tricks be used to estimate distributions in such a way?
@gcp: although I agree with you (which amounts to saying that I trust you, in fact, as I'm no expert ;-) ) and @alreadydone proved it workable, I'm wondering about your comment:
> And note it's similar to the current komi planes work. By sliding the komi setting there, you exactly get such a distribution.
I think it is not really equivalent. Learning a distribution of win rates over komi (training the output) would necessarily be done with monotonic training data, leading to the expected monotonicity, even for high komi values. While using the "input komi plane trick" (tweaking the input) is a free lunch in terms of training, you have to take the network as it is, meaning there is no guarantee that this approach will work. ELF's win-rate distribution vs. the komi input plane turned out to be too sharp/non-monotonic to be usable. And not all versions of LZ were smooth/monotonic enough.
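As a sketch of how one might sanity-check a given network for this, sweep the komi plane input and verify Black's reported winrate never increases as komi goes up; `eval_with_komi_plane` is hypothetical, standing in for a net evaluation with the komi input plane set.

```python
# Illustrative monotonicity check over komi-plane inputs.
def is_komi_monotonic(eval_with_komi_plane, position, komis):
    rates = [eval_with_komi_plane(position, k) for k in sorted(komis)]
    # Allow tiny numerical noise; larger upticks mean the trick can't be trusted here.
    return all(later <= earlier + 1e-3 for earlier, later in zip(rates, rates[1:]))
```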
> The problem is that if it's a 2% chance to win by 100 points and a 98% chance to lose by 1 point, she would try to win by 100 points.
> But wait, the current LZ would also do that! She would shoot for the small chance to win by a lot. But I get it, you just meant the other way around.
A better example is that if you use the expected score as evaluation during search, it will prefer {5% winning by 20.5 points, 95% losing by 0.5 points} over 100% winning by 0.5 points.
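Spelling out the arithmetic in that example:

```python
# Expected score prefers the gamble even though it loses 95% of the time.
p_win, win_by, lose_by = 0.05, 20.5, 0.5
expected_score = p_win * win_by - (1 - p_win) * lose_by
print(expected_score)   # 0.55 > 0.5, so "expected score" ranks the gamble
                        # above the certain half-point win
```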
This should be reconsidered, as it has been shown to speed up early training (https://arxiv.org/pdf/1902.10565.pdf), which I think mostly removes the point that it would weaken the AI.
I think it would be better to replace "winrate" by "points ahead/behind". Maybe it would be possible to create such a neural network from current Leela Zero self-play, and then replace the value network with it, or run both at the same time in some smart way. Otherwise these are ideas for a future run or a similar project.
Advantages
Disadvantages
What do you think about it? Is there something I haven't thought of? Has this been done before?