The direct estimation of the expected winrate π(s) from the board state s is more accurate than the two-stage indirect mapping from s to E[score] and from E[score] to π(s). Therefore such an approach is expected to be weaker than the current direct approach. One might consider using the probability distribution of scores instead of its expectation value, but then it would be more demanding to generate the training data and to train the NN. If you are interested in an approximate score estimate for a board state, the dynamic komi approach with a sufficiently monotonic NN might be helpful.
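To illustrate the information loss with a toy example (purely illustrative, nothing to do with LZ's actual code): two positions with the same E[score] but a different score spread have very different winrates, so no fixed map from E[score] to π(s) can get both right.

```python
# Toy illustration: same expected score, very different winrates.
import numpy as np

rng = np.random.default_rng(0)

# Position A: quiet endgame, Black ahead by ~2 points with little variance.
scores_a = rng.normal(loc=2.0, scale=1.0, size=100_000)
# Position B: a huge group hangs in the balance, same mean but +/-30 point swings.
scores_b = np.where(rng.random(100_000) < 0.5,
                    rng.normal(32.0, 1.0, 100_000),
                    rng.normal(-28.0, 1.0, 100_000))

for name, s in [("A", scores_a), ("B", scores_b)]:
    print(name, "E[score] =", round(s.mean(), 2),
          "P(Black wins) =", round((s > 0).mean(), 3))
# A: E[score] ~ 2.0, winrate ~ 0.98;  B: E[score] ~ 2.0, winrate ~ 0.50
```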
If someone ever did set up a neural network to predict probability distributions, that would allow some neat things to be done during the tree search.
I suggest some sort of test be done, like the 400-game SPRT tests (between the same network, winrate vs. score). https://github.com/alreadydone has already written some code that does something similar by moving the komi goalposts depending on the current score. It has some problems, like not being able to use a komi less than 0. I talked about maximizing score instead of winning here (maximizing score would win whenever possible under perfect play): https://github.com/gcp/leela-zero/issues/1515
I am very strongly in favour of making the network estimate probability distributions of scores. The current "value head" is doing this for a special case, something like Pr(black is >= 7.5 ahead of white).
It's not clear just what form this should take, though.
The network could estimate, for some range of scores, Pr(B ends up ahead by >= s points). But then the estimates might turn out non-monotonic in s, which is of course impossible for a true distribution, and it's not clear how such output should be interpreted.
The network could estimate, for every possible score, Pr(B ends up ahead by exactly s points). But that feels like too "narrow" a thing to be estimating and I would expect various pathologies.
The network could estimate the parameters in some lower-dimensional set of "typical" probability distributions. E.g., mean and standard deviation. But most likely there will be many situations where whatever family of probability distributions is chosen won't fit well. (E.g., if it predicts just mean and standard deviation, it will have trouble representing situations where there's a big group whose fate hangs in the balance and the result will either be near +30 or near -30.) Just predicting the score is (kinda) the special case where we don't even bother with the standard deviation.
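To make the second option a bit more concrete, here is a rough sketch of a head that predicts Pr(B ends up ahead by exactly s points) over a discretised score range; reading Pr(ahead by >= s) off its cumulative sum sidesteps the non-monotonicity problem by construction. The layer names and the ±60 point range are invented for illustration, not anything that exists in LZ.

```python
# Hypothetical score-distribution head; not LZ code.
import torch
import torch.nn as nn

class ScoreDistributionHead(nn.Module):
    def __init__(self, in_features: int, min_score: int = -60, max_score: int = 60):
        super().__init__()
        self.scores = torch.arange(min_score, max_score + 1, dtype=torch.float32)
        self.fc = nn.Linear(in_features, len(self.scores))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Probability mass over each possible final margin for Black.
        return torch.softmax(self.fc(features), dim=-1)

    def prob_ahead_by_at_least(self, features: torch.Tensor, s: float) -> torch.Tensor:
        pmf = self.forward(features)
        mask = (self.scores >= s).float()   # indicator of margins >= s
        return (pmf * mask).sum(dim=-1)     # monotone in s by construction
```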
Another thing to consider, which isn't quite the same, is for the network to estimate Pr(B wins | komi = x) for various values of x. This is different from Pr(B ends up ahead by > x) because the best way to play depends on the komi value. The resulting values would no longer (quite) form a meaningful probability distribution. But it might avoid needing komi as an input. (I don't think it would, though, because the policy distribution should depend on komi.)
In view of the "dynamic komi" experiments of @alreadydone, perhaps we shouldn't actually call the thing "komi"; call it something like "winning threshold". (Because the idea of "dynamic komi" is to tell LZ that what it's trying to do is to get the score above some threshold. "Normally" this equals the komi, or minus the komi, but sometimes a different value is better to stop it playing stupid moves.)
(Er, it occurs to me that the foregoing may give the impression that I am, or think I am, some sort of computer-go expert; I am purely a dilettante and what I am strongly in favour of neither has nor should have all that much effect on anyone :-).)
The score distribution can be modelled by a Dirichlet mixture of Beta distributions, for example. The real problem is what to feed the NN with, i.e. what the training data would be. It is desirable that a network is trained with actual game results instead of tree search results, but I reckon that it will be very cumbersome and costly to generate such data for every board state.
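As a rough sketch of that mixture idea (my own parameterisation, not LZ code, and assuming we did have a final game result to train against for each position, which is exactly the costly part): rescale the final margin into (0, 1), let the net emit K mixture weights and K (alpha, beta) pairs, and train by maximum likelihood on the observed result.

```python
# Hypothetical mixture-of-Betas score model; purely illustrative.
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def score_to_unit(margin: torch.Tensor, board_points: float = 361.0) -> torch.Tensor:
    # Map a final margin in [-361, 361] into (0, 1) for the Beta support.
    return (margin / board_points + 1.0) / 2.0

def beta_mixture_nll(logits_w, log_alpha, log_beta, final_margin):
    """Negative log-likelihood of the observed final margin under the mixture."""
    w = F.log_softmax(logits_w, dim=-1)            # (batch, K) log mixture weights
    alpha = F.softplus(log_alpha) + 1e-3           # keep shape parameters positive
    beta = F.softplus(log_beta) + 1e-3
    x = score_to_unit(final_margin).clamp(1e-4, 1 - 1e-4).unsqueeze(-1)
    comp_logp = Beta(alpha, beta).log_prob(x)      # (batch, K) per-component log-density
    return -torch.logsumexp(w + comp_logp, dim=-1).mean()
```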
> I am very strongly in favour of making the network estimate probability distributions of scores. The current "value head" is doing this for a special case, something like Pr(black is >= 7.5 ahead of white).
The following paper describes an architecture where a value network outputs a distribution of win rates over different komi values, which is to say over different scores.
@gjm11: does it correspond to what you have in mind?
At least, it seems the closest reference I know of for taking komi into account on the output side. Unlike @alreadydone's solution, a very nice trick on the input side, it needs dedicated training. But it is also reported to increase the network's strength, probably by allowing more information to flow backwards into the value network (it needs to predict the correct outcome not only for a komi of 7.5, but also for 6.5, 5.5, etc., for 41 different komi values).
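For concreteness, a rough sketch of what such a 41-output value head could look like; the specific komi values, shapes and loss below are my own guesses, not taken from the paper. The point about extra information flowing backwards shows up directly: the same final board margin yields a different win/loss label under each komi, so every game provides 41 training signals instead of one.

```python
# Hypothetical multi-komi value head; an illustration, not a faithful copy of the paper.
import torch
import torch.nn as nn

KOMIS = torch.arange(-12.5, 28.5, 1.0)   # 41 assumed komi values

class MultiKomiValueHead(nn.Module):
    def __init__(self, in_features: int):
        super().__init__()
        self.fc = nn.Linear(in_features, len(KOMIS))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # One win probability for Black per komi value.
        return torch.sigmoid(self.fc(features))

def multi_komi_loss(pred: torch.Tensor, black_board_margin: torch.Tensor) -> torch.Tensor:
    # Recompute the win/loss label for the same game under every komi value.
    labels = (black_board_margin.unsqueeze(-1) > KOMIS).float()
    return nn.functional.binary_cross_entropy(pred, labels)
```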
Yes, that is very much the sort of thing I have in mind.
If modifications are made to make LZ greedy for more points, I think it would be nice to make keeping the game as short as possible a secondary goal. Even now, the unmodified LZ sometimes starts unnecessary ko fights or makes ko fights longer than needed. So when it always tries to get more points, you can easily imagine a situation where LZ starts a 50-move sequence with multiple kos just to win by 152 points instead of 150. This would be quite annoying for human opponents and spectators.
I support @betterworld's idea to make games shorter. I remember complaining about Leela Zero making games unnecessarily long, and was shot down because dumbpass mode is somehow important (for reasons I do not understand).
My modified idea is to weight the training data: say a game completed in M moves; then the game could have a weight of max(0, 1 - M/300). Thus, over time, the network will learn to prefer shorter games.
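A minimal sketch of that weighting, taking the 300-move scale from the suggestion above; the rest is my own illustration, not existing training code.

```python
# Illustrative game-length weighting for training samples.
def game_weight(num_moves: int, scale: int = 300) -> float:
    # Shorter games get a weight close to 1, games longer than `scale` get 0.
    return max(0.0, 1.0 - num_moves / scale)

# e.g. a 150-move game contributes with weight 0.5, a 320-move game with 0.0;
# the per-sample loss would simply be multiplied by this weight.
print(game_weight(150), game_weight(320))   # 0.5 0.0
```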
The winning percentage is used to win the game. But using points would make it unstable, because the engine would become more belligerent in trying to get more points.
Keeping the games short conflicts with LZ's primary goal of maximizing its winrate. There's currently no way to predict game length, so you're basically asking to re-train LZ with an additional prediction as output. Then you would sometimes compromise winrate in favor of keeping games short. How exactly?
The goal of this project from the start has been to make LZ as strong as possible. Compromising that goal would make a lot of people unhappy.
I don't even understand the suggestion.
If the game is close, LZ could need these kind of sequences to win. If the game is not close at all, the human can resign.
If you just want it to play more aggressively, I think the dynamic komi Leela can fill that purpose, if I'm not mistaken.
Come on, we must've covered this dozens of times:
a) MCTS is significantly stronger if its target is the goal of the game (which is TO WIN). Points don't matter and trying to optimize them (instead of winning) leads to suboptimal play, which is never what you want.
b) If for some aesthetic or analysis reason you want to pretend points matter, this can be done by shifting the goalposts (dynamic komi) and trying to WIN the new game.
> The goal of this project from the start has been to make LZ as strong as possible.
It's strong enough now that making it more useful for analysis is good. But using score in MCTS does not achieve those goals. It just makes it worse.
> It's strong enough now that making it more useful for analysis is good. But using score in MCTS does not achieve those goals. It just makes it worse.
I don't agree, but you seem so determined I don't want to argue.
> I don't agree, but you seem so determined I don't want to argue.
You didn't really explain your point. Why would maximising the score be better? A higher score from a move doesn't necessarily mean a better chance to win the game, and winning is what matters. If you don't want better winrate analysis but a more aggressive review, dynamic komi seems to do the trick.
Advantages
- Can play handicap well, since she would just try to lose by as few points as possible. When the opponent makes a mistake she would punish it as hard as she can, but no trickplays/overplays.
- Also reverse komi is no problem; you wouldn't even have to define the komi. She can just try to lose by as few points as possible. Finally, LZ's answer to the eternal question (which komi is best) will be known (by just looking at Black's lead on the empty board).
- She would always play the endgame at her best. This makes for a better playing experience, and since the endgame is good in self-play, the policy network wouldn't keep learning that "endgame doesn't matter".
- When analyzing my game I won't have to try to understand an arbitrary percentage value; instead I will learn things like "this opening lost 0.2 points, not a problem", "this endgame move was exactly 2 points worse than this one... yeah, I see why".
- We would get kind of the same insights as from Environmental Go https://senseis.xmp.net/?EnvironmentalGo but more precise (for example increases in the value of sente can be observed).
I think I explained my point quite a bit; which part would you like to know more about?
> I don't agree
You don't agree with what?
You can shift komi around and then try to win the new game that results. I am not clear what advantages your method would have over this, but I know the disadvantage is that it leads to weaker play.
On top of that, some of the advantages you list I fundamentally disagree with:
> She would always play the endgame at her best.
This isn't true. She would try to maximize the score, potentially at the cost of allowing the opponent counter-play and maybe losing. That's not "best". Fan Hui explains this in the AlphaGo movie.
If it were best, then we wouldn't have concluded that winrate leads to stronger programs many years ago.
> When analyzing my game I won't have to try to understand an arbitrary percentage value; instead I will learn things like "this opening lost 0.2 points, not a problem"
I am not sure if you can expect such an exact or meaningful mapping to come out.
> Can play handicap well, since she would just try to lose by as few points as possible... but no trickplays/overplays.
I have no opinion here but people have argued both ways that this is the wrong way to play handicap.
Dynamic komi is more flexible since you can shift the program goal by shifting the komi. So, again, what advantages would playing for territory give?
> When analyzing my game I won't have to try to understand an arbitrary percentage value; instead I will learn things like "this opening lost 0.2 points, not a problem", "this endgame move was exactly 2 points worse than this one... yeah, I see why".

This point shows that you didn't listen to GCP: you think you will have a better analysis with score, but it's not true, since GCP explained that the net would play noticeably weaker. You won't have a better analysis of the position; you will have an analysis that you *think* is better, but it will not be. If there were endgame moves which maximized score and were better than what Leela plays now, she would discover them during training, and she would play them. But they are NOT better moves. They are just moves that might give more points WHEN the game ends in a win, but these moves will also result in winning fewer games overall.
Edit - GCP was faster to respond
Let me explain it this way: we try with komi set to 7.5 and get a 45% winrate for white. We set komi to 8.5 and now get a 55% winrate for white. This means white must be about 1 point behind (including the original 7.5 komi).
This means that if we can get winrates and can vary komi, we can get the score information.
But I'm not sure we can go the other way around? If the network gave the position a "white up 0.5 stones" evaluation, we couldn't translate that into a winning chance, could we? The meaning in terms of winning percentage would be pretty different in deep yose versus on move 2.
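For what it's worth, the first direction (winrates at varying komi → score) can be sketched as a simple bisection, assuming the winrate is monotone in komi; the `winrate_for_komi` function below is hypothetical, standing in for a net/search query.

```python
# Illustrative komi bisection to find the 50% crossover, i.e. Black's board lead.
def estimate_black_lead(winrate_for_komi, lo=-60.0, hi=60.0, iters=20):
    """Return the komi at which Black's winrate crosses 0.5."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if winrate_for_komi(mid) > 0.5:   # Black still favoured: raise the komi
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy check with a fake smooth winrate curve centred on a true lead of 8 points:
import math
fake = lambda komi: 1.0 / (1.0 + math.exp(komi - 8.0))
print(round(estimate_black_lead(fake), 2))   # ~8.0
```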
> I am very strongly in favour of making the network estimate probability distributions of scores.
This is a better approach because it avoids some of the problems I mentioned above. Some academic teams are working on this (based on Leela Zero).
And note it's similar to the current komi planes work. By sliding the komi setting there, you exactly get such a distribution.
If she is strong enough (I think so), she knows what is an overplay in the endgame and what is not. And if a play has a 50% chance to win 3 points and a 50% chance to lose 30, she still wouldn't do it (on average a loss).
I don't think a strong bot can even learn to do trickplays, since a trickplay is as much a mistake to it as any other mistake; it doesn't know what tricks the opponent. Therefore the only consistent behaviour is for it not to play tricks as its strength increases. Current Leela thinks the first line is the best trickplay :)
> This point shows that you didn't listen to GCP: you think you will have a better analysis with score, but it's not true, since GCP explained that the net would play noticeably weaker.

I try my very best to listen! And LZ can't be noticeably weaker, not to me; I couldn't even tell whether Leela, LZ, AlphaGo or Go Seigen is stronger from their games.
If dynamic komi is so precise that you can set komi to 8.5 or any other value, then I'm sorry, this is almost as good and should just be implemented in Lizzie.
> And if a play has a 50% chance to win 3 points and a 50% chance to lose 30, she still wouldn't do it (on average a loss).
The problem is that if it's a 2% chance to win by 100 points and a 98% chance to lose by 1 point, she would try to win by 100 points. Apparently that happens often enough that people who did MCTS research easily noticed the strength difference.
ok I get it.
> The problem is that if it's a 2% chance to win by 100 points and a 98% chance to lose by 1 point, she would try to win by 100 points.

But wait, the current LZ would also do that! She would shoot for the small chance to win by a lot. But I get it, you just meant the other way around.
(I still think my idea is better but since people don't seem to agree we can forget it)
As I explained earlier, to estimate (and therefore to maximize) winrate, it is always more accurate to estimate the winrate directly. The two-step mapping, i.e. board state → E[score] → winrate, is trivially less accurate due to the loss of information.
The point wasn't to make winrate more accurate; it was to also get information about the score.
With a probability distribution over win rates, some interesting options open up during the tree search. Can the dynamic komi tricks be used to estimate distributions in such a way?
@gcp: although I agree with you (which amounts to saying that I trust you, in fact, as I'm no expert ;-) ) and @alreadydone proved it workable, I'm wondering about your comment:
> And note it's similar to the current komi planes work. By sliding the komi setting there, you exactly get such a distribution.
I think it is not really equivalent. Learning a distribution of win rates over komi (training the output) would necessarily be done with monotonic training data, leading to the expected monotonicity, even for high komi values. While using the "input komi plane trick" (tweaking the input) is a free lunch in terms of training, you have to take the network as it is, meaning there is no guarantee that this approach will work. ELF's win-rate distribution vs. the komi input plane turned out to be too sharp/non-monotonic to be usable. And not all versions of LZ were smooth/monotonic enough.
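As a sketch of how one might sanity-check a given network for this, sweep the komi plane input and verify Black's reported winrate never increases as komi goes up; `eval_with_komi_plane` is hypothetical, standing in for a net evaluation with the komi input plane set.

```python
# Illustrative monotonicity check over komi-plane inputs.
def is_komi_monotonic(eval_with_komi_plane, position, komis):
    rates = [eval_with_komi_plane(position, k) for k in sorted(komis)]
    # Allow tiny numerical noise; larger upticks mean the trick can't be trusted here.
    return all(later <= earlier + 1e-3 for earlier, later in zip(rates, rates[1:]))
```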
> The problem is that if it's a 2% chance to win by 100 points and a 98% chance to lose by 1 point, she would try to win by 100 points.
> But wait, the current LZ would also do that! She would shoot for the small chance to win by a lot. But I get it, you just meant the other way around.
A better example is that if you use the expected score as evaluation during search, it will prefer {5% winning by 20.5 points, 95% losing by 0.5 points} over 100% winning by 0.5 points.
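Spelling out the arithmetic in that example:

```python
# Expected score prefers the gamble even though it loses 95% of the time.
p_win, win_by, lose_by = 0.05, 20.5, 0.5
expected_score = p_win * win_by - (1 - p_win) * lose_by
print(expected_score)   # 0.55 > 0.5, so "expected score" ranks the gamble
                        # above the certain half-point win
```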
This should be reconsidered, as it has been shown to speed up early training (https://arxiv.org/pdf/1902.10565.pdf), which I think mostly removes the point that it would weaken the AI.
I think it would be better to replace "winrate" by "points ahead/behind". Maybe it would be possible to create such a neural network from current Leela Zero self-play, and then replace the value network with it, or run both at the same time in some smart way. Otherwise these are ideas for a future run or a similar project.
Advantages
Disadvantages
What do you think about it? Is there something I haven't thought of? Has this been done before?