leela-zero / leela-zero

Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper.
GNU General Public License v3.0

Should some self-play be without T=1 randomness and noise… for now? #1403

Closed · Mardak closed this 5 years ago

Mardak commented 6 years ago

In doing the analysis of a joseki that 192x15 is learning from ELF (https://github.com/gcp/leela-zero/issues/1392#issuecomment-388164961), I noticed that 192x15 is indeed learning and could continue learning this particular joseki without any more ELF self-play (i.e., self-play with 3200 visits in every symmetry generates training data that moves the prior closer to ELF's ~100% prior for the analyzed move).

However, the value head of the current 192x15 puts the win rate for the move at around 45%, whereas ELF says it's around 60%. Similarly, looking at the progression of recent 192x15 networks, while the priors are increasing steadily, the value isn't increasing as fast: https://github.com/gcp/leela-zero/issues/1392#issuecomment-388167586

One guess is that with the randomness in self-play, the eventual winner is only very loosely tied to the current board state. However, if some self-play were done without additional randomness, the training data might be more accurate for improving the value head.

Clearly this is not strictly necessary, as AGZ, ELF and LZ have all learned a more appropriate value… eventually. This is probably because even with additional randomness, the "median" play should be close to play with no additional randomness.

Related, I believe self-play without additional randomness would help generate correct play training data for all symmetries to learn from.

Basically, at a high level, there's a lot for 192x15 to learn from ELF, but maybe it should also consolidate what it has already started to learn instead of only trying to learn even more at the same time.

Ishinoshita commented 6 years ago

Shouldn't selfplay games with no randomness at all (t~0, no noise, all symmetries) be very repetitive, if not identical but for floating point calculation errors? In the short term, a proportion of such selfplay games might help LZ improve her strength along her favorite line of play, but this line of play may fade and totally disappear in the course of the overall learning process, so it might be a waste of time to focus on it in the short term. This would also go in the completely opposite direction of what we are doing now to improve selfplay game diversity (fixing the Dirichlet bug, t=1 all along the game). Randomness increases game outcome variance and makes the value head learn the value of a given position more slowly, but the expectation is that it will learn the value of more diverse positions faster. Also, possibly, LZ's preferred line of play would not lead to the joseki you have identified (but maybe you have checked).

Mardak commented 6 years ago

Shouldn't selfplay games with no randomness at all (t~0, no noise, all symmetries) be very repetitive

That's why I only said to remove the added randomness of T=1 and noise. Symmetries introduce randomness for games in self-play and matches, as every move picks a random symmetry out of 8 options. That's why even though match games are played without T=1 and noise, they are still different from each other.

The server gives a 64-bit random_seed for every task, which influences the randomness of games. The first move of the game has 8 possible outcomes given the symmetries and assuming a 4-3 play. The second move also has 8 possible outcomes, so 64 different games already. The third move multiplies that by another 8, so 512. Just looking at the first 10 moves means there are 8^10 = 1,073,741,824 possible outcomes. Even if we assume all games resign at move 91, 8^91 is about 1.52e82 possible games (although given the "limitation" of 64 bits of randomness… that is capped at 2^64, which is only about 1.84e19 games).
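To sanity-check those numbers, here's a rough sketch. The assumption that every symmetry choice leads to a distinct game is an oversimplification, but it illustrates the scale:

```python
# Rough sketch of the game-diversity arithmetic above. For illustration only:
# assume each move independently picks one of the 8 board symmetries and that
# every distinct sequence of choices yields a distinct game.
SYMMETRIES = 8

games_first_10_moves = SYMMETRIES ** 10   # possible games after 10 moves
games_resign_at_91 = SYMMETRIES ** 91     # possible games if all resign at move 91
seed_space = 2 ** 64                      # games reachable from a 64-bit random_seed

print(f"{games_first_10_moves:,}")        # 1,073,741,824
print(f"{games_resign_at_91:.3e}")        # ~1.518e+82
print(f"{seed_space:.3e}")                # ~1.845e+19
```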

jkiliani commented 6 years ago

When you look at match games, you'll notice that they are actually a lot more similar to each other than training games. While symmetries do add randomness, the joseki played in match games tend to be quite repetitive, for instance. This change goes directly counter to T=1, which was picked for a reason: we don't really want to end up with another ELF, which has such a stratified value head that it's useless for handicap play. I posted the reasoning for this in https://github.com/gcp/leela-zero/issues/1311#issuecomment-386546737:

Intuitively, anyone would think that giving the "right" value training target will have better results in the end, compared to one where chance plays a part, but there are two effects from temperature that should compensate for this (and may even overcompensate, judging by AZ vs AGZ20):

  • The value head learns to differentiate between won positions that can still be screwed up, and those that cannot reasonably be overturned. With temp=0, an advantage means it's won. Knowing the distinction should lead to the net playing for a safety margin when winning, which may be helpful in case something is overlooked and is also closer to human play.
  • The diversity of positions encountered in training will be measurably greater, which should help the net considerably against other opponents which will not play moves that the policy priors currently think are best.
Mardak commented 6 years ago

Also, it's unclear what number or percentage of self-play should be no-additional-randomness. AGZ used 10% with no-resign to reduce false positives, i.e., cases where the winning side incorrectly resigns.

The main premise for removing the added randomness is for more accurate / faster training of the value head.

Would even 1% of self-play achieve that?

This change goes directly counter to T=1

Yes, and even further away, as the proposal is to remove T=1 for the first 30 moves too. But that's also why I said "some" self-play and "for now". Although, as above, it's unclear what limits make sense for either of those attributes.

Perhaps one way to think about it is as something similar to adjusting the learning rate?

Mardak commented 6 years ago

The value head learns to differentiate between won positions that can still be screwed up, and those that cannot reasonably be overturned.

I think this is quite insightful and important. And related to my initial "median play" comment, even with added randomness, the value head will figure things out with enough games.

Perhaps getting off topic and probably(?)/definitely(?) different from the current AGZ network architecture, it would be interesting if the value head provided a confidence interval to differentiate exactly what you point out -- a for-sure winning position (at least based on what it's been trained on) vs a risky potentially winning position.

Edit: Although thinking about it a bit more, I suppose a truly safe winning position would have value near 100% win rate, while lower confidence would be lower than that. I guess the value head kinda already encompasses the uncertainty?

dzhurak commented 6 years ago

Ok, this is just ridiculous: http://zero.sjeng.org/view/988d92e81aa44fa097f8db0fcb4f3ca95b0b7dc01810df5a8a27f629e50917ce?viewer=wgo Black resigned in a position that was, if not winning, at least equal. I don't like this much randomness in self-play. Does random mean a completely random move, or should it just be random among, say, the top 10 moves?

PhilipFRipper commented 6 years ago

@dzhurak You misunderstand how this works. It's not "just ridiculous" but a useful part of training.

dzhurak commented 6 years ago

Ok, random moves are useful. I can understand that. They add diversity. But to resign in that position is ridiculous. Black is fine there.

PhilipFRipper commented 6 years ago

Passing is a move. Passing can be better than many moves. No matter what the position, passing is an option. It's the only move that is always possible. That means it comes up a lot, and it's rarely the worst possible move (self-atari is worse, etc). With any amount of noise, it will thus happen sometimes. It in no way makes the network dumber. Only be concerned if it comes up in match games.

PhilipFRipper commented 6 years ago

(it won't come up in match games just because it happens in self play, that's not how this works)

gcp commented 6 years ago

Given that this was a 0.15 client, note that passing had at least 2 visits, i.e. it didn't yet know it was 100% losing. It'll know better next time.

gcp commented 6 years ago

As for the original question, we specifically did T=1 in order to try to get away a bit from ELF, and yes, it may mean slower progress on the pure strength side. That's a tradeoff.

Mardak commented 6 years ago

For the trade-off, I'm just suggesting that it doesn't need to be either extreme of no-additional-randomness (solidify known learning without learning unknown stuff) or T=1 for all moves (slow to learn what's actually stronger). If there's actual value in each behavior, it would seem that doing a mix of the two should be better, although that just opens up another decision of what the right mix is.

Mardak commented 6 years ago

@Eddh had a good point that doing some games with T=0 should be pretty similar to just setting a T between 0 and 1. This got me wondering how often the most visited move gets picked anyway even with T=1. Below is a graph of varying visit counts with the latest network of each size: 128x6, 128x10, 192x15, 224x20, with an extra dotted line for 192x15 with T=1 but allowing 1-visit moves to be selected.
[chart-1: fraction of T=1 move selections that land on the most visited move, across visit counts, for each network]

So on average with the current 192x15 network and 3200 visits, 70.7% of moves selected randomly with T=1 would be the most visited move, which would have been picked with T=0 anyway. These numbers came from running 3 "auto" games for each network and visit count, with noise turned on. I also analyzed the most recent Haylee game, both the moves LZ played and the moves it pondered (where noise is not turned on, so FPU reduction is in effect), and T=1 would have picked the top move around 68.4% of the time.

Looking at various T values for 192x15 with 3200 and 1000 visits, here's how often the most visited move would be picked:

| T    | 3200 visits | 1000 visits |
|------|-------------|-------------|
| 0.1  | 96.8%       | 96.5%       |
| 0.2  | 94.3%       | 93.4%       |
| 0.5  | 87.1%       | 81.4%       |
| 0.8  | 78.6%       | 72.8%       |
| 1.0  | 70.7%       | 63.7%       |
| 1.25 | 59.1%       | 52.8%       |
| 2.0  | 32.7%       | 31.6%       |
| 5.0  | 10.4%       | 13.1%       |
| 10   | 6.4%        | 9.1%        |
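
For reference, the selection rule behind these percentages is just sampling a move with probability proportional to its visit count raised to 1/T. Here's a minimal sketch; the visit distributions in it are invented purely for illustration, whereas the percentages in the table above came from real search trees:

```python
# Minimal sketch of temperature-based move selection during self-play:
# a move is sampled with probability proportional to visits ** (1 / T).
# The visit counts below are invented purely for illustration; the table
# above was measured from actual 192x15 searches.

def top_move_probability(visit_counts, temperature):
    """Chance that temperature sampling picks the most-visited move."""
    weights = [v ** (1.0 / temperature) for v in visit_counts]
    return max(weights) / sum(weights)

# Hypothetical root visit distributions: a 3200-visit search usually piles
# more of its visits onto the top move than an 800-visit search does.
visits_3200 = [2200, 500, 300, 120, 80]
visits_800 = [450, 180, 100, 45, 25]

for t in (0.1, 0.5, 1.0, 2.0):
    print(f"T={t}: top move {top_move_probability(visits_3200, t):.1%} (3200) "
          f"vs {top_move_probability(visits_800, t):.1%} (800)")
```

With the flatter (made-up) 800-visit distribution, the same T lands on the top move less often, which is the effect behind the next point.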

One interesting observation is that AGZ used 800 visits, as brought up recently in #1416, but our current self-play uses 3200 visits, which generally results in search putting more of its visits into the top move. Going with 800 visits instead would allow T=1 to pick a not-most-visited move more often, leading to more randomness.

jkiliani commented 6 years ago

That's a very good point in favor of both dropping visits and also enabling time management for self-play, since only moves where time management doesn't trigger anyway will help game variety. The only real drawback I can think of is that time management will give away the gains in training data compression we got from picking round numbers plus one for visits.

Mardak commented 5 years ago

Looks like T=1 for the whole game is no more, so closing this.