tranhungnghiep commented 6 years ago

Having followed Leela project for some time, I noticed some problems in training strategy that may affect its training progress and general strength. Here I present the main problem and proposed solution. This problem is originated from AlphaZero algorithm, so if the fix worked, it would become an advantageous factor of Leela compared to AlphaZero (or at least I hope so).

Problem

Limitation of the current training strategy

As I understand, in self-play both players use noise (temperature and Dirichlet prior) to explore new moves. A new move could be either good move or blunder, currently this is determined by the outcome of the generated game. The problem is that, when both players use noise, a win could be resulted from either the opponent's blunders or one's own good moves. Thus there are more false postive signals in training data. Moreover, for a good move to work, it needs to be part of a consistent maneuver. With random blunders, there would rarely be consistent maneuvers, thus there are less true postive signals in training data.

Potential consequences and examples

Tactical weakness and blunder: essentially if Leela expects opponent to make blunders, it will not play the best tactical moves and more easily fall for tactical trap, or making blunders in good position. One example of this is the promotion bug in v0.6, Leela's strength reduced not because of black's bad promotion (evidence is bug-fixed code did not help). But because white expects black to make promotion blunder, thus it does not play the best tactical moves.
"Valley of ignorance": I use this phrase to refer to the case when both opponents are not very strong, then each of them may make moves that look strong and the other cannot dispute, but actually disputable by a stronger player. This happens when both players make blunders.
AlphaZero's long endgame: not very evidential here, but maybe these very long endgames were resulted from the lack of consistent maneuvers in training data.
Analogy to human self-playing game: from my experience, when human self-play chess, we usually play two different roles, exploring new ideas in one side and trying the best of current ideas to counterattack in the other sides.

Proposed solutions

Step 1: Consistent game generation strategy

Analogy to human self-playing game, the idea here is to use noise in only one player to explore new moves against a fixed strength, consistent opponent. For example, we could use Leela without noise as the fixed strength opponent (denoted L+N vs. L). In this case, a win or a loss is mainly resulted from the move of one side, thus it reduces false signal in generated games. More generally, we could use different noises for the two players (denoted L+N1 vs. L+N2), where their noises are tuned hyperparameters.

Specifically, we need to use different noises for the two self-playing players. Using more noise in one player (and recording its win to explore new good moves), while using less or no noise in the other player (and recording its loss to avoid current bad moves).
Step 2: More informative training data sample

Notice that in the case (L+N vs. L) above, there are roughly 4 types of moves: (M1) a noisy move in a winning game; (M2) a noisy move in a losing game; (M3) a fixed move in a winning game; (M4) a fixed move in a losing game. These moves carry different information. Leela should mainly care to learn to play (M1) which are new good moves and avoid (M4) which are current bad moves. The idea here is to use appropriate distribution in sampling moves from generated games, with more emphasis on informative moves. This should be a tuned hyperparameter.
Specifically, we need to put more weight on moves that carry more useful information, that are new good moves that lead to a win (M1) and current bad moves that lead to a loss (M4).

Conclusion

Here I described a problem with the current training strategy of Leela (and AlphaZero), followed with a two-steps solution. The main idea is to be more selective in game generation and training, with explicit strategy to explore and improve against a good opponent, essentially to let one player explore new good moves, while one player avoid current bad moves. This may make training progress faster and improve general strength of Leela.

Future extension

Generally, assuming we have an oracle that outputs very strong tactical moves for every position, but still has a tactical horizon. An example is a Stockfish engine with deep tactical search. Using this oracle as the fixed strength opponent, we can explicitly train Leela to see beyond the tactical horizon of this oracle and potentially beat very deep tactical search. Note that against a very strong opponent, good move will be very serendipitous, thus it may only be practical to continue training an already strong Leela. Note that training by self-play against Stockfish is not the same as imitating Stockfish moves, because we can control training data sample, it would be learning to beat deep tactical search.

Remarks

This is a theoretical idea, experiments are needed to compare it with current training strategy, especially at scale.
This is a deviate from Google's algorithm, thus not an exact replication of AlphaZero, but I assume this is expected.
Constructive discussions are very much welcome. Reinforcement learning is not my major expertise, so feel free to correct me (or defend the idea if you agree :) ).

jjoshua2 commented 6 years ago

I agree, but I think it's currently outside the scope of the project. What I would propose is try training your own net based on leela's self play games, but using SF/TB to adjudicate after a blunder to improve the value head. And not sample many times out of the crazy endgames. Basically emulating resign

hsntgm commented 6 years ago

@tranhungnghiep with current algorithm leela already can not gain strength much.When i look latest networks she stuck at some point.After community decided to pass 192x15 network it is nearly impossible the produce game with mid-range consumer gpu.I get nearly half nps while self playing.They do that because of gain strength but it seems it does not work.

And without google gpu farm it seems self training not carry leela'strength to stockfish level.In my opinion the one way is supervised training with very well optimised game data with ply depth and win lost draw ratio %33..But this way makes her copy of other engines without own style.

They do great work here.But needed agorithm has not yet been discovered.

tranhungnghiep commented 6 years ago

@jjoshua2 Thanks for your comment.

I guess you are thinking about some flavors of supervised learning with SF/TB, but here I want to say about self-play training. It is essential to change self-play strategy and training data accordingly, this could not be done as post-processing of current self-play games, but needs new self-play games from the community.

I know this is a deviate from AlphaZero's algorithm, but I assume that, if it were able to improve Leela's strength, the community should pick it up. Unless the goal of this project is to exactly replicate AlphaZero. Even if this were the case, I still think the modification does not break anything but only helps.

jjoshua2 commented 6 years ago

I'm more open to elo gainers that deviate from AZ than most are. Most people are very into the zero aspect. It's hard to show things are elo gainers though as you say, if they need self play games and the community doesn't want to try it until it's shown to gain elo. That's why I suggest you show it gains some with post processing first.

tranhungnghiep commented 6 years ago

@hsntgm Thanks for you comment.

There's a point I would like to clearly point out, that the proposed strategy side steps supervised learning, thus training with self-playing between (Leela+Noise vs. Stockfish) would not be imitating Stockfish if we do not sample Stockfish's moves.

tranhungnghiep commented 6 years ago

@jjoshua2 I see. That would be kind of troublesome, I don't have enough resource at the moment, maybe not in a few months.

Anyway, I would like to invite the everyone to justify this approach theoretically here, modify/improve and make it more failure-proof. Then maybe if it is convincing enough we could switch to it.

jkiliani commented 6 years ago

It sounds as if you suggest less exploration by Dirichlet noise, and less use of temperature in training. Neither of these are good ideas in my opinion:

Noise is essential to discover good moves not currently known by the policy. Noise also only has a very small negative impact on strength, since it affects only the root node. By deactivating it, you'd prevent Leela from discovering new tactics when she reaches a position to do so in training games.
Training game blunders are caused by temperature, i.e. selecting root moves proportionally to visit count instead of greedily (i.e. only playing the strongest move). Temperature is essential to training for two reasons: -- It generated variety in the positions encountered, which helps the policy priors a lot by reducing the occurrence of positions unknown to Leela (which are often handled badly by a neural net). -- Also, paradoxically, even games overturned by blunders can actually help since they keep the value head from becoming too stratified. If every training game has the theoretically correct outcome, the neural net will eventually think positions with a slight advantage are certain wins and those with a slight disadvantage are hopeless. It was for just this reason that Leela Zero (Go) just switched from using temperature only in the beginning to using it all game, just like Leela Chess.

tranhungnghiep commented 6 years ago

@jkiliani I see your points, I know noise is essential in self-play games. But actually I suggest use more noises in one player and less/no noise in the other player. The main idea is to let one player explore new good moves, while one player avoid current bad moves.

jkiliani commented 6 years ago

While I wouldn't totally exclude this, I see a problem here in that by having asymmetric noise, you make the training data of one side more valuable than that of the other. Noise actually only rarely results in a move unknown to the policy head being played. Most of the time, if noise discovers a good move, it gets a number of visits but considerably less than the moves with high priors, and just enters the training data so the next generation network is actually aware of this move in the first place. Reducing the noise on one side just makes these kinds of discoveries less likely...

tranhungnghiep commented 6 years ago

@jkiliani I think that is a valid point, thus we can counter that with using more noise in one player and tune distribution to sample moves. There is a tradeoff here. But at least the proposed approach provides new solution to address some problems.

tranhungnghiep commented 6 years ago

Maybe we need to reach a consensus on recognizing the problem first.

What do you think @glinscott @gcp @Error323? Could you please lead the discussion to the right direction?

gcp commented 6 years ago

I think one can definitely see some (misevaluation) in what the program has learned because it relies a lot on the opponent blundering. Whether this is actually a problem that will be learned out or not, I don't know. The AZ papers suggests it will.

That said, we analyzed that the interaction between some improvements is causing the blunder rate to be higher than intended: https://github.com/gcp/leela-zero/issues/1355 I will likely "fix" this for Leela Zero.

I think it's hard enough to replicate the A(GZ) result fully and understand even how even the most logical improvements interact.

One can easily think of a million improvements or changes to the learning procedure. Producing data showing that they are an improvement is the problem. Without data that it's at least as good, I don't particularly see why one should care. It's just all talk. Unfortunately most people prefer to talk instead of producing data, because the latter is harder.

tranhungnghiep commented 6 years ago

@gcp I don't really understand your points, and I think you don't really understand my original post.

The issue you link is quite a different problem, actually that seems not a real "problem".

I agree that there are millions of worthless changes that could be easily falsified at the first look. But much less worthwhile changes that could be looked at again. I may be biased, but I think every research idea starts with thinking, not tinkering.

Now don't get me wrong, I don't put this idea up for fun. I put it up because it has the potential to help the community, cut training time, save contributors' cost and improve Leela's strength. I will come back to implement and test this when I have a chance, but I put it up early so that someone may take a part and help the community earlier. What is the purpose of a community project anyway?

Replicating AZ is difficult, why not make it easier?

jkiliani commented 6 years ago

@gcp's point is exactly that you can't try every idea anyone has in a distributed setup. Naturally, everyone thinks their own ideas are the best, but in reality there are a whole lot of ideas that wouldn't work and only a couple that would provide a measurable gain. The people who run such a distributed setup therefore ask for proof or at least strong indications that an idea is feasible and would have a positive effect before actually trying it. Anything else leads to chaos.

I can understand that you feel your idea is good, but believe me, I proposed a lot of ideas to @gcp as well that fell through due to insufficient evidence, and a couple he accepted after I verified them to the best of my capabilities. So my advise here is the following:

Make your suggestions concrete. At the moment I still have trouble finding your actual suggestions within a lot of text that sounds nice but is not actually specific
Think of the best way you can gather data that at least strongly supports that these suggestions actually improve things, get as much such data as is feasible, and then present it in a convincing manner.

While this still would not guarantee that your suggestions are acted upon, it would considerably improve the chances that they would.

tranhungnghiep commented 6 years ago

@jkiliani Thanks for your constructive comment. I edited the issue to make it more clear. As for supporting data, I will try when I have a chance if no one try yet. When I have a chance to try it, I may as well do full experiments and publish a paper.

Anyone is welcome to try my proposal here. If you don't hurry, I will get to it before you ;)

Let's see how we can work it out together. Please put your question/discussion/result here to keep track.

gcp commented 6 years ago

@jkiliani said it better (and nicer) than I could have. The problem is not coming up with ideas ("starts with thinking"), the issues here are full of them. The problem is people willing to implement and test them. So reading

I will come back to implement and test this when I have a chance

is pretty awesome and good luck!

tranhungnghiep commented 6 years ago

Well with luck, it may reduce required self-play to 4.4 million games and cut training time to 1/10, get 100 Elo higher than AlphaZero with a clean style, especially in endgame.

tranhungnghiep commented 6 years ago

Of course my comment above is an exaggeration, that much improvement would require "great luck".

Also, when I said some issues are not "real problem", I didn't mean they are unimportant. They are important but just about hyper-parameter tuning. When Leela training continues to suffer, a novel view and solution may help.

glinscott / leela-chess

[Idea] Improve training strategy: reduce false signal in game generation and training #551

Problem

Limitation of the current training strategy

Potential consequences and examples

Proposed solutions

Step 1: Consistent game generation strategy

Step 2: More informative training data sample

Conclusion

Future extension

Remarks