lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Some ideas about finding the best step in training #675

Open Chenvincentkevin opened 2 years ago

Chenvincentkevin commented 2 years ago

Since many games will be played during training, I think it's reasonable to try more different moves, such as:

  1. the move that gains the most points
  2. the move with the highest winrate
  3. moves that seem less reasonable (surprising oversights and the like; this may require large playouts)

or of course a move that satisfies both 1 and 2, as in the following case (the 28th move):

sgf: (;DT[2022-08-18]KM[7.5]PB[]RE[]DZ[G]PW[]SZ[19]CA[UTF-8]AP[Lizzie: yzy2.5.1];B[pd];W[dd];B[pp];W[dp];B[cc];W[dc];B[cd];W[de];B[bf];W[qq];B[qp];W[pq];B[nq];W[nr];B[mr];W[oq];B[np];W[op];B[oo];W[po];B[ro];W[pn];B[on];W[om];B[no];W[rq];B[pm];W[rp](;B[qm];W[qo](;B[pk];W[ml](;B[nm];W[qf];B[qe];W[pf];B[nd];W[ol];B[ni];W[oj];B[ok];W[nl];B[nj];W[rj];B[qk];W[pi];B[kj];W[kl];B[kn];W[jn];B[jo];W[in](;B[km];W[kk];B[jj];W[ik];B[io];W[hn];B[ho];W[ko](;B[lo](;W[go];B[il](;W[jk];B[jm];W[hl](;B[gp];W[gn];B[fp];W[me](;B[ne](;W[nf];B[mf](;W[mg];B[lf];W[lg](;B[ng];W[of];B[qi](;W[kf];B[le];W[qj](;B[pj];W[qh](;B[ii];W[kh];B[lj](;W[jd];B[ld](;W[oc];B[pc];W[oi];B[nk](;W[ig];B[gi](;W[gk];B[do](;W[eo];B[ep](;W[co];B[dn](;W[dq];B[cn](;W[mk];B[mj];W[kp];B[iq];W[bo](;B[fo];W[ek];B[em](;W[fn];B[en])(;W[cf]))(;B[ie]))(;W[kp])(;W[bo]))(;W[mk])(;W[kp])(;W[cp]))(;W[kp]))(;W[mk]))(;W[fj]))(;W[kp]))(;W[kp])(;W[oi]))(;W[md]))(;B[ph]))(;B[ph]))(;W[qj]))(;B[kf]))(;W[qi]))(;W[kp]))(;B[md]))(;B[hm]))(;W[jm]))(;W[kp]))(;B[kp]))(;B[io]))(;B[fq]))(;B[ol]))(;B[ol]))

Video link: https://www.bilibili.com/video/BV11N4y1T781

It's a game against the 60B network, played with 8 RTX 3090 cards for B and 4 for W (not the complete game), which is a lot of compute power. The engine thinks R7 is the best move. But in my opinion, since P8's winrate and estimated score both end up higher than R7's under large compute, P8 would turn out to be the best move if we waited long enough. In a real game, though, we don't have enough time to analyse which move is truly the best one. That is why I think trying different moves is crucial.

And as I mentioned in issue #670, by estimating the score lead of all the resigned rating and training games played by 30B+ networks, we can find that if B/W leads by more than 30 points (maybe a lower threshold), there was a blunder by one side or the other. We can also take games with "result: B/W+(a number more than 10)" into consideration at the same time. Training on these situations where the blunder happens, by playing DIFFERENT moves, will possibly help find out the problem. That's all; looking forward to your reply! @lightvector

Chenvincentkevin commented 2 years ago

(Screenshots attached.)

Chenvincentkevin commented 2 years ago

(Three more screenshots attached.)

Chenvincentkevin commented 2 years ago

And what if we change the training config file? Would that be worth trying?

michito744 commented 2 years ago

@Chenvincentkevin

A good example.

Often in AI, a strong alternative candidate is given a very low weighting and is not fully explored. In some cases the alternative candidate may even be by far the best move, which is why an AI's evaluation values are unreliable in critical situations.

lightvector commented 2 years ago

@Chenvincentkevin - thanks for the analysis of that position.

> It's a game against the 60B network, played with 8 RTX 3090 cards for B and 4 for W (not the complete game), which is a lot of compute power. The engine thinks R7 is the best move. But in my opinion, since P8's winrate and estimated score both end up higher than R7's under large compute, P8 would turn out to be the best move if we waited long enough. In a real game, though, we don't have enough time to analyse which move is truly the best one. That is why I think trying different moves is crucial.

KataGo already tries a lot of different moves during training. Try downloading a training game and analyzing it; you should find that, multiple times per game, KataGo does try a different move than the one your analysis believes is the best move.

Are you proposing to try even more different moves per game than the training already does? Almost all of the time, if you try a new move, it will be bad, because almost all moves are bad. And if you try too many different moves during a game, most of which will be bad moves, you introduce a lot of noise - it becomes very hard to tell what the cause of winning or losing was amid all those different moves.

> And as I mentioned in issue https://github.com/lightvector/KataGo/issues/670, by estimating the score lead of all the resigned rating and training games played by 30B+ networks, we can find that if B/W leads by more than 30 points (maybe a lower threshold), there was a blunder by one side or the other. We can also take games with "result: B/W+(a number more than 10)" into consideration at the same time. Training on these situations where the blunder happens, by playing DIFFERENT moves, will possibly help find out the problem. That's all; looking forward to your reply! @lightvector

I'm not sure how to do this. Very often the time at which the winrate or score changes is not the time that the mistake happened. The bot realizes the mistake at this point but actually the problem was earlier. A smart human player with the assistance of AI might be able to manually do a lot of analysis and exploration of the variations to figure out what the "real" mistake was, but I don't know how to write computer code that does this.

If you could demonstrate a strict formula or criterion or procedure that involves zero human judgment (i.e. one that could be implemented by a computer program), and show on a random sample of 5-10 rating games that it has a good chance of suggesting an alternative variation that would "help" teach the net what the mistake was (instead of itself being a mistake, or simply resulting in an entirely different game that has no relationship to the original game), and that isn't too costly (running millions of visits is too expensive), then I would certainly consider implementing it.

Do you know how to program? If you do, you could even try writing such code yourself, such as a script in Python that uses KataGo's analysis engine (https://github.com/lightvector/KataGo/blob/master/docs/Analysis_Engine.md) to demonstrate your method for identifying which position and alternative move should be given additional training. If you can come up with something better than what already exists, that would be useful not just for KataGo, but possibly for other users too.
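For reference, here is a minimal sketch (not lightvector's method, just a starting point) of what such a script could look like, driving the analysis engine over JSON on stdin/stdout. The katago command, config and model paths, the move list, and the visit budget are placeholders to adapt; it only prints the per-turn root winrate and score lead:

```python
import json
import subprocess

# Placeholder command: point these at your own KataGo binary, analysis config, and model.
KATAGO_CMD = ["katago", "analysis", "-config", "analysis.cfg", "-model", "model.bin.gz"]

# Placeholder game record: replace with the moves of the game you want to check.
moves = [["B", "Q16"], ["W", "D16"], ["B", "Q4"], ["W", "D4"], ["B", "C17"]]

query = {
    "id": "game1",
    "moves": moves,
    "rules": "chinese",
    "komi": 7.5,
    "boardXSize": 19,
    "boardYSize": 19,
    "analyzeTurns": list(range(len(moves) + 1)),  # analyze the position after every move
    "maxVisits": 1000,  # small fixed budget, as discussed in this thread
}

proc = subprocess.Popen(KATAGO_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
proc.stdin.write(json.dumps(query) + "\n")
proc.stdin.close()  # on EOF the engine finishes the queued analyses and exits

per_turn = {}
for line in proc.stdout:
    response = json.loads(line)
    if "error" in response:
        raise RuntimeError(response)
    per_turn[response["turnNumber"]] = response["rootInfo"]

# Note: whether winrates are reported for the side to move or for a fixed color
# depends on reportAnalysisWinratesAs in the analysis config.
for turn in sorted(per_turn):
    info = per_turn[turn]
    print(f"turn {turn:3d}  winrate {info['winrate']:.3f}  scoreLead {info['scoreLead']:+.1f}")
```

Large swings between consecutive turns in that output would be the raw material for any blunder-finding rule like the ones discussed in this thread.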

Chenvincentkevin commented 2 years ago

To be honest, I don't know how to write Python code yet, but I read the page you posted. From my humble perspective, it's the differences between moves that make the net improve.

And here are some possible places for adjustment in a normal training game, if I understand how it works correctly (sorry if what's below wastes your time! I don't know precisely how it all works):

  1. rootPolicyTemperature set to an appropriate value
  2. rootFpuReductionMax = 0
  3. includeOwnershipStdev (needs more consideration)
  4. allowMoves (needs more consideration)
  5. (config file) when doing self-play, net A configured with playoutDoublingAdvantage 3.0 while net B uses 0, plus the various other settings michito744 suggested before.

And when net B is playing a normal training game, it would follow the steps below:

  1. figure out the number of moves worth trying (precisely: moves whose estimated winrate/score is very close to the best move's but which received 50% less search, plus the highest-score/winrate moves, and of course the best move itself)
  2. play the best move as normal
  3. for the others, store tasks on the training server containing the game with that alternative move already played (possibly to be trained on another account)

Q1: Will the above process help the net have a better understanding of the values of different moves?

The second idea is to find out how the blunder happened. First, find the situations: games where the bot ignores resignation and the score lead is more than 10, or resigned games where the score lead is more than 30 (some appropriate threshold). Second, find the point where the winrate/score crashed significantly (by more than some appropriate percentage relative to the earlier estimate). Third, train on the situation two moves before the "crash" (the first move is still played by the player who crashed), in an effort to help the net realize the problem one step earlier; it MUST avoid the previously played move (avoidMoves). The situation in this third step should be trained several times (if it still crashes, finish training that game and avoid the move again). Fourth, repeat the third step (going two moves further back each time) until the player can avoid the crash (winning more than 50% of the cases, not over all games but over a small sample of them, maybe you can understand that!).
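As one illustration of turning this into an exact, computer-checkable rule, here is a minimal sketch over per-turn score leads. The 10-point threshold, the two-move offset, and the example numbers are hypothetical placeholders taken loosely from the description above, not tested values:

```python
from typing import List, Optional

def find_branch_turn(score_leads: List[float],
                     crash_drop: float = 10.0,
                     go_back: int = 2) -> Optional[int]:
    """Return the turn to branch the game from, or None if no 'crash' is found.

    score_leads[t] is an estimated score lead after turn t (hypothetical input,
    e.g. computed by analyzing the whole game at a fixed visit budget). A 'crash'
    is simplified here to any swing larger than crash_drop between consecutive
    turns; the branch point is go_back moves before it, as described above.
    """
    for t in range(1, len(score_leads)):
        if abs(score_leads[t] - score_leads[t - 1]) > crash_drop:
            return max(0, t - go_back)
    return None

# Made-up example: a sudden 20-point swing at turn 3, so we branch from turn 1.
print(find_branch_turn([5.0, 5.5, 6.0, -14.0, -15.0]))  # -> 1
```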

It will definitely use a lot of resources, so it might not be worth it. Q2: But are there any more ideas to improve it or new ideas?

There must be a lot of mistakes in my comment. Please point them out; I'd be delighted to hear them.

michito744 commented 2 years ago

@lightvector

When KataGo loses badly, it usually breaks down from the point where the opponent plays moves that KataGo takes very lightly.

That is the difference between KataGo and the top AI, no matter how much reinforcement learning is done.

lightvector commented 2 years ago

> no matter how much reinforcement learning is done.

I think I recall you expressing similar sentiments in the past, but if so, you might be surprised! :)

In my experience working with game-playing AI and with deep neural nets, once you have any algorithm that scales well with compute power in a game...

...then even when you make major improvements, typically the improvements are equivalent to moderate constant-factor differences in total compute invested (including things like increases in network sizes). That is, even without some particular improvement, you would still eventually reach the same overall strength; you just might need e.g. 1.5x or 2x more total compute power (depending on what that improvement is).

So, improvements can have a large impact, but usually never so large that you can say "no matter how much reinforcement learning is done". An impact can appear large, but still ultimately be equivalent to only some not-far-from-linear scaling of the training, such that more training could actually still make up for it.

One other counterintuitive thing about deep learning is that generally one cannot simply observe a difference between methods that seems large and conclude that the difference is "too large" to be produced by simply doing more training. For example, if you were to watch KataGo today playing a sufficiently older version of itself, such as from the runs for KataGo's earlier publications, even though those versions themselves were already superhuman, you might see today's KataGo crush them to such a degree that it would seem insurmountable. The current KataGo may rapidly find certain good moves that the old one would not realize to be good even with hundreds of millions of playouts... yet the difference between KataGo and its older versions is only the training time, plus the addition of several improvements that together made that search and training merely, e.g., 2x-4x more efficient in total.

Outside of Go, you can also see more examples of the same general phenomenon. For example, with large language models on some difficult verbal or logical reasoning tasks, the model might score close to 0% on some benchmark, and as you make the model bigger and train more, it keeps scoring 0%, 0%, 0%... except if you keep training even longer, even without changing the algorithm at all, suddenly it understands the task and starts scoring well on it, e.g. 30-40%. There can be a point where the model abruptly gains a new understanding of some specific task or move or shape that it entirely lacked before, due only to more training, without any change in algorithms.

Of course, even a "mere" 2x training algorithm improvement is huge - if you train for a year, then a 2x improvement means that, compared to reaching the same level without it in two years, you save an entire year. If you ever happen to learn of technical details on how to get another 2x improvement that you'd be willing and able to share, please feel free. :)

lightvector commented 2 years ago

> And here are some possible places for adjustment in a normal training game, if I understand how it works correctly (sorry if what's below wastes your time! I don't know precisely how it all works):

>   1. rootPolicyTemperature set to an appropriate value
>   2. rootFpuReductionMax = 0
>   3. includeOwnershipStdev (needs more consideration)
>   4. allowMoves (needs more consideration)
>   5. (config file) when doing self-play, net A configured with playoutDoublingAdvantage 3.0 while net B uses 0, plus the various other settings michito744 suggested before.

KataGo already does 1, 2, and 5. It uses a root policy temperature of 1.4 to 1.1 to regularize the policy learning, and also sets the root fpu to 0 so as to allow more exploration and avoid biasing against policy tails as much. And KataGo already plays some games with playout doubling advantage! Training with "playout doubling advantage" is what caused KataGo to understand what it means in the first place. If KataGo didn't train on it, this option wouldn't even exist.

For 3 and 4, I either don't understand what the suggestion is yet, or think that the suggestion doesn't say how to solve the underlying challenge. For example to allow or disallow only certain moves to try to force more exploration, one also needs to say how to stop that exploration from biasing the policy and values negatively or wasting compute. (It's easy to focus narrowly on an example of a missed good move and say "we should explore more moves". But most moves are bad! If you naively try to explore a lot more you can easily just waste time and play worse due to all the additional bad moves you consider).

>   1. figure out the number of moves worth trying (precisely: moves whose estimated winrate/score is very close to the best move's but which received 50% less search, plus the highest-score/winrate moves, and of course the best move itself)
>   2. play the best move as normal
>   3. for the others, store tasks on the training server containing the game with that alternative move already played (possibly to be trained on another account)

> Q1: Will the above process help the net have a better understanding of the values of different moves?

> The second idea is to find out how the blunder happened. First, find the situations: games where the bot ignores resignation and the score lead is more than 10, or resigned games where the score lead is more than 30 (some appropriate threshold). Second, find the point where the winrate/score crashed significantly (by more than some appropriate percentage relative to the earlier estimate). Third, train on the situation two moves before the "crash" (the first move is still played by the player who crashed), in an effort to help the net realize the problem one step earlier; it MUST avoid the previously played move (avoidMoves). The situation in this third step should be trained several times (if it still crashes, finish training that game and avoid the move again). Fourth, repeat the third step (going two moves further back each time) until the player can avoid the crash (winning more than 50% of the cases, not over all games but over a small sample of them, maybe you can understand that!).

It sounds like you have some interesting and specific ideas about how to identify good alternative moves for training on blunders. Would you be interested in testing them? Even if you can't program, you could try writing down an exact method (e.g. exactly what winrate or score difference counts as "significant", the exact number of moves you go back, exactly which alternative move you consider instead, etc.), and then go over 10-20 games and manually apply that exact procedure by hand. If the procedure involves re-analyzing some positions to see their new winrates, you could use an analysis program with a fixed limit of 1000 or 2000 visits or whatever, to simulate the computation budget that would actually be available. That way one could see whether the procedure produces a good result on those games. If it does anything wrong, one could revise the procedure and try again on a new, different 10-20 games, and so on. (After making changes, always test on new games not tested yet, to avoid overfitting to old games with a procedure that works only on those games but doesn't work in general.)
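For the "avoid the previously played move and re-analyze with a limited budget" step of such a written-down procedure, the analysis engine's avoidMoves field can express the restriction directly. Here is a minimal sketch of one such query, using R7/P8 from the position earlier in this thread purely as an illustration; the move prefix and the visit budget are placeholders:

```python
import json

# Placeholder prefix: replace with the actual moves of the game up to the branch point.
moves_to_branch_point = [["B", "Q16"], ["W", "D16"], ["B", "Q4"], ["W", "D4"]]

# Re-analyze the branch position while forbidding the move that was actually played
# there (R7 in the example position above) for one turn, so the fixed search budget
# is spent on alternatives such as P8 instead.
branch_query = {
    "id": "branch-example",
    "moves": moves_to_branch_point,
    "rules": "chinese",
    "komi": 7.5,
    "boardXSize": 19,
    "boardYSize": 19,
    "maxVisits": 2000,  # fixed budget, as suggested above
    "avoidMoves": [
        {"player": "B", "moves": ["R7"], "untilDepth": 1},
    ],
}

# Send this line to the analysis engine's stdin, as in the earlier sketch.
print(json.dumps(branch_query))
```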

With the specific kinds of rules you mentioned, I'm worried they might not be effective enough at finding a useful point to branch, or distinguishing blunders from other kinds of situations. For example, what if you go two moves back and that move is in the middle of a forcing sequence where playing any other move would be ridiculous? What if the crash is not due to a blunder at all, it's just that the player gradually fell behind, and only now is it close enough to the end of the game to realize it, and none of the last 50 moves is a major mistake?

michito744 commented 2 years ago

AIs are only good at selecting mixed strategies, and that is where they make the difference in dominating high-level human players. However, if they are dragged into a simple fistfight, the difference in power directly leads to victory or defeat. In terms of power, KataGo is not that strong.