sanderland / katrain

Improve your Baduk skills by training with KataGo!

AIs calibrated to kyu/dan strength with easier-to-understand settings #44

Closed sanderland closed 4 years ago

sanderland commented 4 years ago

Current options are rather mathematical; calibrating settings to a kyu/dan rank, with a slider to set it, would improve usability.

bale-go commented 4 years ago

KaTrain is an amazing piece of software. Having weaker opponents is a major selling point in my opinion. I wanted to try out how the weaker AIs fare against GnuGo 3.8, which is 8 kyu. ScoreLoss uses visits, which makes it slower, and its strength depends on max_visits. I figured it would be better to have a policy-based method in this kyu range. I opted for P:Pick since it seemed to be the most straightforward way to adjust the strength.

I set up a GnuGo AI at level 10 strength (8 kyu) as the opponent. My goal was to find settings where the game is even between the two AIs from the opening through the endgame. With the default P:Pick settings (pick_override=0.95, pick_n=5, pick_frac=0.33) KataGo was still too strong. I ran several games: in the beginning P:Pick always gained a huge advantage, but in the endgame it made obvious blunders that a DDK would clearly spot.
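For reference, the selection logic being tuned here can be sketched roughly like this. This is a simplified stand-in, not KaTrain's actual ai.py: the function name is hypothetical, and uniform candidate sampling is an assumption made to keep the sketch short.

```python
import random

def pick_move(policy_ranked, n_picks, pick_override, rng=random):
    """Simplified P:Pick-style selector (hypothetical name, not KaTrain's ai.py).

    policy_ranked: list of (move, policy_value), sorted best-first.
    If the top policy value exceeds pick_override, play it outright
    (the blunder guard); otherwise sample n_picks candidate indices
    and play the best-ranked one among them.
    """
    if policy_ranked[0][1] > pick_override:
        return policy_ranked[0][0]
    k = min(n_picks, len(policy_ranked))
    picks = rng.sample(range(len(policy_ranked)), k)
    return policy_ranked[min(picks)][0]  # lowest index = best policy rank
```

Fewer picks means weaker play, since the best move is less likely to be among the sampled candidates; sampling all candidates recovers the top policy move.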

GnuGo 3.8 was black, default P:Pick was white, komi = 6.5. (Attached games: BGnuGo-WDefaultPPick, BGnuGo-WDefaultPPick2)

Interestingly, in the beginning a pick_override of 0.8 was enough to get rid of the obvious blunders, but in the endgame a value of 0.6 was needed. To account for this, I changed pick_override to 0.8 and changed line 56 of ai.py to decrease it over the game:

    elif policy_moves[0][0] > (ai_settings["pick_override"]*(1-(361-len(legal_policy_moves))/361.*.5)):

This needed an earlier definition of legal_policy_moves (I put it at line 46). After the patch, no more obvious (meaning DDK-level) blunders were seen. However, P:Pick still seemed to be stronger in the beginning than in the endgame, so I decided against decreasing the number of moves considered. (Originally it shrinks along with legal_policy_moves.)
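The patched threshold can be isolated into a small helper to see its behaviour (a sketch with a hypothetical name; 361 assumes a 19x19 board):

```python
def shifted_pick_override(base_override, n_legal, board_points=361):
    """Sketch of the patched threshold: starts at base_override on an
    empty board and decays linearly to half that value as the board
    fills up, so endgame blunders are overridden more aggressively."""
    return base_override * (1 - (board_points - n_legal) / board_points * 0.5)
```

With a base of 0.8 this gives 0.8 on an empty board, roughly 0.6 when about half the points remain, and 0.4 at the very end, close to the 0.8/0.6 values found by hand above.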

I wrote a little script to iteratively find the parameter (total number of moves seen by KataGo) that gives even games against different strengths of GnuGo. At least 10 games were played to estimate each parameter.

- GnuGo 3.8 at level 8 (ca. 10 kyu): 24
- GnuGo 3.8 at level 10 (ca. 8 kyu): 30
- GnuGo 3.8 at level 10 with 4 handicap stones and 0.5 komi (ca. 4 kyu): 49

A simple linear regression gives: (total number of moves seen by KataGo) = -4.25 * kyu + 66. This equation, together with the changing pick_override setting, might be used to drive the AI strength slider.
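The fit is easy to reproduce from the three calibration points, e.g. with a dependency-free least-squares sketch:

```python
# (kyu rank, total moves seen) calibration points from the GnuGo runs above
points = [(10, 24), (8, 30), (4, 49)]

def linear_fit(pts):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = linear_fit(points)  # slope = -4.25, intercept = 65.5
```

The intercept comes out at 65.5, which rounds to the 66 quoted above.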

With the above changes I could get an even game against GnuGo at various strengths. I also tried it at my own level, and the games were very enjoyable. GnuGo, even when given handicap stones, is too defensive; the modified P:Pick was much more creative.

_Black: GnuGo 3.8 level 10, White: modified P:Pick (elif policy_moves[0][0] > (0.8*(1-(361-len(legal_policy_moves))/361.*.5))), total number of moves seen by KataGo: 30_ (attached game: BGnuGo-WmodifiedPPick)

sanderland commented 4 years ago

@bale-go amazing work! If you were setting up these matches manually, you may be interested in https://github.com/sanderland/katrain-bots for the self-play scripts and gtp-esque connectors.

One issue I have with formulas like this is deciding which parameters are exposed to users and how to expose or explain them to users -- or whether to hide them altogether.

bale-go commented 4 years ago

Thank you for the suggestion about katrain-bots. I used PyAutoGUI to automate the games. I wanted to test the modified P:Pick against stronger bots, so I opted for the open-source Pachi: "pachi -t =5000 --nodcnn" is 3k and "pachi -t =5000:15000" is 3d currently on KGS.

More than 10 games were run against each bot. After iteratively finding the correct parameter (total number of moves seen by KataGo), the games were quite balanced, without any serious blunders. Neither bot had an extra advantage at the beginning, middle, or end of the game.

_Black: pachi -t =5000:15000 (rank=3d), White: modified P:Pick (elif policy_moves[0][0] > (0.8*(1-(361-len(legal_policy_moves))/361.*.5))), total number of moves seen by KataGo: 115_ (attached game: BPachi3d-WModPPick)

Even games with different bots at different values of the total number of moves seen by KataGo:

- GnuGo 3.8 at level 8 (ca. 10 kyu): 24
- GnuGo 3.8 at level 10 (ca. 8 kyu): 30
- GnuGo 3.8 at level 10 with 4 handicap stones and 0.5 komi (ca. 4 kyu): 49
- pachi -t =5000 --nodcnn (3 kyu): 66
- pachi -t =5000:15000 (3 dan): 115

Linear regression did not give a good fit over this wider rank range. Theoretically it makes more sense to regress on the log10 of the total number of moves seen; that way a negative number of seen moves is impossible.

(attached: regression plot)

The equation: (total number of moves seen by KataGo) = int(round(10**(-0.05737*kyu + 1.9482)))

The equation works for ranks from 12 kyu to 3 dan, which covers more than 90% of active players. Note that since there is no 0 kyu/dan, 3 dan = -2 kyu. This equation, with the pick_override setting changing as 0.8*(1-(361-len(legal_policy_moves))/361.*.5), might be used to implement the AI strength slider for 90% of players.

The equation has another nice feature: extrapolating the line gives ca. 10 dan for perfect play, where the total number of moves seen equals the size of the go board (361).
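Reading the fit as log-linear (log10 of the moves-seen parameter linear in kyu), the calibration can be written as a small helper (hypothetical name) that reproduces the values above:

```python
import math

def moves_seen(kyu):
    """Calibration sketch: log10 of the moves-seen parameter is linear
    in the kyu rank; dan ranks enter as negative kyu (3d = -2 kyu)."""
    return int(round(10 ** (-0.05737 * kyu + 1.9482)))

# The parameter grows towards the whole board (361 points) as the
# rank is extrapolated past the calibrated 12k - 3d range.
```

For example, 10 kyu maps to 24 moves seen and 3 kyu to about 60, matching the calibration table up to the scatter of the fit.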

> One issue I have with formulas like this is deciding which parameters are exposed to users and how to expose or explain them to users -- or hide them altogether.

I think it would be nice to have a simple view where one could set the kyu level directly. Maybe a tournament mode could be added later, where one starts at a user-set rank: when human players win they gain a rank, and when they lose they drop a rank.

sanderland commented 4 years ago

Any way you could have a look at this for P:Weighted and ScoreLoss? I think they're my preferred AIs, and I'm curious how they perform on blunders in the early game vs. the endgame.

bale-go commented 4 years ago

The reason I did not use ScoreLoss is that it heavily depends on max_visits and is much slower.

Theoretically, I find the approach of P:Pick better. The value of the NN policy seems rather arbitrary in many cases; one can see this by comparing the values of score_loss and NN_policy for a given position. The absolute value of NN_policy does not directly reflect the score_loss: for example, NN_policy(best) = 0.71 with score_loss(best) = 0 points, while NN_policy(second_best) = 0.21 with score_loss(second_best) = 1 point. However, I found that the ordering of moves from best to worst is very similar for score_loss and NN_policy. P:Weighted relies on the absolute value of NN_policy; P:Pick relies only on the order of the moves. The latter is more robust.
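The contrast can be made concrete with a toy policy (values made up for illustration): a P:Weighted-style sampler depends on the absolute policy values, while a P:Pick-style selector only uses their ordering.

```python
import random

# Toy policy values (made up for illustration): the absolute numbers
# are arbitrary, but the ordering best > second > junk is meaningful.
policy = {"best": 0.71, "second": 0.21, "junk": 0.0001}

def weighted_choice(pol, rng):
    """P:Weighted-style: sample in proportion to absolute policy values."""
    moves, weights = zip(*pol.items())
    return rng.choices(moves, weights=weights, k=1)[0]

def rank_pick(pol, n_picks, rng):
    """P:Pick-style: sample candidates, keep the best-*ranked* one;
    only the ordering of the policy values matters."""
    order = sorted(pol, key=pol.get, reverse=True)
    picks = rng.sample(range(len(order)), min(n_picks, len(order)))
    return order[min(picks)]
```

Rescaling all policy values changes the behaviour of `weighted_choice` but leaves `rank_pick` untouched, which is the robustness argument above.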

sanderland commented 4 years ago

The compute cost and visits conflation is definitely an issue. However, a major weakness of pick over weighted is being regularly blind to 'all' the good moves and playing some policy <<1% move, at which point the ordering is quite random.

bale-go commented 4 years ago

I guess what I try to argue here is that having a policy value less than 1% is not a problem per se.

If you check amateur human vs. human games, there are multiple sub-1% or even sub-0.1% moves. The obvious blunders can be removed by using a shifting pick_override setting (80% initially down to 50% in the endgame). I looked at the logs of modified P:Pick against GnuGo: the total number of moves seen was 30, a seemingly low value, but P:Pick did not make clearly bad moves (that GnuGo could take advantage of). Only 8% of the moves were below 0.1% policy, and all of them were in the first 25 moves, where the decrease of NN_policy from best to worst is steep; placing a stone on the 4th line instead of the 3rd can already give NN_policy << 1%. Only one third of the modified P:Pick moves had NN_policy below 1%.

In the end, the user experience is the most important thing. The runs with different bots show that the modified P:Pick policy makes a balanced opponent across a wide range of ranks. You might add a condition to remove NN_policy < 0.1% moves, but I think humans around 10 kyu make those too from time to time.

sanderland commented 4 years ago

<<1% is more like 0.1%, which is more often problematic (first-line moves and such). Anyway, could you PR this as a new AI option into the v1.1.1 branch? If we have them side by side we can see where it leads.

bale-go commented 4 years ago

This is the first time I use GitHub (I only registered to participate in this fascinating project). I will try my best.

sanderland commented 4 years ago

Refactored a bit after the merge and added tests, since it was turning into quite the spaghetti. It went all the way from losing by 80 points to near jigo against p:weighted and looks nice -- what bounds do you think there are on the rank? Will see about running a couple of OGS bots on this and see where their ranks end up.

bale-go commented 4 years ago

The upper limit currently is the strength of the policy network, around 4d. I played with it at 20k to check that everything works at lower strengths; it seemed to play like a beginner, as expected. But I do not know of any bots that play in that range to test balanced play from opening to endgame, as I did with the 3d - 12k range bots. Running OGS bots at different kyu settings (maybe 8k, 5k, 2k, 2d?) is a great idea. Let's see some real-life data.

sanderland commented 4 years ago

Got them working on OGS -- seems to work nicely, but it really shows how bad local is, after adding an endgame setting to it!


sanderland commented 4 years ago
 * ai:p:rank(kyu_rank=2): ELO 1326.7 WINS 249 LOSSES 39 DRAWS 0
 * ai:p:territory(): ELO 1302.0 WINS 185 LOSSES 65 DRAWS 3
 * ai:p:tenuki(): ELO 1276.5 WINS 192 LOSSES 60 DRAWS 0
 * ai:p:weighted(): ELO 1156.7 WINS 206 LOSSES 81 DRAWS 1
 * ai:p:local(): ELO 1106.5 WINS 130 LOSSES 162 DRAWS 0
 * ai:p:pick(): ELO 1044.6 WINS 125 LOSSES 127 DRAWS 2
 * ai:p:rank(kyu_rank=6): ELO 1026.3 WINS 167 LOSSES 120 DRAWS 2
 * ai:p:rank(kyu_rank=10): ELO 761.2 WINS 86 LOSSES 202 DRAWS 0
 * ai:p:rank(kyu_rank=14): ELO 582.2 WINS 45 LOSSES 247 DRAWS 0
 * ai:p:rank(kyu_rank=18): ELO 417.3 WINS 3 LOSSES 289 DRAWS 0
bale-go commented 4 years ago

Pretty cool! The kyu_rank of ai:p:rank varies linearly with the ELO data, which is a pretty good sign.

sanderland commented 4 years ago

(screenshot) Some spot-on ranks there, though the sample size is still small.

sanderland commented 4 years ago

After 320 games: (screenshot)

Some really weird stuff in the weaker one though (e.g. moves 153/155): https://online-go.com/game/24495021

Dontbtme commented 4 years ago

Isn't KaTrain just trying to start a capturing race in the corner? B18 makes an eye at A19, and C17 takes a liberty from White.

bale-go commented 4 years ago

I didn't think it would work out so well. All of the ranks are within one stone, except for katrain-6k, which was still 5k in the morning. The 18k bot is probably at the limit of this method's range of usefulness. It is pretty surprising that the 3d - 12k calibration worked so well at lower kyu.

I was thinking about using this method to assess the overall playing strength of a player. I saw something similar in GNU Backgammon, where it is possible to estimate your skill by looking at your moves. Currently the analysis mode can help you discover certain very bad decisions, but I think it might also be important to see the consistency of all of your moves.

I'm currently working on dividing the game into 50-move segments and calculating a kyu rank for each part of the game (opening (moves 0-50), early middle game (50-100), late middle game (100-150), endgame (150-)) from the median rank of the moves (best move is 1st, second best is 2nd, etc.). It could give you feedback on which part of your game needs improvement. For example, I tested it on a few of my games: my opening is better than my rank by two kyu, but my late middle game is terrible (3 kyu weaker). What do you think?
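The segmenting step can be sketched as follows (hypothetical names; translating a median rank into a kyu would then reuse the calibration discussed earlier):

```python
from statistics import median

def segment_move_quality(move_ranks, seg_len=50):
    """Sketch of the per-segment consistency idea: split a game's
    per-move policy ranks (1 = best) into 50-move segments and report
    each segment's median rank. Lower medians mean stronger play."""
    segments = [move_ranks[i:i + seg_len]
                for i in range(0, len(move_ranks), seg_len)]
    return [median(seg) for seg in segments]
```

Using the median rather than the mean keeps a single blunder (or lucky guess) from dominating a segment.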

sanderland commented 4 years ago

@bale-go I went a bit lower, since especially at those ranks people seem to looooove bots. Interesting idea on ranking parts of the game. I'm not sure how indicative policy rank is (playing the #5 move could be dying horribly, or could be one of many equally good opening moves, right?) -- it may be worth trying out some different metrics and seeing what makes sense. Still, a median over 50 moves should stabilize it a lot.

sanderland commented 4 years ago

https://github.com/sanderland/katrain-bots/blob/1.2/sgf_ogs/katrain_KaTrain%20aiprank%20vs%20OGS%202020-06-06%2010%2049%2041_W+30.4.sgf

Move 153: B B18 Score: W+40.9 Win rate: W 98.8% Predicted top move was A17 (W+40.1). PV: BA17 B19 C19 Move was #11 according to policy (2.42%). Top policy move was A17 (18.1%).

AI thought process: Using policy based strategy, base top 5 moves are A17 (18.12%), F19 (13.48%), E16 (10.26%), A10 (8.22%), D18 (6.56%). Picked 8 random moves according to weights. Top 5 among these were B18 (2.42%), R7 (0.12%), S11 (0.01%), P15 (0.01%), T12 (0.00%) and picked top B18.

Move 155: B C17 Score: W+36.6 Win rate: W 98.4% Estimated point loss: 15.9 Predicted top move was F19 (W+17.9). PV: BF19 E16 Move was #38 according to policy (0.04%). Top policy move was F19 (25.0%).

AI thought process: Using policy based strategy, base top 5 moves are F19 (24.98%), H18 (24.70%), E16 (15.19%), L6 (10.72%), B13 (6.61%). Picked 8 random moves according to weights. Top 5 among these were C17 (0.04%), Q15 (0.02%), S3 (0.01%), P2 (0.01%), G10 (0.01%) and picked top C17.

didn't realize n=8 at this level, makes more sense now :)

bale-go commented 4 years ago

The success in covering a wide range of strengths with the policy-pick method shows me that it captures some important aspect of the difference between beginner and expert understanding of the game. In the policy-pick method the neural network is only used to rank the moves from best to worst (the policy value is only used to weed out really bad moves).

In line with the p-pick-rank method, it is not far-fetched to assert -- according to the bot calibration and OGS data -- that a 3k player chooses the best move from ~60 candidate moves (M). The total number of legal moves on an empty board is 361 (N). We can use statistical tools to show that the median of the rank of the best move (mbr) is: mbr = ceil(N/(sqrt(exp(-1))+(2-sqrt(exp(-1)))*M)) = ceil(361/(sqrt(exp(-1))+(2-sqrt(exp(-1)))*60)) = 5

In other words, 3k players will find the 5th best move on average (well, on median ;) ) during their games.

But we can reverse the question: if analysis by a much stronger player (KataGo) shows that the median rank of a player's moves is 5, we can argue that the player is ca. 3 kyu. An important advantage is that this rank estimation does not need further calibration. If the median rank of played moves is 5 and the median number of legal moves is 300, it is possible to calculate how many moves the player "sees" (M ~ 60). We can then use the calibration equation (total number of moves seen by KataGo) = int(round(10**(-0.05737*kyu + 1.9482))) to calculate the rank.
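Roughly, this reverse estimate chains two inversions. The sketch below drops the ceil from the mbr formula, so it is only approximate and the recovered numbers land near, not exactly on, the forward example:

```python
import math

SQRT_INV_E = math.sqrt(math.exp(-1))  # the constant from the mbr fit

def moves_considered(median_rank, n_legal):
    """Invert mbr = ceil(N / (c + (2 - c) * M)) for M, dropping the
    ceil, to recover how many moves the player 'sees'."""
    return (n_legal / median_rank - SQRT_INV_E) / (2 - SQRT_INV_E)

def kyu_from_moves(m):
    """Invert the calibration moves = 10 ** (-0.05737 * kyu + 1.9482)."""
    return (1.9482 - math.log10(m)) / 0.05737

# e.g. a median best-move rank of 5 on a full 19x19 board suggests
# roughly 50 moves considered, i.e. a few-kyu player by the calibration.
estimate = kyu_from_moves(moves_considered(5, 361))
```

Because the ceil is dropped, the inversion is a continuous approximation of a step function; averaging over many moves (as the script does) smooths this out.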

As I mentioned earlier, we can use this method to evaluate parts of the game.

I wrote a script to calculate ranks by this method. Here are two examples to showcase it.

GnuGo 3.8 -level 10 (8 kyu):
- moves 0-50: 5.5 kyu
- moves 50-100: 18 kyu
- moves 100-150: 10.5 kyu
- moves 150-end: 0.5 kyu

It seems that GnuGo developers did a terrific job with the opening (hardcoding josekis etc.) and the endgame, but the middle game needs some improvement.

pachi -t =5000 --nodcnn (3 kyu):
- moves 0-50: 0 kyu
- moves 50-100: 1 kyu
- moves 100-150: 7 kyu
- moves 150-end: 7 kyu

Pachi was ahead in the first 100 moves of the game against katrain3k, but then it made a bad move, and MCTS bots are known for playing weirdly when losing. The changing ranks show this shift.

Please let me know if you are interested in a PR.

sanderland commented 4 years ago

18k seems suspect, no? That's a huge rank difference. Then again, pachi doing well... is it just biased towards MCTS 'style'? Feel free to PR and we can see where this fits in. It may not make it in as easily as the calibrated rank bot, but it's really interesting to play around with and see how we can integrate it.

bale-go commented 4 years ago

Indeed, 18k is a huge difference. In the long run, maybe it would be better to color-code the parts of the game, similarly to the point loss, with the calculated rank for the whole game as the reference: if a part of the game is much worse (worse than -5k), it would be purple; -2k to -5k, red; -2k to +2k, green (neutral); better than +2k, blue (congrats!).

However, this scale would be independent of the score loss of individual moves; it would assess the overall quality of that part of the game. Thanks to the median, the calculated ranks are resistant to outliers (blunders, lucky guesses, etc.). Indeed, it could show that player A played better than player B overall, even though player A made a blunder and lost the game.

sanderland commented 4 years ago

What do you think of a short text-based report at the end of a game to start with? It could go into sgfs and even be sent in chat on ogs

bale-go commented 4 years ago

I think that would be awesome.

I made analyses of two recent games on OGS. katrain-6k (W) lost the first game; it did not play at 6k level during the game. The rank consistency analysis correctly evaluated the rank of the user elplatypus [9k]. It seems that B is pretty good at joseki, but the endgame might need some improvement.

File name: elplatypus vs katrain-6k B.csv
Player: elplatypus [9k] (B)
Move quality for the entire game: 9 kyu
Move quality from move 0 to 50: 3 kyu
Move quality from move 50 to 100: 9 kyu
Move quality from move 100 to 150: 5 kyu
Move quality from move 150 to 200: 13 kyu
Move quality from move 200 to the end: 14 kyu

File name: elplatypus vs katrain-6k W.csv
Player: katrain-6k (W)
Move quality for the entire game: 9 kyu
Move quality from move 0 to 50: 9 kyu
Move quality from move 50 to 100: 6 kyu
Move quality from move 100 to 150: 10 kyu
Move quality from move 150 to 200: 13 kyu
Move quality from move 200 to the end: 9 kyu

katrain-10k(B) won the second game in a very close match (B+1.5). It played at ca. 7k level during the game. The rank consistency analysis showed that W played a really strong game. W was better than their rank over the entire game.

File name: katrain-10k vs LadoTheBored B.csv
Player: katrain-10k (B)
Move quality for the entire game: 7 kyu
Move quality from move 0 to 50: 6 kyu
Move quality from move 50 to 100: 12 kyu
Move quality from move 100 to 150: 10 kyu
Move quality from move 150 to the end: 5 kyu

File name: katrain-10k vs LadoTheBored W.csv
Player: LadoTheBored [10k] (W)
Move quality for the entire game: 7 kyu
Move quality from move 0 to 50: -0 kyu
Move quality from move 50 to 100: 9 kyu
Move quality from move 100 to 150: 7 kyu
Move quality from move 150 to the end: 8 kyu

sanderland commented 4 years ago

It's strange that the bots don't play at their level -- are you sure you're not off by some factor, due to it being 'the best among n moves' and not 'this rank'?

bale-go commented 4 years ago

I think it is due to the underlying randomness of the p-pick-rank method. I tested the consistency analysis on some test cases. For example, when I removed the randomness by fixing the move rank of every single move to a certain number (this number slowly decreases with the number of legal moves), the calculated kyu level did not change over the game. At higher ranks (lower kyus) the analysis becomes noisier due to the use of the median, which can only be an integer (except for lists with an even number of elements). I will upload the test cases, a gnumeric spreadsheet with the equations, and a small fix for the script.

File name: 12k_not_random.csv
Move quality for the entire game: 12 kyu
Move quality from move 0 to 50: 12 kyu
Move quality from move 50 to 100: 12 kyu
Move quality from move 100 to 150: 12 kyu
Move quality from move 150 to 200: 12 kyu
Move quality from move 200 to the end: 12 kyu

File name: 8k_not_random.csv
Move quality for the entire game: 8 kyu
Move quality from move 0 to 50: 8 kyu
Move quality from move 50 to 100: 8 kyu
Move quality from move 100 to 150: 8 kyu
Move quality from move 150 to the end: 8 kyu

File name: 4k_not_random.csv
Move quality for the entire game: 4 kyu
Move quality from move 0 to 50: 4 kyu
Move quality from move 50 to 100: 3 kyu
Move quality from move 100 to 150: 5 kyu
Move quality from move 150 to the end: 5 kyu

Dontbtme commented 4 years ago

What if the panel of moves to choose from wasn't random but located around policy moves instead? Then if a bot is able to rank these moves, wouldn't it find several that are representative of its own level? If so, it could play those moves instead of the best one, and the bot would then be consistent through the whole game, wouldn't it? I'm no programmer, just to be clear ^_^ I'm throwing this idea out here in case it has some merit.

sanderland commented 4 years ago

> I think it is due to the underlying randomness of the p-pick-rank method.

Sure, but the rank is calibrated with this randomness, so the estimation is off.

'What is the average rank of a move when you pick the best of k moves out of n randomly' is not something that seems immediately obvious.

bale-go commented 4 years ago

@Dontbtme I like the p-pick method since it provides a good way to randomize the play of a very strong player. It is a nice and clean method so we can use statistical tools to work with the data.

@sanderland This is something I needed to calculate as well; it is far from obvious :) I opted for good old Monte Carlo: I simulated every possible number of legal moves N (1 - 361) and number of picks M (1 - 361), registered the best rank over 100000 trials each, and calculated the median. I found the equation connecting N and M to the median of the rank of the best move (mbr) by symbolic regression: mbr = ceil(N/(sqrt(exp(-1))+(2-sqrt(exp(-1)))*M))

Here is an example: a 3k player chooses the best move from ~60 candidate moves (M), and the total number of legal moves on an empty board is 361 (N). The equation gives mbr = ceil(361/(sqrt(exp(-1))+(2-sqrt(exp(-1)))*60)) = 5.
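The fitted constants can be sanity-checked without symbolic regression. Assuming the picks are drawn with replacement (which weighted sampling allows; without replacement the medians come out slightly lower), the exact median of the best rank matches the formula at the calibration points:

```python
import math

SQRT_INV_E = math.sqrt(math.exp(-1))

def mbr_formula(n, m):
    """The symbolic-regression fit for the median best-move rank."""
    return math.ceil(n / (SQRT_INV_E + (2 - SQRT_INV_E) * m))

def exact_median_min(n, m):
    """Exact median of the best (lowest) rank when m picks are drawn
    uniformly with replacement from ranks 1..n: the smallest r with
    P(min <= r) = 1 - ((n - r) / n) ** m >= 0.5."""
    r = 1
    while (1 - r / n) ** m > 0.5:
        r += 1
    return r
```

For N = 361 both give 5 at M = 60, and they agree at the other calibrated pick counts as well.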

sanderland commented 4 years ago

I asked in the math problems Slack channel at work:

> when picking k values s(0)..s(k-1) without replacement from (0..n-1), what is the expected value of min(s)?

The answer:

> (n-k)/(k+1). There are n-k values that are not picked, and they are distributed into (k+1) buckets, each bucket being the gap between two consecutive picked values, or before the first, or after the last picked value. min(s) is then the number of non-picked values in the first bucket.

Seems correct! This is the mean rather than the median, of course, but the median tends to be more difficult.
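The bucket argument can be verified exactly with the tail-sum formula E[X] = sum over m >= 1 of P(X >= m), since P(min >= m) = C(n-m, k)/C(n, k):

```python
from math import comb

def expected_min(n, k):
    """E[min] of k values picked without replacement from {0, ..., n-1},
    computed via the tail sum E[X] = sum_{m>=1} P(X >= m), where
    P(min >= m) = C(n - m, k) / C(n, k)."""
    return sum(comb(n - m, k) for m in range(1, n - k + 1)) / comb(n, k)
```

The hockey-stick identity collapses the sum to C(n, k+1)/C(n, k) = (n-k)/(k+1), which is exactly the answer above.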

bale-go commented 4 years ago

Yes, the median is always harder. Of course, we could use the average too, but then I would remove the best and worst 20% to get rid of outliers. It might even help with evaluating stronger players, where the median being an integer can cause huge errors (a median best-move rank shifting from 2 to 3 means a huge difference in rank, while the average might just change from 2.4 to 2.6). Anyway, I ran 25 games between two 6k p-pick-rank bots; the estimated ranks are not significantly different from the nominal 6k. (attached: 6k-mean-stdev)

bale-go commented 4 years ago

I rewrote the consistency analysis script to use the average after removing the best and worst 20%. The estimated ranks did not change significantly, but the standard deviation decreased for the segments. (attached: 6k-mean-stdev-average)
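A 20% trimmed mean of this kind is a one-liner (a sketch; exactly where to cut for short lists is a judgment call):

```python
def trimmed_mean(values, trim=0.2):
    """Average after dropping the lowest and highest `trim` fraction
    of the sorted values (here: the luckiest and most blunderous
    move ranks), falling back to the plain mean for short lists."""
    s = sorted(values)
    cut = int(len(s) * trim)
    kept = s[cut:len(s) - cut] if cut else s
    return sum(kept) / len(kept)
```

Unlike the median it varies continuously, which is why it reduces the per-segment standard deviation at stronger ranks.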

sanderland commented 4 years ago

Rewrote your script a bit to work on sgfs / use the engine; see https://github.com/sanderland/katrain-bots/blob/master/analyze_rank.py. Overall, inverting the rank to get kyu is hard; I'm getting some really high numbers for this example.

bale-go commented 4 years ago

I found the problem: the best rank for a move in your version is 0, while in mine it is 1. You can fix it by changing n_moves to the following:

    n_moves = math.floor(0.40220696+averagemod_len_legal/(1.313341*(averagemod_rank+1)-0.088646986))

That particular example game is really interesting: both players choose the highest policy move like a pro :) Looking at the games of p-weighted, I think it suffers from the same problem I discussed in my first post for p-pick (0.95, 5, 0.33): it is really strong in the beginning and builds an advantage, then sometimes blunders in the end. You can see this pattern in "katrain_katrain-weighted (6k) vs Jaafar (7k) 2020-06-09 15 37 59_B+44.8.sgf" or "katrain_biz20 (9k) vs katrain-weighted (6k) 2020-06-09 15 38 06_W+54.6.sgf", and the score graph tells the same story (L7 in the latter game).

After the n_moves patch, the rank of human players and p-pick-rank bots were estimated pretty well.

P.S.: Unfortunately, the nice and simple calculate_rank_for_player_alternative_try function won't work, since we remove the best and worst 20%; I needed symbolic regression to get n_moves.

sanderland commented 4 years ago

Thanks for the PR!

> P.S.: Unfortunately, the nice and simple calculate_rank_for_player_alternative_try function won't work, since we remove the best and worst 20%, I needed symbolic regression to get the n_moves.

I didn't remove outliers in calculate_rank_for_player_alternative_try -- and it did give some reasonable output for other games. It's interesting that p:weighted has this blundering problem though; the SGFs should be very rich and show you why. I will run the bots for a while and get a few hundred sgfs to use, now that I've hacked gtp2ogs to give me rank etc.

bale-go commented 4 years ago

You are right. Without removing outliers it should give the exact result. Please keep me posted on the ogs sgfs. I'm really interested in the results.

One thing I noticed is that human players are really good in the first 50 moves. I think it is because many of us learn josekis by heart, which is possible since that is a relatively well-defined part of the game. I would guess that if we did not allow moves near the corners (making joseki memory useless), we would get a play quality closer to the rest of the game. It would be similar to Fischer random chess. Another issue this analysis could help with is cheating with the help of AI; the analysis could provide some clues.

sanderland commented 4 years ago

Yep, and in the endgame the number of moves is small and the best one can be really obvious, so calling someone a 9d for doing hane-connect a few times is also tricky. Just pushed 130+ games (people love the new bots, it seems!).

sanderland commented 4 years ago

https://online-go.com/game/24605193 This user reported the 2d version as '2d is super week. even if I lose all the time I'm able to be ahead during the game which is innapropriate for 2 d. mistakes in 33 joseki with attachement is always got trict. and also does not see ladders at all'

Dontbtme commented 4 years ago

This is a known blind spot in KataGo, which is why, in one of his releases, Lightvector "Added option avoidMYTDaggerHack = true that if added/enabled in GTP config, will have KataGo avoid a specific opening joseki that current released networks may play poorly." That being said, I have no idea how KaTrain's default 15-block network would react to this hack; some experimental networks have been released recently, after training on known blind spots such as this one, although they are all 20-block networks or bigger. (By the way, the 15-block network stopped being trained a long time ago, while the 20-block network continued training on self-play games from the 30- and 40-block networks. Also, I'm not sure PDA was a thing yet when the 15-block network was still being trained, since it will always invade at 3-3 as White's first move in a 9-handicap game, for example.)

bale-go commented 4 years ago

@Dontbtme Interesting! I didn't know that these blunders didn't get weeded out during the training process.

I downloaded my game record from KGS and analysed the 19x19 games with the script. It is amazing to see the progress over the years. I can even see when I started to take learning joseki seriously :)

sanderland commented 4 years ago

@bale-go That's a very complicated joseki created by a pro as a high-level trick sequence, so it's not too surprising it doesn't come up like that. After a fix: (screenshot)

bale-go commented 4 years ago

I assume the plot shows the OGS median/average of the rank of moves vs. kyu. It looks really good. Could you please show the outlier-free average too (20%-80%)? But the most interesting thing would be the nominal kyu rating vs. the one calculated by the script.

sanderland commented 4 years ago

(screenshot)

have a look at https://github.com/sanderland/katrain-bots/blob/master/analyze_games.ipynb

bale-go commented 4 years ago

Very nice! This is pretty strong evidence that the average move rank correlates with the kyu rank. And the shape of the "average move rank vs. kyu rank" curves seems to be the same as the one in the script; the subplots at In [75] look the same.

The histograms are amazing. It seems that users make more >40-rank moves, but the overall shape is pretty similar between bots and users.

I would expect that the "nominal kyu rank vs script calculated kyu rank" plot will be quite linear for the bots. It has to be due to the definition of p-pick-rank bot. And I guess that the same plot for users will be similar.

bale-go commented 4 years ago

I analysed the sgf_ogs directory. I included only 19x19 games with at least 100 moves.

P-pick-rank bots: scatter and density plots (attached: bots_scatter, bots_density).

Users: scatter and density plots (attached: users_scatter, users_density).

Kyu ranks of the bots are estimated really well, as expected. The user ranks are estimated well between 20k and 1d; in that rank region most games lie on the theoretical y=x line (darkest squares in the density plot).

sanderland commented 4 years ago

Nice work! It does seem to be a noisy business though, so I'm not sure the 'segments' are long enough to work well. More inspiration from the LZ discord: https://cse.buffalo.edu/~regan/papers/pdf/ReHa11c.pdf

bale-go commented 4 years ago

Nice find! I was just thinking about writing a paper on the results. ;) "Emulating Human Play and Assessing Ranks in the Game of Go"

I made a 2D kernel density plot to see the distribution better. (attached: users_2D_kernel_density)

It would be possible to add confidence intervals to the short text message, like: Move quality from 1 to 256: 4.5 ± 1.0; Move quality from 1 to 50: 2.5 ± 2.0

sanderland commented 4 years ago

> Nice find! I was just thinking about writing a paper on the results. ;) "Emulating Human Play and Assessing Ranks in the Game of Go"

Sounds like a plan, my h-index needs a bump ;)

It does seem the fit could be improved -- the OGS 2d bot is >2d and the 18k bot is now at 23k or so; is that consistent with the density plot?

Dontbtme commented 4 years ago

If I may, OGS is peculiar as far as ranks are concerned, since it mixes live and correspondence games. I know some people play correspondence games like they would play blitz, without pondering/reading anything, but I think it's fair to say that most people play much better moves in correspondence games than in live games. In my case, I wouldn't be surprised if I were 3 stones stronger in correspondence games than in live ones. Anyway, my point is that you'd probably get much more consistent results across ranks if you picked games from a server like KGS instead, imho.