sanderland / katrain

Improve your Baduk skills by training with KataGo!
1.62k stars 227 forks

Performance throughout the game overview/rating #388

Closed sektorate closed 3 years ago

sektorate commented 3 years ago

Hello, many thanks for this amazing trainer! Feature suggestion: I imagine each move being placed into categories (e.g. blunder, mistake, inaccuracy, okay, excellent, best move) based on the percentage change in winrate it causes. The percentage ranges for these categories could be user-definable. An overall "accuracy" score out of 100 could then be generated for each player, based on the percentage of their moves the engine rates as best. These ideas are inspired by the analysis features of chess.com, which give an overall insight into the players' performance in a game; this would supplement the analysis of each individual move. Thanks again for your work, I'd love to hear your thoughts.

Screenshot (2)

sanderland commented 3 years ago

Some discussion of this in #340 - it's a bit strange to suggest adding categories when they already exist though :)

sektorate commented 3 years ago

You're right, of course!

sanderland commented 3 years ago

Some experiments on this in the 1.9 branch, starting with a 2d table of point loss vs policy. Could have an option for policy <-> ai order as well, but this is simpler for now. Space for lots more, maybe some summary stats.

@pdeblanc thoughts?

Random dan game:

image

Random ddk game:

image

sanderland commented 3 years ago

1d layout instead of 2d

image

sanderland commented 3 years ago

Idea by marcel:

image

sektorate commented 3 years ago

On first impression, 1d layout seems more easily readable than 2d. Some questions to aid iteration:

- Is it possible to create an overall performance score? E.g. if every move lost no points / was the AI top move, this gives 100% accuracy; if every move lost more than 12 points / was the AI worst move, this gives 0%. This would provide extra insights: "did I win because I played well, or because my opponent played terribly?", "I lost even though my performance was good, so this loss isn't so bad".

- Is it possible / would it be useful to combine the points lost / AI rank into one overall metric? An overall score for each move would be more concise.
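A linear mapping like the one suggested above could be sketched as follows (an illustrative formula only; the function names and the 12-point cutoff are taken from this suggestion, not from KaTrain's actual implementation):

```python
def move_accuracy(points_lost: float, worst_loss: float = 12.0) -> float:
    """Map one move's point loss to 0-100: no loss -> 100, >= worst_loss -> 0."""
    clamped = min(max(points_lost, 0.0), worst_loss)
    return 100.0 * (1.0 - clamped / worst_loss)

def game_accuracy(point_losses) -> float:
    """Average per-move accuracy over one player's moves."""
    return sum(move_accuracy(loss) for loss in point_losses) / len(point_losses)
```

A player whose moves lost 0 and 12 points would score 50% under this sketch; the real formula adopted later in the thread weights losses by move complexity instead.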

Overall it would be great to give the user control over the level of granularity: whether 2d or 1d, which stats are shown, and whether moves are graded separately by points lost / AI top move or combined into one metric, etc.

I hope these thoughts help, I love this feature already and am excited to use it.

Eric-Wainwright commented 3 years ago

The Chess.com site really did a nice job with their game report. However, they probably have 100+ developers working for them. Here's a simple version of their accuracy report. It would allow users to customize the category names in the Teaching/Analysis settings. Instead of a separate Game Report, this could also be just another tab, unless you're planning to add additional information in the future.

The accuracy stat would be a weighted average of the categories. Ideally, this accuracy information would update as users moved through the game tree, not just at the end of the game.

Note that all chess apps and sites (that I've seen) use strictly board evaluations for computing mistakes, i.e. top move - actual move. There's no reporting done on how much a move improves the prior position, i.e. actual move - prior move. We've discussed this before.

P.S. I don't know how useful the 2D performance table would be. It seems more like a curiosity rather than helpful information. But, I guess it might show how well your intuition (policy) is working versus your calculation (tree search). The accuracy information seems more helpful.

image

image

sanderland commented 3 years ago

image

```
movecomplexity = sum(policy over candidates) - sum(policy over candidates with point loss <= 0.5)
    i.e. what policy % is bad moves the AI thought were worth considering
complexity = average of movecomplexity
weighted_loss = point loss weighted by min(movecomplexity, 0.25) for dark green moves, or 0.25 for mistakes
    i.e. trying to downweigh obvious moves
accuracy = 100 * 0.75 ** weighted_loss
```
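A rough Python sketch of these formulas (the `prior` / `pointsLost` field names follow the candidate dicts quoted later in the thread; this is an approximation of the description above, not the exact branch code):

```python
def move_complexity(candidates) -> float:
    """Policy mass on candidates that lose points: total candidate policy
    minus policy on near-optimal candidates (point loss <= 0.5)."""
    total = sum(c["prior"] for c in candidates)
    good = sum(c["prior"] for c in candidates if c["pointsLost"] <= 0.5)
    return total - good

def accuracy(moves) -> float:
    """moves: list of (points_lost, candidates) for one player.
    Point loss is weighted by min(complexity, 0.25) for good moves,
    or a flat 0.25 for mistakes, to downweigh obvious moves."""
    weights, losses = [], []
    for points_lost, candidates in moves:
        w = min(move_complexity(candidates), 0.25) if points_lost <= 0.5 else 0.25
        weights.append(w)
        losses.append(points_lost)
    weighted_loss = sum(w * l for w, l in zip(weights, losses)) / max(sum(weights), 1e-9)
    return 100 * 0.75 ** weighted_loss
```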

Formulas aren't great yet, but I like the layout and fields. Midgame is just moves 50-150, which is also not perfect...

sektorate commented 3 years ago

^ this looks great! being able to focus on one stage of the game is an excellent idea.

Eric-Wainwright commented 3 years ago

Sander, I like it! Much improved over my version. :-)

I'm working on trying to understand your formulas. What is the cutoff point for the AI candidates, or is this determined by max_visits? Wouldn't higher visits skew the complexity rate upwards (more poor moves searched)?

The accuracy formula seems reasonable. I want to research to see how some of the chess apps do it.

I think the colors on the Teaching/Analysis settings should be re-ordered to match this for consistency.

Good stuff!

sanderland commented 3 years ago

It's using all candidate moves returned by the AI; this is of course influenced by visits, root noise, etc. The idea is that the sum of policy priors over low-point-loss candidates represents 'obvious good moves' and the remainder 'nice-looking but not good moves', and your move should be considered more important if the latter is big.

I am not convinced this is the best approach, but it's the first thing that kind of did something reasonable.

sanderland commented 3 years ago

> I think the colors on the Teaching/Analysis settings should be re-ordered to match this for consistency.

c2ca285

Eric-Wainwright commented 3 years ago

I did some quick research on Go and Chess apps, and the only one that seems to calculate a game accuracy is Chess.com. They call theirs "Computer Aggregated Precision Score" or CAPS. It's a proprietary model that incorporates game mistakes and other "pattern of strength" algorithms. In other words, it is a black box.

There's some controversy in the forums about how well it works. Apparently, the statistic can vary widely over games, and it does not give a great predictive power into the rank of a player.

Links here and here.

As for complexity, I think your idea has merit. I spent some time studying L&D problems earlier, trying to understand why some were more complex than others. The number of reasonable-looking branches in the search tree has the most to do with it. Whether you can tease this information out of differences between policy priors and search results will be interesting.

This feature may take lots of thought and testing. I vote to roll out something simple and get feedback on it. Maybe create a beta version that we can do some testing on.

sanderland commented 3 years ago

> This feature may take lots of thought and testing. I vote to roll out something simple and get feedback on it. Maybe create a beta version that we can do some testing on.

I generally don't hide things, it's in branch and anyone can test it. Releasing is a lot of work though, and the last time I released for feedback I got zero comments, soooo

sanderland commented 3 years ago

Testing another weighting in 3adde54, this time based on the policy-weighted point loss.

xiaoyifang commented 3 years ago

image

This part could be replaced with a chart like this, which is more intuitive:

image

sanderland commented 3 years ago

Want to test this a bit more properly. If someone could help collect a nice test set, that would be appreciated:

A variety of around 50-100 SGF games from 15k to 7d, 19x19, with at least 200-250 moves played. They should have the BR and WR fields set (as on e.g. OGS).

Eric-Wainwright commented 3 years ago

I'll commit to scraping 50 games from OGS spread across 15k to 7d. Does it matter if they're even or handicap?

sanderland commented 3 years ago

Shouldn't matter. The idea is to see the numbers by player rank more systematically

Eric-Wainwright commented 3 years ago

Here are 30 OGS games ranging from 9k to 4d.

10 OGS games 1d-4d.zip 10 OGS games 1k-4k.zip 10 OGS games 5k-9k.zip

sanderland commented 3 years ago

Code as in dde545b (weighted by complexity ~ expected point loss if playing candidates with p = policy). 40b 7.9G @ 500 visits
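The "expected point loss if playing candidates with p = policy" weighting could look something like this (an illustrative sketch using the same candidate-dict field names as the earlier snippets, not the dde545b code itself):

```python
def expected_point_loss(candidates) -> float:
    """Point loss you'd expect if you sampled your move from the policy,
    restricted (and renormalized) to the AI's candidate moves."""
    total_prior = sum(c["prior"] for c in candidates)
    return sum(c["prior"] * c["pointsLost"] for c in candidates) / total_prior
```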

image

data: https://pastebin.com/k44TYjY9

Accuracy seems OK, a bit weird on 2 outliers. Complexity is a bit all over the place; may just remove it.

Eric-Wainwright commented 3 years ago

20 more OGS games

10 OGS games 4d-7d.zip 10 OGS games 10k-15k.zip

sanderland commented 3 years ago

With the new games included. r^2 added, and added an 'ai approved' stat (move in top 5 and pt loss < 0.5; could have a better name).

image
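The 'ai approved' stat as described (move in the AI's top 5 and losing under 0.5 points) reduces to something like this (a hypothetical helper, not the branch code):

```python
def ai_approved_fraction(moves) -> float:
    """moves: (ai_rank, points_lost) per move; rank 1 is the AI's top choice."""
    approved = sum(1 for rank, loss in moves if rank <= 5 and loss < 0.5)
    return approved / len(moves)
```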

sanderland commented 3 years ago

Flat weights (i.e. not trying to downweigh obvious moves to reduce the effect of opening/endgame):

image

suggests the weighting does something useful at least!

sanderland commented 3 years ago

old idea:

```python
good_move_policy = sum(d["prior"] for d in filtered_cands if d["pointsLost"] < 0.5)
# etc.
```

image

definitely worse

sanderland commented 3 years ago

```python
adj_weight = max(0.025, min(1.0, max(weight, points_lost / 5)))
```

image

worse?
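Spelled out, the clamp above keeps every weight in [0.025, 1.0] and lets a large point loss override a small complexity weight (restating the one-liner as a function, with illustrative values):

```python
def adj_weight(weight: float, points_lost: float) -> float:
    # big blunders force a high weight via points_lost / 5,
    # then the result is clamped to the [0.025, 1.0] range
    return max(0.025, min(1.0, max(weight, points_lost / 5)))
```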

sanderland commented 3 years ago

Best result for now, as of 6a71266. Will remove complexity as a stat, as it's more about early/mid/endgame.

image

data: https://pastebin.com/raw/GsURUGSa

Keeping this unless there are any bright ideas.

Eric-Wainwright commented 3 years ago

The accuracy stat r^2 is looking pretty good. The complexity stat will need some more thinking.

> 'ai approved' stat (move in top 5 and pt loss <0.5, could have a better name)

Agree that 'ai approved' is a bit awkward. If you had category labels for pt loss, you could use the label. Not sure you need to limit it to top 5, but just the pt loss range seems good enough.

Let me know if you need more games for testing.

sanderland commented 3 years ago

> The accuracy stat r^2 is looking pretty good. The complexity stat will need some more thinking.

Will probably just kill the complexity stat.

> 'ai approved' stat (move in top 5 and pt loss <0.5, could have a better name)

> Agree that 'ai approved' is a bit awkward. If you had category labels for pt loss, you could use the label. Not sure you need to limit it to top 5, but just the pt loss range seems good enough.

The idea is that top 1 is very network-dependent, and no limit is very visits-dependent; this should be less so (as seen by KataGo self-play games ending up at near 100%).

> Let me know if you need more games for testing.

I think this is a very nice data set. If you can figure out what's up with the 1-2 outliers that might help though!

Eric-Wainwright commented 3 years ago

In the first outlier game (sunny25 vs lyq), both players missed the killing/saving of a group for many moves (163 - 202). This resulted in 20+ point loss swings for a significant portion of the game.

33352666-265-sunny25-lyq.zip

In the second outlier game (silent1 vs sunny25), both players missed a severe cut for many moves (49 - 126). Then, they missed a double sente endgame sequence for many moves (131 - 187).

33352369-209-silent1-sunny25.zip

sanderland commented 3 years ago

image

Thoughts on comparing the blunder classes to the other player rather than to all moves? It's already proving confusing; I may just remove the bars there.

sanderland commented 3 years ago

Also if someone has/can make a texture that helps make the bars look like bars, that could be nice! (as in transparency mask-only texture)

Dontbtme commented 3 years ago

Here's a mockup I made of what I think would be the most useful way to review my games. It shows everything from each player's point of view, so no matter how well or badly White played, if Black played 104 green moves out of his own total of 152 moves, then his green-moves percentage bar should be filled to about 68%. Now, if instead we want to share stats between players for some reason (meaning 'Black played 80% of the red mistakes in the game, which means White played only 20% of them'), then that could be an option toggled in the Teaching/Analysis settings. That way we could get the best of both worlds. What do you think?

Report-Mockup-Proposal

sente361 commented 3 years ago

Here is a suggestion for a more natural graphic for the Points Lost area:

image

Dontbtme commented 3 years ago

Even though a pyramid looks more natural from a graphical standpoint, the base may not always be the widest (for a double-digit kyu, maybe?). And I don't know, I feel we usually see best moves as "top" moves, visually speaking, maybe? In any case, I came up with another mockup that combines Points Lost seen from one player's perspective only as well as from both players'. How does it look?

Report-Mockup-Proposal04

sektorate commented 3 years ago

This version of the panel seems a little cluttered. I'm not sure how useful seeing the proportion of points lost between the players is either; I personally would be interested mostly in the amount/% of my moves that fall into each category. If a user is that interested in the proportion between the players, they can compare the numbers. Having best moves shown at the top makes sense to me also.

Eric-Wainwright commented 3 years ago

> This version of the panel seems a little cluttered. I'm not sure how useful seeing the proportion of points lost between each player is either. Having best moves shown at the top makes sense to me also.

^^^ Agree.

I think Sander has the cleanest layout. You don’t want to make it more complicated than this. The X-axis scale for each item can usually be inferred from the label, which is good.

I think showing the bars for all move classes is fine, to achieve consistency. Except I don't understand the X-axis scale used in Sander's version (what are the blunder bar lengths supposed to be?). I would think this should be the % of the time you played that move class within the game.

Dontbtme commented 3 years ago

I'm also only interested in the % of moves that fall into each category (separately for each player). I only added the % sharing both players' data as a whole because that was what Sander went for initially, or so it seemed to me. Anyway, you'll see below my last mockup. This one is the closest to what I would go for if it were up to me. Your thoughts?

Report-Mockup-Proposal05

sektorate commented 3 years ago

^ This looks great to me.

Eric-Wainwright commented 3 years ago

Are you arguing mostly about content or format? If content, then I agree that showing the % of moves in each category is best. The format could either be as a pie chart, a stacked chart (like yours), or a bar chart (like Sander's). (Although, I'm not sure what Sander was trying to show in his mockup :-)

As for format, a simple bar chart like Sander's is good enough for me, and it matches the style of the top section. But, I'd be Ok with either.

sanderland commented 3 years ago

I like the stacked chart, but it's a bit complicated to make, particularly hiding text dynamically. Here it is back to % by category, and colourful:

image

As you see, 1 becomes basically 0 and many games just end up being greeeeeeeeeeeen.

sanderland commented 3 years ago

Lower-level game:

image

Eric-Wainwright commented 3 years ago

I like the latest iteration and certainly would be very happy with it. Although, I don't know why you wouldn't match the same style of bars in both sections:

image

sente361 commented 3 years ago

I prefer Sander's latest iteration; a microsecond glance - "I got mostly green - yippee!" (Coloring the bars makes the data leap out at you.) In the same vein, I would also color the bars in the Key Statistics section; a nice blue would look good.

sente361 commented 3 years ago

In my opinion it is simpler and easier to understand if the central "key" of the Points Lost section is presented in this form:

image

xiaoyifang commented 3 years ago

I think a better value could have a distinct background color. For example: for Mean Point Loss, less is better; for AI Top 5, more is better.

image

At the same time, avoid using too much color.

Dontbtme commented 3 years ago

In Sander's version I think the middle part should be less prominent and somehow detached from the bars left and right, because to me it kind of seemed like every color was played a lot no matter what. I tried two mockups. They're not great, but they'll show what I mean. The first one shows the most data:

Report-Mockup210512a

The second one shows the same amount of data as Sander's:

Report-Mockup210512c

In any case, I don't think the Points Lost middle part should be as wide as the Key Statistics one; in Sander's, the colors really seemed to have been played a lot even when, say, no red or purple mistakes were played.

sanderland commented 3 years ago

image

Maybe a better blue.

@Dontbtme Sure, that looks better, but keep in mind the whole thing is a single grid layout of labels, and the line is the bottom of the header cell. Give it a try and you'll see how difficult simple things can be in Kivy ;)

Dontbtme commented 3 years ago

Still, as is, any bar looks big, is what I meant. Can't you limit the colors in the middle to around the >0.5 label etc. without changing the grid? Since the colors in the left and right columns only fill them depending on the %, why do the colors in the middle column have to fill it entirely? The colors in the middle are what pops out the most in your picture, when we should be focusing on the colors in the players' bars. I would even rather have no colors in the middle column if that's too complicated; that way the mistake colors would appear clearly and brightly in each player's column. Anyway, that's only my two cents (although I'm not sure about the dark blue you switched to in the Key Statistics either, since the UI all around is already some kind of dark blue, but I digress). But if the above isn't convincing, then maybe that's just a matter of taste, in which case just ignore it and let's move on ^_^

xiaoyifang commented 3 years ago

image

Seems wrong: value > 100%.