hanabi / hanabi.github.io

A list of Hanabi strategies
https://hanabi.github.io/

Add win percentages #325

Closed ofekih closed 4 years ago

ofekih commented 4 years ago

I view this document as a good source of strategy ideas for Hanabi, more than an actual guide. Without at least some estimates of win percentages (either for a fixed group size, e.g. 3, 4, or 5, or, if you want to be more complete, for all group sizes), this is not very useful as a guide, at least for me.

How much more likely are level 7's to win a game than level 5's? How likely are level 14/15's to win a game? I currently average around a 50-60% win rate with my group of 4 people, without this guide at all. I want to know whether it's worth investing 240+ games with my group in these strategies if they turn out to be worse than our own, and there's no way of knowing how good this guide is without percentages.

ZakisanGithub commented 4 years ago

Commenting my own opinion as just a regular member of the hyphen-ated group: If you're interested in the power of the full conventions, it might be interesting for you to read the following, which explains that one of the more controversial conventions (3 bluffs, which are part of level 9 of the learning path) was "invented" to push the win rate of "Rainbow (6 Suits)" above 95% five years ago, such that I would expect current winrates of absolutely proficient players to be higher due to this addition and further optimization. https://github.com/Zamiell/hanabi-conventions/blob/master/misc/3_Bluffs.md

You can also take a look at the following game as a stunning example of proficient play of the full conventions (noting that it was part of a competition that used the number of turns taken as a tiebreaker): https://hanab.live/replay/194321

I don't expect win percentages to be added to the learning path because I highly doubt there will ever be reliable data on it (which would require players who are highly proficient at that level to play with only those conventions). Gathering data about a certain set of conventions is difficult anyway because it depends on the players' ability to use them effectively.

For the learning path, it gets worse because it is usually played with some players who are not yet proficient with the conventions it contains. I would expect a majority of games lost while playing on the learning path to be lost to mistakes in applying the conventions properly (rather than to the conventions being too weak), which makes any data gathered from learning players relatively useless for estimating the power of the conventions. Once players become more proficient, they also usually move on to different variants of Hanab.

As a side note: as I understand it, the learning path is just a way to let people join the Hyphen-ated group more easily, rather than a collection of distinct convention sets, so I don't think members of the Hyphen-ated group have much interest in evaluating win rates for each level.

Zamiell commented 4 years ago

@ofekih Another good way to judge how good your group is compared to our best players would be to play in the biweekly competition on Hanab.live, where each team plays the same seed/deals:

https://docs.google.com/spreadsheets/d/1iMv5mfVv7PYxckPPRBAnJpXyTHACn_9HAMfF2YID16w/edit?usp=sharing

Depending on how hard you get destroyed, that will tell you how good your conventions are. (Or, more accurately, conventions in combination with current skill level. It's going to be hard to disentangle those two things, as Zakisan smartly touched on above.)

ofekih commented 4 years ago

@ZakisanGithub Thanks so much for your response! A sample game would definitely help visualize which strategies seem worth it. Also, about the 3 Bluff leading to a 95% win rate, that's incredible! Wow. We don't play with variants, but I'll be sure to try it out and compare :) The best AI agents only win around 87% of the time with 4 players on the normal game, so I wonder how much this variant will help my group :P

On the topic of computer agents, that's how I was thinking win percentages could be generated. With a computer agent, it is relatively simple to enable or disable specific conventions one by one, which would give very reliable data for the win percentages, or for the average score if people find that more useful. Since this project is on GitHub, I assumed that at least some of the contributors are avid programmers; I wouldn't be suggesting this otherwise.
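
To sketch the kind of thing I mean (purely illustrative; `HanabiSimulator` and `ConventionBot` are made-up placeholder names, not an existing library):

```python
# Purely illustrative sketch: HanabiSimulator and ConventionBot are
# hypothetical placeholders, not an existing library. The idea is to
# toggle one convention at a time and measure win rate / average score.
import statistics

CONVENTIONS = ["finesse", "layered_finesse", "bluff", "three_bluff"]

def benchmark(enabled_conventions, num_games=10_000, players=4):
    wins, scores = 0, []
    for seed in range(num_games):
        game = HanabiSimulator(players=players, seed=seed)        # hypothetical
        bots = [ConventionBot(enabled_conventions) for _ in range(players)]
        score = game.play(bots)          # assumed to return the final score
        scores.append(score)
        wins += score == 25              # perfect game in No Variant
    return wins / num_games, statistics.mean(scores)

# Ablation: remove one convention at a time and compare to the full set.
print("full set:", benchmark(CONVENTIONS))
for convention in CONVENTIONS:
    subset = [c for c in CONVENTIONS if c != convention]
    print(f"without {convention}:", benchmark(subset))
```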

Got it @ZakisanGithub. This project got big enough that a friend showed it to me as a source of conventions I could introduce to my group, and I thought it would be awesome if there were also metrics. My group came up on its own with variations of these standards from all over the level guide, but with many holes and many differences, so it would be awesome to be able to compare more concretely.

Thanks a lot for your response again, I'll watch the sample game :)

@Zamiell Thanks for the offer to join your competition, but I've been playing with my family for years, and we don't play online. It does seem like a great idea, though, and would certainly help with my family's memory :P

Re: Depending on how hard you get destroyed: I'm personally not that capable of evaluating statistics like win percentages from just a few games. If my group gets 'destroyed', does that mean others win 10% more often than us, or do they win 20% more often than us? I have a hard time imagining that it's more than 20%, unless some groups use the 'hat' strategy developed for computers, which gives the aforementioned ~87% win rate with 4 players. I find that hard to believe, though, since nothing similar is mentioned in your guide. My point is, I'm not that good at evaluating 'how hard' we get destroyed in a quantitative sense.

Anyways, I now realize that this guide was meant only for players specifically for this server. It is not meant as a 'guide of truth' or 'optimal Hanabi strategy'; it is simply meant as 'follow these standards if you want to play on this specific server'. I see now that adding quantitative metrics would not matter for this purpose. I wish someone would make a guide for Hanabi this detailed, but with metrics, so that it would be more useful for a group like mine that isn't intending to play online on any specific server.

Cheers, and congrats on a great guide. Closing this issue.

NoMercyO commented 4 years ago

Everything up to and including level 8 is fairly foundational to how we play, and forms a fairly simple and effective convention framework. If the document is too long and you want a hard and fast rule, I would say level 8 is the most important hump. Bluffs are too good not to use. The only reason they are not introduced lower down is to let new players gain familiarity with layered finesses first and help them properly distinguish between the two.

With near-perfect execution of everything up to level 8 in No Variant (which I am assuming you mostly play?), I would estimate you should expect around a 90-95% win rate, scoring 24 or 25 about 99% of the time. Most of our veteran players will not make level-8-or-below mistakes, maybe up to 5% errors, and only some small percentage of those would be score-impacting errors.

They are quite robust conventions for 3, 4, or 5 players, though probably very slightly less tuned for 3 player than the other two.

Everything from levels 8-14 is either common for experts to use but complicated to learn and to execute without simple errors, or an uncommon but simple and useful tool to add over time.

Everything at level 15 is quite rare, with techniques ranging from relevant in one in 20 games to one in 500. If you are focused on win rate in simple game variants, much of this tech is unneeded.

As others have said, competition games are the best way to test performance under ideal conditions. They are held every two weeks and are open to anyone. They aim to select accessible game modes and various player counts. I'd also be happy to glance over a few of your games (I cannot find your username on the website to look at replays). New ideas from other convention sets are also great to see. There are certainly some foundational convention sets that are different at their core and superior to ours for specific game modes (a clue framework like beri's, which is hugely dependent on clue timings, is very powerful but prone to human error), but in my experience few other groups have delved as deeply into the potential of their convention framework to eke out the extra percentages.

Zamiell commented 4 years ago

> With a computer agent, it is relatively simple to enable or disable specific conventions one by one, which would give very reliable data for the win percentages, or for the average score if people find that more useful. Since this project is on GitHub, I assumed that at least some of the contributors are avid programmers; I wouldn't be suggesting this otherwise.

It would be incredibly difficult to program a robot to play in the same style that we play in the Hyphen-ated group. Such a robot would need something akin to "common sense". Please read this section, which explains this concept more clearly: https://github.com/Zamiell/hanabi-conventions/blob/master/Reference.md#context

> I'm personally not that capable of evaluating statistics like win percentages from just a few games. If my group gets 'destroyed', does that mean others win 10% more often than us,

The rules of the competition are discussed in the Google Docs link, so maybe you skipped over them. Teams are ranked by how well they score over the course of 4 deals, compared directly to all of the other teams. This is the correct metric with which to compare two teams (or two sets of conventions), not "win rate", as you seem to be fixated on.

> Anyways, I now realize that this guide was meant only for players specifically for this server.

To be clear, the Hyphen-ated group is only a subset of the people who play on Hanab.live. You can use the platform to play with your own family or anybody else. We organize games on Hanab.live between members of our own group (e.g. Hanabi enthusiasts willing to play at a high level and push the game to the limits).

Zamiell commented 4 years ago

> 'optimal Hanabi strategy'

Maybe you are not aware, but the optimal Hanabi strategy is already known: it is called Hat-Guessing. If you want to play with it, follow this guide: https://github.com/Zamiell/hanabi-conventions/blob/master/misc/Hat_Guessing.md. We don't generally play with it because it is not very fun.

NoMercyO commented 4 years ago

Regarding computer agents: it is theoretically feasible, but don't forget that conventions are not a complete framework for an AI to play the game. Evaluating the meaning of a clue versus evaluating the merit of which action to perform are very different things. This is why AIs are still lagging behind human agents when trying to execute human conventions.

Furthermore, the best AIs in development are focused on 2-player games, as that is computationally more manageable, but they also use conventions that do not translate to humans, either derived from hat-guessing or arrived at naturally through self-play. No one, to my knowledge, has successfully developed an AI that plays with 3+ players and is capable of adapting to human conventional ideas. It is being worked on, though, as AI empathy (theory of mind) is a huge space of interest and Hanabi is a great, simple test environment for it.

I believe your 87% figure for AI performance is on the low end for 3+ players. See https://github.com/WuTheFWasThat/hanabi.rs. It is quite outdated but was quite strong for a conventional AI (vs. a self-taught AI).

ofekih commented 4 years ago

@NoMercyO Wow, I was basing my estimates on DeepMind's paper (https://arxiv.org/pdf/1902.00506v1.pdf); it's disappointing how far off it is.

@Zamiell I mentioned the 'hat' strategy in my answer; it's cool to see that you detail it in this guide. I had missed that.

@Zamiell Re: "**This** is the correct metric with which to compare two teams (or two sets of conventions), **not** 'win rate', as you seem to be fixated on": I understand that this is how you compare groups in this club, but it is clearly not the only way. Why not 5 deals, or 10, etc.? I have read several different papers on Hanabi, and most of them use the average game score as a metric, while some of them use win percentages (e.g. the aforementioned DeepMind paper). As I mentioned in my first comment (second paragraph), I think average score is also a good metric, not just win percentages; I was just thinking selfishly, since my group cares more about perfect wins. Yet playing 4 games and comparing the results is hardly the only way to measure average score, and outside of this specific group it is unlikely to be viewed as the 'correct' way, as I'm sure you will agree.

Thanks @NoMercyO, that's very kind and helpful. As I mentioned, I (currently) only play with my family, and we were just looking for more ideas for standards when we stumbled upon this. That's why you won't find me anywhere online. Thanks so much for your offer; I might take you up on it some day :)

@NoMercyO thanks for informing me about some of the AI nuances I didn't know about, time to read some more papers haha

I doubt any of you care about my opinion, but from reading the above comments and responses, it's clear to me that this group has some very kind and caring members. I will say that @Zamiell could use a bit of work on showing respect ;)

Cheers, thanks for all your help

padiwik commented 4 years ago

> Anyways, I now realize that this guide was meant only for players specifically for this server.

There are groups who have adopted strategies from the Hyphen-ated conventions and use them when playing with their friends, even in places other than hanab.live.

Zamiell commented 4 years ago

> Yet playing 4 games and comparing the results is hardly the only way to measure average score, and outside of this specific group it is unlikely to be viewed as the 'correct' way, as I'm sure you will agree.

I think the optimal way to compare two different teams of robots would be to generate, say, 10 million deals using a certain set of incrementing seeds, and then have both sets of robots play all 10 million deals, and then take the average score. This is the method used in some papers, and if I am not mistaken, it is the method used by Florrat to benchmark his bot (the current best-known Hanabi bot in the world).
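
Sketched roughly in code (the `HanabiSimulator` name below is a placeholder, not real code), the key property is that both teams play the exact same deal for every seed, so per-deal difficulty cancels out of the comparison:

```python
# Sketch of the seeded benchmark described above; HanabiSimulator and the two
# bot teams are placeholders, not real code. Both teams play the identical
# deal for every seed, so per-deal difficulty cancels out of the comparison.
import statistics

def compare_teams(team_a, team_b, num_deals=10_000_000, players=4):
    diffs = []
    for seed in range(num_deals):
        score_a = HanabiSimulator(players=players, seed=seed).play(team_a)
        score_b = HanabiSimulator(players=players, seed=seed).play(team_b)
        diffs.append(score_a - score_b)
    # Positive mean => team_a's conventions score higher on average.
    return statistics.mean(diffs)
```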

However, (I think) the point of your thread is to compare humans, not robots, so this scheme isn't relevant.

When I said the "correct" way, I of course meant a "more correct" way. Anyone outside our group, when tasked with coming up with the "best" way to have a Hanabi competition between N teams of human players, would likely come up with a similar scheme to what we have already come up with. At the very least, they would come up with something more sophisticated than "rate of perfect scores" - if only for the fact that if win rates are above 90%, then logistically you would need to play many, many games to find granularity between the best teams using this crude metric.
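
As a rough back-of-the-envelope on that last point (my own numbers, assuming independent games and a normal approximation to the binomial), resolving even a 5-point win-rate gap already takes hundreds of games per team:

```python
# Back-of-the-envelope: how many games must a single team play before a 95%
# confidence interval on its win rate is narrow enough to separate, say, a
# 90% team from a 95% team? (Normal approximation to the binomial.)
import math

def games_needed(p, half_width, z=1.96):
    # Solve z * sqrt(p * (1 - p) / n) <= half_width for n.
    return math.ceil((z / half_width) ** 2 * p * (1 - p))

# To resolve a 5-point gap, each interval's half-width must be under 2.5 points.
print(games_needed(p=0.90, half_width=0.025))   # ~554 games
print(games_needed(p=0.95, half_width=0.025))   # ~292 games
```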

Certainly, comparing 5 or 10 deals would be better than comparing 4, but for logistical reasons we want to limit it to 4, because people have real lives.

Anyways, I just re-read some of my above posts and I'm not sure where I was coming off as rude, but of course that was not my intent.

aliblong commented 4 years ago

Just adding a couple of things for @ofekih's benefit:

The two optimization criteria that have been identified, maximizing average score and maximizing probability of maximum score, both create fun gameplay patterns. The competitions actually measure something slightly different than either, which is how well a team can place against other teams on average. There is a nonzero amount of metagaming that can be done to optimize this particular criterion, and the problem can be modeled as finding the optimal tuning between the two aforementioned criteria, which are already quite highly aligned. I highly recommend that you and your friends/family participate in our competitions, first and foremost because it's fun to do so, but also because I think you're overestimating the sample size needed to get a good read on how your conventions stack up against those outlined in this repo.

> I now realize that this guide was meant only for players specifically for this server. It is not meant as a 'guide of truth' or 'optimal Hanabi strategy'; it is simply meant as 'follow these standards if you want to play on this specific server'

It's a guide to effective Hanabi conventions, and not limited to use in a particular community or play environment (although these conventions are probably harder to use in physical play, just because of notetaking limitations). Expecting it to be a 'guide of truth' or 'optimal Hanabi strategy' is a bit like expecting scientific models to represent an absolute truth.

> Maybe you are not aware, but the optimal Hanabi strategy is already known: it is called Hat-Guessing.

Hat guessing strategies are powerful in certain types of Hanabi rulesets, and these strategies can be fine-tuned to a degree that is comparable to the conventions in the main reference. Implying that hat is a monolithic strategy which is theoretically optimal is a gross oversimplification.

> We don't generally play with [hat] because it is not very fun.

Like some other points made in this thread, this should be understood to be an opinion, and certainly not one shared by everyone in the group. In fact, I and several others were playing hat for fun last night. It produces gameplay with plenty of interesting decision points.