hanabi / hanabi.github.io

A list of Hanabi strategies
https://hanabi.github.io/
Creative Commons Attribution Share Alike 4.0 International

Convention/Framework Proposal: Dawn #589

Closed pianoblook closed 2 years ago

pianoblook commented 3 years ago

We've had great success with these changes the past few months, and were recently encouraged to share this with the community. The following is a segment of this doc. So, figured I'd drop it on github for discussion!

Dawn

At the start of the game, cards on slot 1 are no more likely to be playable than on other slots. Therefore, we’ve developed this strategy to maximize the ability to efficiently “flush” playables out of players’ opening hands in such a way that doesn’t over-prioritize slot 1. We call this extra-early-game strategy “playing at Dawn”. We treat clues given at Dawn very similarly to those given in Duck variants. Specifically, we change some clue interpretations to fit with this chart.

Dawn extends for the first two rounds of any game: Turn 1 through Turn [2 * (player #)]

The general rule is that clues that would require more than one blind-play from a single player are instead asking for something special. This really only impacts clues given to certain 3s and 4s:

Dr-Kakashi commented 2 years ago

One point that I've found interesting: I randomly looked at 2 max-score, 5-suit, 3p games and found that we ended the games at 48 turns and 59 turns, respectively.

Dawn in a 3p game is 6 turns

This means that Dawn theoretically takes up between 10% and 13% (~11.5%) of a game: 6/48 = 12.50% and 6/59 = 10.17%.
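For anyone who wants to re-run these percentages with other player counts or game lengths, here is a minimal sketch in Python. The function names `dawn_turns` and `dawn_share` are made up for illustration; the only inputs taken from this thread are the "two rounds" definition of Dawn and the 48-turn and 59-turn game lengths.

```python
def dawn_turns(num_players: int) -> int:
    """Dawn covers the first two rounds, i.e. two turns per player."""
    return 2 * num_players


def dawn_share(num_players: int, total_turns: int) -> float:
    """Fraction of the game's turns that fall inside the Dawn phase."""
    return dawn_turns(num_players) / total_turns


# The two 3p game lengths quoted above.
for total in (48, 59):
    print(f"{total} turns: Dawn is {dawn_share(3, total):.2%} of the game")
```

The same helper gives 10 Dawn turns for a 5p game, which is the figure used later in the thread.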

Note: I had to do a huge edit on this post because I went over it with Ace and found some mistakes. I then had Val pull up 35,918 games for no variant, 3p, 5 suits and found the average turn count to be 59, so now we have a highly accurate number to start off.

10.17% is how much of the game you guys are asking to change and how much of the game is taken up by the new conventions.

If possible we need to find the # of games that have dawn vs games that don’t have dawn. This is important to see the # of games dawn and hyphenated are fighting for. This would also let us determine if we should even add a phase. For example, let's say we've found that 20% of games contain a dawn move, would it then be worth it to have players be aware of a phase that won't happen 80% of the time?

So let’s say that 50% of all games in existence have dawn moves in them. I highly suspect it is much lower because I've personally had trouble finding games with dawn moves in them. By my analysis 37.5% of those games are played better with dawn vs 62.5% of those games are played better with hyphenated.

Dawn: 50% * 37.5% = 18.75%

Hyphenated: 50% * 62.5% = 31.25%

We then need to take the difference of these two numbers, because if we accept one over the other, we are choosing to play inefficiently the rest of the time. If we choose to accept Dawn, we are throwing out the Hyphenated lines that would've done better in games where a dawn move was available.

18.75% - 31.25% = -12.5%
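The arithmetic above can be sketched as a small Python helper so the assumed inputs are easy to vary. The 50% availability figure is the hypothetical from this post and the 37.5% / 62.5% split is the poster's own estimate; the function name `net_share` is illustrative only.

```python
def net_share(p_dawn_available: float, p_dawn_better: float):
    """Return (Dawn share, Hyphenated share, Dawn minus Hyphenated).

    p_dawn_available: assumed fraction of games that contain a Dawn move.
    p_dawn_better: assumed fraction of those games where the Dawn line wins.
    """
    dawn = p_dawn_available * p_dawn_better
    hyphenated = p_dawn_available * (1 - p_dawn_better)
    return dawn, hyphenated, dawn - hyphenated


# Hypothetical 50% availability with the 37.5% / 62.5% split from the post.
dawn, hyphenated, diff = net_share(0.50, 0.375)
print(f"Dawn {dawn:.2%}, Hyphenated {hyphenated:.2%}, difference {diff:.2%}")

# A lower availability (the 20% mentioned earlier) shrinks the gap.
print(f"difference at 20% availability: {net_share(0.20, 0.375)[2]:.2%}")
```

With the split held fixed, the size of the difference scales directly with how often a Dawn move is available at all.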

[Image: 2022-01-07_20-48-16]

Here you can visually see what I mean. When there are fewer games where you can do dawn moves, there are fewer games for Dawn to take away from Hyphenated. Hopefully, this addresses the argument that if there is no dawn move available, then it doesn't matter. It does matter and it hurts Dawn in a big way.

Let's say that we have found that Dawn is better than Hyphenated. Then the next discussion should be about how much of a % improvement would allow it to pass. I suspect this would lead to a philosophical discussion, after which we would agree on an official benchmark. I believe it would be fair to compare the # of conventions added during the Dawn phase vs the % improvement. In my opinion, if we find that it's a 10% improvement, then the # of conventions added must be lower than 10. This means that in my eyes, even if there's an improvement, the complexity may not be worth it. That's the double-edged sword of passing a set of conventions as a whole.

Jayhui-q commented 2 years ago

Massive Kudos @Dr-Kakashi for the testing through the lines and the work you put in. I am quite compelled. The data you present creates a much higher evidential burden that Dawn needs to show first. In some way, I am wistful because doing 3 discharges is fun and creative (but that should not be the goal, and often clouds what is better)

In theory, I understand the points of using 3's as discharges, while being able to immediately tell the clue receiver that they're holding a 2 or 3 (that's two away). The whole dawn theory hinges on every slot in a player's starting hand having an equal probability of being playable. However, that doesn't hold once the first play/discard happens, which does happen during the 2-turn window, as it's then much more likely that they have drawn a playable on slot 1.

Yes, totally agreed. Past the first round, dawn isn't strictly more flexible. And if it's just for one round, the value of such a convention doesn't seem sufficient.

It's also consistent to just use hyphenated conventions throughout the entire game. It's odd to need to switch between convention sets (deleting 3 bluffs changes the hyphenated conventions). Even if we say Dawn is "marginally" better, is it really worth adding an extremely short phase with its own set of conventions (and the summed complexity of those conventions)? I find the answer to be no.

Very well put.

I am now solidly in the "Nay" camp at the moment.

sjdrodge commented 2 years ago

I'm willing to put in some major work to do a more thorough statistical analysis of Dawn. I'm busy for the next week or so with other obligations, but if anyone wants to collaborate on that please let me know.

For now, I just want to say that there are some serious issues with using raw efficiency as our primary/only metric of evaluation, especially when evaluating for an easy variant (e.g. No Variant).

A metric that I've found to be extremely useful is BDR (bottom deck risk) count, which is calculated by just counting the number of times that the team discards/misplays a valuable card when they do not already hold the other copy.
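As a rough illustration of how such a count could be tallied from a game log, here is a minimal Python sketch. The `Event` record and its fields are hypothetical and do not come from any existing replay format; in particular, whether a card counts as "valuable" is left as a flag the analyst fills in.

```python
from dataclasses import dataclass


@dataclass
class Event:
    action: str            # "discard", "misplay", "play", or "clue"
    valuable: bool         # assumption: analyst marks cards the team still needs
    other_copy_held: bool  # True if the team already holds the other copy


def bdr_count(events) -> int:
    """Count discards/misplays of valuable cards whose other copy is not held."""
    return sum(
        1
        for e in events
        if e.action in ("discard", "misplay")
        and e.valuable
        and not e.other_copy_held
    )


game = [
    Event("discard", valuable=True, other_copy_held=False),   # counts
    Event("discard", valuable=True, other_copy_held=True),    # other copy is safe
    Event("misplay", valuable=True, other_copy_held=False),   # counts
    Event("discard", valuable=False, other_copy_held=False),  # trash, ignored
    Event("play", valuable=True, other_copy_held=False),      # not a loss
]
print(bdr_count(game))
```

Because the count runs over the whole event list, it can be applied either to the full game or to any chosen window of turns.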

For me, Dawn was initially most exciting for variants where clues are often blocked (null, omni, pink, brown, etc.), and I similarly thought that it was a wash or maybe a slight minus for No Variant. When I started placing more value on actual win percentage, my personal evaluation changed.

In other words, the question I'm most interested in answering is whether Dawn enables the team to save more useful cards. Efficiency measures how many cards we play per clue, and I do not think our conventions struggle in that regard in No Variant.

sjdrodge commented 2 years ago

By the way, one thing that anybody can do to help, even if they don't feel like they have statistical know-how, is to simply play games w/ Dawn on and tag those games, or any old games. Use the "dawn" tag.

Dr-Kakashi commented 2 years ago

For now, I just want to say that there are some serious issues with using raw efficiency as our primary/only metric of evaluation, especially when evaluating for an easy variant (e.g. No Variant).

One of the arguments for dawn is increased efficiency, tempo, and clarity due to being able to have a toolbox to target slot 3, 4, and 5. I know it's just 1 metric, but it was one of the main metrics in the proposal.

A metric that I've found to be extremely useful is BDR (bottom deck risk) count, which is calculated by just counting the number of times that the team discards/misplays a valuable card when they do not already hold the other copy.

I took this BDR metric and analyzed it for the 9 games I did earlier. I analyzed it in terms of any discards occurring during the dawn phase and what the players are doing on turn 6 or 7. It was 0 throughout all the games except for 1 game, where Hyphenated beat out Dawn in Game 6. The BDR metric is certainly something to be aware of, but it looks like it doesn't matter all too much in such a short phase.

For me, Dawn was initially most exciting for variants where clues are often blocked (null, omni, pink, brown, etc.), and I similarly thought that it was a wash or maybe a slight minus for No Variant. When I started placing more value on actual win percentage, my personal evaluation changed.

I also agreed that Dawn might be better in those variants; however, our sample size is way too small to know for sure. It's certainly worth it to look further. BTW kipipi plays with null conventions off.

In other words, the question I'm most interested in answering is whether Dawn enables the team to save more useful cards. Efficiency measures how many cards we play per clue, and I do not think our conventions struggle in that regard in No Variant.

I'm looking forward to seeing your results.

pianoblook commented 2 years ago

I analyzed it in terms of any discards occurring during the dawn phase and what the players are doing on turn 6 or 7.

Why on earth would you only be analyzing BDR by looking at turn 6 and 7, specifically?

The whole point is getting 3s and 4s preemptively touched so that there will be a lower risk of them discarding in the future.

In case this is some sort of genuine misunderstanding, the point is to lower overall BDR over the course of the game, not just at the beginning.


As for your claim that Efficiency is a primary objective of dawn, where was that stated? I mean I genuinely have no idea which method(s) would be "most efficient", but I also don't care at all. I'd gladly sacrifice avg efficiency for higher avg discard quality. We don't need monster finesses, we don't need 3-for-1s. We do desperately need to save 3s and 4s

sjdrodge commented 2 years ago

One of the arguments for dawn is increased efficiency, tempo, and clarity due to being able to have a toolbox to target slot 3, 4, and 5. I know it's just 1 metric, but it was one of the main metrics in the proposal.

Fair enough. Won't be ignoring efficiency, don't worry!

I took this BDR metric and analyzed it for the 9 games I did earlier. I analyzed it in terms of any discards occurring during the dawn phase and what the players are doing on turn 6 or 7. It was 0 throughout all the games except for 1 game, where Hyphenated beat out Dawn in Game 6. The BDR metric is certainly something to be aware of, but it looks like it doesn't matter all too much in such a short phase.

As you probably realize, there are some major shortcomings with that approach, as much/most of the time it's not even possible to have a BDR in the first 2 rounds of the game due to zero discards, but which cards you touch during that period can have a major impact on the BDR count anyway.

I also agreed that Dawn might be better in those variants; however, our sample size is way too small to know for sure. It's certainly worth it to look further. BTW kipipi plays with null conventions off.

I'll just quote myself from earlier in the thread: "This is slightly out of scope for this proposal, but KiPiPi have a few additional Null-specific conventions coming down the pipeline, and I firmly believe once Dawn + those conventions are accepted, we can get rid of positional clues altogether, which in my mind is an enormous win."

I'm looking forward to seeing your results.

Collab?

Dr-Kakashi commented 2 years ago

I analyzed it in terms of any discards occurring during the dawn phase and what the players are doing on turn 6 or 7. Why on earth would you only be analyzing BDR by looking at turn 6 and 7, specifically?

Well, turns 6 and 7 for a 3p game. I'm basically ending it on the last turn of the dawn phase and the turn after it. I believe the focus of the discussion should be things that only happen during Dawn, as the conventions only work during Dawn. If what you guys say is correct, then the dawn clues should prevent bottom decks a majority of the time, while also being able to get cards in difficult slots.

The whole point is getting 3s and 4s preemptively touched so that there will be a lower risk of them discarding in the future.

In case this is some sort of genuine misunderstanding, the point is to lower overall BDR over the course of the game, not just at the beginning.

As for your claim that Efficiency is a primary objective of dawn, where was that stated? I mean I genuinely have no idea which method(s) would be "most efficient", but I also don't care at all. I'd gladly sacrifice avg efficiency for higher avg discard quality. We don't need monster finesses, we don't need 3-for-1s. We do desperately need to save 3s and 4s

Higher efficiency means more cards are touched. Obviously, both lines are touching different cards, but theoretically, the more cards touched and chop moved essentially means the more cards saved from bottom deck.

As you probably realize, there are some major shortcomings with that approach, as much/most of the time it's not even possible to have a BDR in the first 2 rounds of the game due to zero discards, but which cards you touch during that period can have a major impact on the BDR count anyway.

I agree that the cards touched can have a major impact on the direction of the game, but it's specifically about the cards in the starting hand, as that's what the dawn phase is for: whether or not dawn can save those cards better than hyphenated. Would it be better to see which cards in the starting hand were touched or chop moved in both lines?

Collab?

I don't know, maybe? I've done a lot of work to bring up the points I've already presented. 😅 I am interested in what the Dawn crew does to defend Dawn.

pianoblook commented 2 years ago

A few random thoughts, now that I'm getting around to reading all these posts in full. Warning in advance: this will be a bit of a ramble.

1) First and foremost, I have zero interest in any of the calculations about raw efficiency. The only thing that truly matters at the end of the day would be winrate. Even if we could somehow definitively prove that non-Dawn conventions have a 2.5 avg efficiency in the first 20 turns, vs 2.0 avg efficiency with Dawn, I truly wouldn't care - show me them BDRs & total winrates!

2) I'd expect the frequency of Dawn Clues to increase as players acclimate and start leaving more opportunities available (e.g. a common blunder is giving monster 3-for-1 1s clues or other finesses). And, more importantly, even if it doesn't result in a Dawn Clue 80% of the time (random #), it's those remaining 20% seeds that may have otherwise given you trouble. In other words, easier seeds are easily winnable regardless of system. If you really want to plan your convention sets optimally, you want to try and design more outs for the harder seeds. (speaking of which, see the upcoming 3 Saves proposal 😆)

3) Even in games where true Dawn Clues don't occur, valuable extra information will still be gained. E.g. if a color clue gets slot 1 to blind-play, Cathy gets to mark it exactly as a 2, instead of 23. If a 3 clue gets slot 1, it's exactly 1-away (or positively IDs a different clued card). If a 3 clue gets a slot 1 blind-play, then it's magically a true Double Finesse. etc.

4) As alluded to, turning on Dawn will open the ability to turn on Precision 5-Tech - which I often forget is technically a different proposal. But yes, being able to positively-ID 5ND'd cards as kt, or proactively cm as x-away cards will be quite powerful.

5) woah that's a lot of number crunching for a sample size of 8 games. also how did you choose those 8 games? You said you "used the games I posted" - I posted 12 games though? At least. I'm very confused

pianoblook commented 2 years ago

I'm also willing to put in plenty of effort to stuff like this, especially if others are willing to collaborate. As Stephen said obviously the best way would be to crowdsource the effort in some fashion, although sadly it would be impossible to have "pure" results without perfect play from all parties. There's always the option of cross-checked hypotheticals I guess.

Serious question regarding my point 5. above: can you please explain your data collection methodology?

pianoblook commented 2 years ago

Some more questions:

Dr-Kakashi commented 2 years ago
  • How were those 8 games selected?

https://github.com/hanabi/hanabi.github.io/issues/589#issuecomment-798965449

  • Why were the other games excluded?

4 Charms are hyphenated clues. We should focus on dawn-only conventions. No 2-away 4 double bluff examples were given. I believe it would be super rare to find, but I'm unsure.

  • How do you propose we move forward?

I've already stated several proposals. I'll just repeat myself. First of all, we need to see how many of the games even have dawn moves in them. It's important because you are adding in a phase with conventions that only work during that phase. Naturally, we would want to know how often we would use it.

2nd of all I'm sure you would agree that if dawn moves were available, there would be times where the hyphenated line is better than the dawn line and vice versa. What I'm seeing in this post is you guys are saying that 100% or a majority of the time Dawn wins out, but where's your proof of that other than showing theory? They're both fighting for a piece of the pie and we want the one that gives us the most pie. We should know how many hyphenated lines we are throwing out.

How did you run the hypotheticals? Did you do two sets of hypos for each game you analyzed, optimizing for both systems?

I ran the hypotheticals using the most efficient dawn line I could find, then compared that to the most efficient hyphenated line I could find. I also tried to keep the two lines as similar as I could, so naturally, they only deviate once the dawn move is done. This is the system Kimbi told me you 3 used to determine how good dawn was. She personally analyzed 2 of the games with me.

Do you think it's reasonable to be using Double Dark games, especially Dark Null games, in extrapolating for general convention use? (I sure don't)

Then why did you present these examples?

pianoblook commented 2 years ago

Okay, one thing that might be just a misunderstanding is that this was posted back in March 2021; the convention change about 4-charms-over-double-bluffs only happened >6 months later (see #755).

Then why did you present these examples?

As I wrote in that exact post, "I realized I didn't actually put in any specific in-game examples here, so here's a dump from the past few weeks". As in, I was simply hoping to give in-game examples of the moves, since at the time no one had ever seen them before.

sjdrodge commented 2 years ago

What I'm seeing in this post is you guys are saying that 100% or a majority of the time Dawn wins out

Let's avoid strawmen, please.

pianoblook commented 2 years ago

fwiw I love the idea of somehow designing a statistically significant study for testing Dawn vs non-Dawn strength. It's definitely pretty hard to show convincing evidence one way or the other, without a lot of hard data to back it up.

So I do want to say I respect what you're setting out to do. I would happily help analyze a bigger dataset + work on establishing useful metrics for comparing lines. And for sure if we can prove that it doesn't actually significantly increase winrate (in the statistical sense that is), then I certainly wouldn't want to see it passed.

But that all said, I mostly look forward to having the data back up what has felt abundantly clear from playing probably 500+ games of it over the past year

Dr-Kakashi commented 2 years ago

What I'm seeing in this post is you guys are saying that 100% or a majority of the time Dawn wins out

Let's avoid strawmen, please.

Quote from you:

I wanna give this proposal a huge thumbs up. It's very simple and powerful and I've been loving it so far. Dawn does a tremendous job of addressing the first issue (discard quality).

Quote from piper:

you have significantly more choices than you do without Dawn

Quote from jeff:

But for dawn, I think they are too good to be turned on despite the fact that they are a little bit complicated.

I know you guys are strong players. I assume you guys took the time to see how good Dawn is and would at least compare to Hyphenated. All I'm asking is to actually prove it. I took the time to analyze the games Piano presented and I do not find that dawn is as clear-cut a winner as you guys are stating it to be.

pianoblook commented 2 years ago

I don't see any reference to "100%" in those quotes - so I don't see why Stephen's point about strawmen isn't apt.

I'll rescind my interest in collaborating if it's going to just get toxic

Dr-Kakashi commented 2 years ago

I don't see any reference to "100%" in those quotes - so I don't see why Stephen's point about strawmen isn't apt.

I'll rescind my interest in collaborating if it's going to just get toxic

Obviously not verbatim, but it's clear you guys have determined Dawn to be better than Hyphenated for the first 2 turns of every player at the table. I'm wondering how you guys have analyzed that to be true? You don't need to rescind your interest in collaborating if you find me to be toxic, as you'll be collaborating mainly with Stephen to defend dawn.

waweiwoowu commented 2 years ago

I'm standing in a position where I believe people can choose which conventions (Dawn or non-Dawn) they want to apply while playing a variant like null if Dawn becomes official. We should create an "Optional Conventions" section for Dawn imo

Zamiell commented 2 years ago

We should create an "Optional Conventions" section for Dawn imo

that is too complicated, as i said earlier, the hyphen-ated conventions are one framework, you either play with them or you don't. if you don't want to play with dawn, then you should make a table that says "level 23 only" or something, in the same way that you would make a table that says "level 22 only" if you didn't want to play with Unnecessary conventions

waweiwoowu commented 2 years ago

Or specify the variants that Dawn applies to

sjdrodge commented 2 years ago

For what it's worth, I don't accept the notion that there's going to be some tremendous unprecedented proof burden before Dawn is accepted. I want to do the statistical analysis because I admit the possibility that we're evaluating this convention incorrectly, and I would like to find out if that's the case, but absent some compelling counter-evidence (which I don't think we are in possession of at present), we should of course go with the collective evaluation of the players who have weighed in, as we always do.

Dr-Kakashi commented 2 years ago

There is a lot to parse in this proposal - I think it might be better as a bunch of smaller proposals to evaluate better

I'm on this boat with Dobi. I'm not discrediting any of the conventions presented here. According to piano the 4 charm definition has changed. Maybe it's just better to present the conventions individually to be defined under the Hyphenated conventions, then we can actually do them over the course of the game, not just a short window.

I'm merely pointing out that the phase part may have been evaluated incorrectly, pending further statistical analysis. Also, I'm saying it's most likely the skill of the players that carried the team to victory, rather than dawn being the main contributor to put the team in the best position possible after the 2nd round to carry the team to victory.

sjdrodge commented 2 years ago

@Dr-Kakashi I am having trouble correlating the reported numbers with the screenshots. The clue counts don't seem to match, nor the number of played/touched cards. Also in many cases the turn number in the screenshot is not the same for both lines. Can you clarify?

pianoblook commented 2 years ago

Here's my attempt at rigorously cross-checking Kakashi's analyses. Enjoy the fruits of my Monday 😆 https://docs.google.com/document/d/1G7JlzAZcrcUPatNS_3Zj3tfpcHpj6DisDDMivn_fNZ4/edit

TL;DR: I came to fairly different conclusions in some games, and basically gave up trying to understand how Kakashi was calculating Efficiencies.

But If you want an actual summary, I added a Summary section at the end of the doc. To anyone who is interested in this topic, I do hope you at least read the last page or so of my reflections. I put a lot of effort into this, so I hope it's useful to some! I tried to go way-super-duper in depth with all this so that everything can be cross-examined and discussed.

Dr-Kakashi commented 2 years ago

I took a quick look at the google doc and I can already see some bias. You state that Dawn is for the first 2 turns of each player, yet you analyze further to see the result? I'm fine with that and understand why you did so. However, why calculate the ending efficiency past dawn and then compare it to my results? My point was to calculate the ending efficiency when dawn ends. Obviously, there would be huge discrepancies between your calculated efficiency and mine, especially on the 3p analysis, since my calculated efficiency was for 6 turns and you went and calculated efficiency up to 10 turns. Unfortunately, by doing that you are misleading people, because your results cannot be compared to mine when we're talking about completely different states of the board. Of course, what I said previously doesn't hold in a 5p scenario because now we're both analyzing 10 turns.

Another thing of note is that I did the games and only deviated when a dawn move was performed in order to try to keep the games as close as possible to compare accurately and try to reduce bias as much as possible.

I'm analyzing Game 4 here because you consider it a "blow out" win for Dawn.

Dawn
t2-4: 3D on n3, r1 clue
t5-7: 4 Charm, y2 clue
t8-10: 5CE, Baton discard

2022-01-19_03-07-15 (Game 4 Dawn Line)

Non-Dawn 2.0 [piano’s suggestion] t2-3 y4 Charm to piano (reasoning: no reason to clue the dupe p4, and p3 is 5cmable)

I’m quite puzzled by his non-Dawn line. I have no idea why kimbi would choose the P4 Charm:

I'm super confused with your non-dawn line for game 4. Your reasoning to Charm with Y4 is...poor in my opinion. Kimbi has to pick the best line for the team from her position.

  1. From Kimbi's POV she can do a 3 for 1 getting P3 and P4 to get Y1 to play. Not only is it the more efficient clue, it immediately saves both P3 and P4.
  2. No need to worry about spending 2 save clues on a future potential 5cm to save p3 and p4 when the other p4 gets discarded.
  3. Kimbi can see that piano will clue 1's to piper which piper will use to 1 ocm the 5. This means purple sets up a 3 for 1.
  4. Kimbi can now use the red 5 to 5CE the Y2 when it slides to ejection slot, which means now Piper can 1 ocm the brown 3 on the next go around.

Again there’s confusion about how he’s counting Efficiency - he counts purple to piper as a 2-for-1, when in fact the picked-up p4 is a known dupe.

I did not count purple to piper as a 2 for 1. I said it's a future prospect for a continuation line for the team, which leads to the 5th reason to do a P4 Charm:

  5. Kimbi can also see that purples will most likely be given to piper in the future, which means Piano can now Baton P4 back and redistribute cards, while also protecting Piano's chop.

You stress how important flexibility is for the team. Purple here satisfies that to me, and Kimbi can see all of this on her turn. Simply disregarding the clue just because another p4 is visible isn't a good reason. To address your comment that it's Claustrophobic: if you do the line, you'll find that the team doesn't lose any of the cards you're worried about. That's one of the benefits of being highly efficient: there are more clues in the bank to do things. Planning a line where the team will have to save in the future (2 saves, in your line) could potentially set the team up to be poorer on tempo and efficiency, which is what happened in your line.

Using Y4 to charm is just...bad. From Kimbi's POV Piano will still clue 1's to piper, but now piper can only 1 ocm the p4. Arguably he shouldn't do that, to keep p3, r5, and n3 further from chop. The 5CE is still available, however, notice that Kimbi knows she can't use the 5 to cm the p3. I think the action is potentially gone at this point, pending p4 discard and spending clues to save, which shows when you go further into your analysis. Also, it's hard for the team to get n3 and p3 at this point. By that I mean the team will now have to bank on Piper drawing a playable to use n3 or p3 to bluff.

t4 y to piper (no reason to extinguish so many juicy targets)

I don't like this reason. Blue to piper is superior: not only does it get tempo on red 1, but it also allows piper to tcm piano. It's a 2 for 1 clue that can be turned into a 3 for 1 clue. Additionally, it puts y2 in finesse position for Kimbi. The juicy targets are still available.

You're choosing to do a 1 for 1 clue on a card that's in ejection position (you're essentially killing a juicy target for 5CE), in the hopes that Kimbi can do a 2 for 1 clue to bluff out B1? Worst-case scenario, Kimbi can't do anything so she just plays g1, then y2 plays. This means you lost your discharge opportunity because r1 will slide in the hand. Now the team has to 1 for 1 the red 1 and wait for piper to discard to get the 1's in his hand activated.

I can already see that you've chosen a poorer efficiency line in addition to trying to save as many cards as possible. Of course, it would be a "blowout" win for dawn to handicap Hyphenated line like this.

I was curious when you mentioned that I kept counting dupes, so I took a look at Game 3.

2022-01-06_14-45-41 (dawn3b)

Note that I edited my post for Game 3 because in my image for the Hyphenated line only 4 cards played, but I wrote down as 5 played.

@Dr-Kakashi I am having trouble correlating the reported numbers with the screenshots. The clue counts don't seem to match, nor the number of played/touched cards. Also in many cases, the turn number in the screenshot is not the same for both lines. Can you clarify?

Anyways, after the edit, Hyphenated is still better in my analysis. However, the main complaint of that post was me counting dupes. The cards touched for the Hyphenated line are: R2, G2, B2, R4, P4, plus B1, P2, G1, P3 = 9. The reason why we're adding those 4 at the end is they were finessed and are waiting to be played. It seems you guys didn't realize I'm counting finessed cards. So 4 cards played + 9 touched = 13 cards gotten. Now we divide that by clues given which were 5. 13 / 5 = 2.6. You can see I only counted p4 once and not twice.
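The efficiency figure above can be reproduced with a few lines of Python. This is only a sketch of the counting described in this post (cards played plus cards touched, including finessed cards, divided by clues given); the function name `efficiency` and the list names are illustrative.

```python
def efficiency(cards_played: int, cards_touched: int, clues_given: int) -> float:
    """Cards gotten (played + touched, finessed cards included) per clue."""
    return (cards_played + cards_touched) / clues_given


# Game 3, Hyphenated line, using the cards listed in the post.
clued = ["R2", "G2", "B2", "R4", "P4"]   # p4 is counted once, not twice
finessed = ["B1", "P2", "G1", "P3"]      # finessed and waiting to be played
touched = len(clued) + len(finessed)

print(touched)                  # 9
print(efficiency(4, touched, 5))  # 2.6
```

Listing the touched cards explicitly makes it easy to check that no duplicate is counted twice.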

It's good that you are reanalyzing the games to show the best dawn line. I'm not an expert dawn player and tried to find the best dawn lines with a group. I've jumped down to the summary and see that you've found dawn won 3 (combining slightly favored and favored) of the games while hyphenated won 2. You found 1 to be controversial and 2 to be relatively equal, which means those 3 are thrown out. So overall, do you feel it's enough to add in a phase and the set of conventions that are only available during dawn for a gain of 1 game overall? That 1 game mainly being Game 4 as it's considered the blowout win. We both agreed that Dawn wins for Game 7. To me, it isn't clear that dawn is superior.

Personally, I am in favor of not doing a dawn phase, but instead to have the conventions be used throughout the early game. Jeff and I actually presented some of the precision 5 tech stuff in the past. Such as 5NE (5 pulling a 2 away card) and 5ND (pulling trash). If you're proving that precision 5 tech is so good, we should just reopen those proposals to pass them officially. That way they'll be active throughout the early game, instead of just 2 short turns per player.

Correct me if I'm wrong, but to me, the bulk of the proposal is really turning off 3 bluffs and turning them into discharges instead. Again, in theory, I agree that in the starting hands, every slot is equally playable. However, once a play/discard happens it hurts the dawn proposal.

I feel it's unnecessary to make side comments towards me in your doc. I'm more than fine with having made mistakes in my analysis, which I have made and corrected. Take comfort in knowing that I didn't analyze the games by myself; I did it with a group, a few with Kimbi and the rest with others. Our goal was simply to see if there was even a benefit to doing dawn over hyphenated, and we all (excluding Kimbi) agree that the Hyphenated line was better per the points I've posted.

sjdrodge commented 2 years ago

The discrepancy between your reported numbers and the screenshots is not due to finessed cards. It's well understood that finessed cards should be counted as "played/touched" for efficiency purposes. Also, could you please explain why the lines that are being compared don't always end on the same turn number?

If we're going to compare results, let's make both analyses as high quality as we possibly can. At present I find it very easy to follow and check piano's lines and calculations, but I'm rather lost when trying to check yours. I get totally different numbers in the first few cases (I stopped checking at that point because it's time-consuming, so I'd like to wait until you've had a chance to double check them before checking the rest).

sjdrodge commented 2 years ago

Regarding game 4, I agree that piano's Hyphen-ated line is superior to the original offered line (yes, even though it intentionally has lower efficiency). Our conventions have pretty strict rules about which cards can be saved directly. When we want to save cards that cannot be directly saved, we have a few options:

The issue that piano identifies in game 4 is that we already have the second copy of p4 and aren't starving for clues, so touching p4 isn't particularly valuable. Similarly, p3 can be saved with a 5CM. y4 on the other hand is a unique card for which there's no reliable plan to save the card later, so giving the 3-for-1 4 charm w/ purple not only saves cards that we have adequate other tools to save, it also accelerates the demise of y4.

We are playing with small percentages here, but in my estimation a y4 bottom deck is more likely to cause a loss than being short one clue in a tight endgame. Especially given how many more ways we have to get reliable 2-for-1's in the late game these days.

Perhaps putting it this way also helps demonstrate why it's a bit myopic to limit our analysis of Dawn's effects purely to things which can be observed in the first two rounds of the game.

Edit: There are two additional factors that make the purple 4 charm less desirable:

  1. The p4 is on chop and orchestrating safe discards is important.
  2. If we have to pick one p4 to keep, I think we'd slightly prefer to keep the p4 that is in a different hand than the p3.

Dr-Kakashi commented 2 years ago

> The discrepancy between your reported numbers and the screenshots is not due to finessed cards. It's well understood that finessed cards should be counted as "played/touched" for efficiency purposes. Also, could you please explain why the lines that are being compared don't always end on the same turn number?

There are some cases where the last person to act (last dawn clue) can give a clue. If they can, I show that, which of course moves the game forward by 1 turn. If they are just playing or discarding, then the line ends on the last dawn turn.

> If we're going to compare results, let's make both analyses as high quality as we possibly can. At present I find it very easy to follow and check piano's lines and calculations, but I'm rather lost when trying to check yours. I get totally different numbers in the first few cases (I stopped checking at that point because it's time-consuming, so I'd like to wait until you've had a chance to double check them before checking the rest).

What do you want me to check and verify? The images show all the clues and the end state, including the cards gotten, cards played, and clue count.

sjdrodge commented 2 years ago

> What do you want me to check and verify? The images show all the clues and end state to show the cards gotten, played, and clue count.

Can you make sure that the numbers match the game states shown in the screenshot? I wasn't able to make them match for the first three deals, at which point I gave up.

Dr-Kakashi commented 2 years ago

> What do you want me to check and verify? The images show all the clues and end state to show the cards gotten, played, and clue count.

> Can you make sure that the numbers match the game states shown in the screenshot? I wasn't able to make them match for the first three deals, at which point I gave up.

which game? Games 1-3?

sjdrodge commented 2 years ago

for the first three deals

Dr-Kakashi commented 2 years ago

> for the first three deals

I'm asking because I don't know what you mean by deals. Are you talking about games 1-3?

sjdrodge commented 2 years ago

> I'm asking because I don't know what you mean by deals. Are you talking about games 1-3?

Yes.

TheDaniMan commented 2 years ago

Here is my proposal for how to analyze how good a line is (motivation below):

You play the game normally for 3 rounds (4 in a 3p game): 2 rounds for dawn plus the immediate future. After this, you look at all playable and all critical cards that were gotten in each line. For the rest of the comparison, you target your clues only on getting those cards (as opposed to going for finesses on newly drawn cards) until the two lines have the same playable and critical cards saved. You then tally whichever cards are clued / discarded / played in whichever way seems best (efficiency, BDR, cards played, current clues remaining, number of safe discards...)

The point is that when different lines get different cards, you don't take into account which clues one line did which the other didn't, and this may matter if we're e.g. forced to give a 1-for-1 on a 5save anyway, so we need to take into account all the inefficient clues we procrastinated on and all the amazing clues we elaborately set up. The reason for 3-4 rounds is to allow a chance to pick up those great clues that we set up for later.

The main flaw is that cluing every single card gotten in the other line is not representative of reality: we often give distractions to avoid giving saves, we often don't need to give saves until much later in the game when there is no pressure, and sometimes an otherwise clued playable card has its other copy drawn and bluffed, etc. However, I think that allowing for the 1-2 extra rounds to get whichever clues we want brings down most of the convention-related difference, and after that I don't expect either convention will be harmed over the other by being forced to clue whichever cards the other one got.
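The catch-up step of this proposal can be sketched as a set comparison. Everything here is illustrative: the function name `catch_up_targets` and the example card sets are made up, and the sketch only identifies which cards each line still has to get, not which clues to give.

```python
def catch_up_targets(gotten_a: set[str], gotten_b: set[str]) -> tuple[set[str], set[str]]:
    """Return (cards line A still needs, cards line B still needs).

    Each argument is the set of playable/critical cards a line has gotten
    after the initial rounds. The lines are comparable once both results
    are empty.
    """
    return gotten_b - gotten_a, gotten_a - gotten_b


# Hypothetical example: cards gotten by each line after the initial rounds.
dawn_line = {"r1", "g1", "b2", "y5"}
hyphenated_line = {"r1", "b2", "p3", "y5"}

dawn_needs, hyphenated_needs = catch_up_targets(dawn_line, hyphenated_line)
print(sorted(dawn_needs))        # ['p3']
print(sorted(hyphenated_needs))  # ['g1']
```

Once both returned sets are empty, the two lines hold the same playable and critical cards and the tally can be made.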

pianoblook commented 2 years ago

Just want to share one of my final conclusions from the analysis:

My most important note: Since these discussions are mainly focused on whether or not to include Dawn in official H-Group conventions, we REALLY should be focusing more on analysing Easy & Simple variants! The games in this dataset are RIDICULOUSLY DIFFICULT and not reflective of the main H-Group convention goals. Of the 8 games I looked at here:

  - Three were Dark Pink & Dark Null (1.43 Required Efficiency, Pace 8)
  - One was 5p Special Mix 6 Suit (1.36 R.E., Pace 10)
  - One was 5p Clue Starved & Null (1.79 R.E., Pace 10)
  - One was 5p Black 5 Suits (1.56 R.E., Pace 5)
  - The other two were "easy" (No Variant 5p, and Brown 6 Suit 3p)

If y'all are interested in really digging into this, I highly encourage us all to take a look at the Human Limits project.

I will be very excited to see how often the lines actually do diverge when using H-Group vs Dawn. If our guess of something around 20-30% of games is correct, then this should (eventually) help shed some light on the different outcomes.

Dr-Kakashi commented 2 years ago

I've talked with Stephen today and he wanted me to address why I'm missing 3 of the games from piano's list.

  1. One of the games was a mirage. I do not know mirage well enough to analyze the game, so I chose not to.
  2. One of the games was a 4 charm. As stated in the 4 charm example I did show, 4 charms were added later and are considered an official Hyphenated convention. This makes the two 4 charm examples equal to the Hyphenated line.
  3. The 3 discharge that is missing was one of the first games I reviewed; as such, I didn't have proper notes, images, or help from other players to accurately analyze the game. This is why I didn't post my analysis of that game.

The reason I added a game that wasn't on the list was that I wanted to show that I am actively playing games with dawn on and not just analyzing them hypothetically. My main goal was to show that I'm reducing bias and analyzing dawn fairly.

What Stephen and I also found together was that if piano and I disagree on how to play game 4 involving 4 charms, then we are wasting our time. The reason is that 4 charms can be done in both Hyphenated and Dawn lines. Choosing which 4 charm to use is just a matter of a player's playstyle. Unfortunately, that leads to different outcomes, and we want to reduce bias as much as we can.

If this is the case, then in a scientific setting, the data point would be thrown out. This makes Piano's analysis of Dawn vs Hyphenated equal, whereas my analysis still shows Hyphenated to be marginally better. If anything, this puts into perspective how marginal the differences are.

We should be mainly focused on showing if dawn is good or not, by focusing on the games where dawn moves affect the game.

There is value in the analysis piano and I have done. Unfortunately, it seems we should agree on how to properly analyze dawn games to compare to hyphenated games.

  1. My analysis was to analyze up until dawn ends. However, Stephen has pointed out that there is a Horizon effect that adds bias to my results. I'll show an example using Game 1:

[Image: 2022-01-06_14-17-48 (dawn1a)] The above image is the Dawn line of a random game that we found dawn moves in.

[Image: 2022-01-06_14-17-48 (dawn1b)] The above image is the Hyphenated line.

Here you can see that the Dawn line was able to get green 1 and red 1. In the Hyphenated line, though the efficiency is 2.67, the team is in a position where two 1-for-1s may need to be given. Of course, Kakashi might draw a 4 to 4 charm the red 1, but we should look at the board state and not account for the RNG of the deck. The continuation of the Hyphenated line is in a position to drop the efficiency. A single 1-for-1 clue would drop Hyphenated's efficiency to 2.25, which would be equal to Dawn's. Regardless of that, I have determined that Hyphenated is marginally better. The point of having high efficiency is to have more clues in the bank to account for inefficient clues that may need to be given.
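To illustrate the drop described above, here is a minimal sketch. It assumes the 2.67 figure corresponds to 8 cards gotten over 3 clues; those counts are not stated in this comment and are only an inference that happens to be consistent with both 2.67 and 2.25. A 1-for-1 adds one card and one clue.

```python
from fractions import Fraction


def after_one_for_ones(cards_gotten: int, clues_given: int, n: int = 1) -> Fraction:
    """Efficiency after n extra 1-for-1 clues (each adds 1 card and 1 clue)."""
    return Fraction(cards_gotten + n, clues_given + n)


# Assumed counts: 8 cards / 3 clues ~= 2.67 (inferred, not stated in the thread).
before = Fraction(8, 3)
after = after_one_for_ones(8, 3, n=1)

print(round(float(before), 2))  # 2.67
print(float(after))             # 2.25
```

Under the same assumption, a second 1-for-1 would bring the line to 10 / 5 = 2.0.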

  2. This leads into the 2nd way to analyze dawn, which is to have a set number of turns after which the analysis ends. The problem with this is that it's subjective: how many turns should we analyze out to? Piano chose 10, but of course, there will be bias when the player count changes. Are 10 turns enough to demonstrate that dawn is a better continuation than hyphenated? If we don't count turns, how many rounds past dawn should we look at? We really don't have a way to judge that, and the RNG of the deck affects the results more the further out we go. Piano has determined that Dawn is marginally better.

  3. The 3rd way to analyze dawn is to play out the turns until all the target cards of the starting hand are gotten. Unfortunately, with this analysis, the turn count will be different between dawn and hyphenated.

  4. I see Dani mentions doing option 3 in addition to restricting it to 3 turns (1 round) past the end of dawn, which brings in the issues and questions of option 2.

The metrics used to analyze games are also important:

  1. Efficiency
  2. BDR - Bottom Deck Rate
  3. Tempo
  4. Flexibility - In the sense of having more options when giving clues

All in all, it is certainly complicated to see whether Dawn is truly better or not. The analyses provided by both piano and me, though extensive, just show marginal results one way or the other. It's really hard to see which way is better when we are looking at just 10% of the game.

In talking about complications, several players I've talked to have pointed out the learning curve on dawn may be too high.

  1. Marginal Results.
  2. Knowing when Dawn Ends.
  3. Level 25 and error rate.
  4. Length of time to learn.
  5. 2 sets of conventions.

Notice the target audience when I mention the complications. As a purple player, if we have determined dawn to be good, of course I'm going to learn it to the best of my ability. We are all here to get better at Hanabi together. For green players, except for the occasional hardcore player, my complication list applies, which I feel should be taken into account, as they are the biggest player base.

Just to reiterate I personally feel it isn't worth it unless I do see "significant" improvement to the game.

At this point, I'm just going to wait for Stephen's analysis to see if any further insight is brought to the table.

Zamiell commented 2 years ago

> At this point, I'm just going to wait for Stephen's analysis

is he still doing this?

Zamiell commented 2 years ago

earth to stephen or piano

Dr-Kakashi commented 2 years ago

I would give them time to figure things out and/or take a break from the activity in the past 2-3 months. They're quite exhausted. I'm sure they'll get to it, but I don't expect anything soon.

pianoblook commented 2 years ago

I already mentioned above that it seems like a terrible idea to base any large scale decisions on this small sample. We don't want to rely on Double Dark / Dark Null strats in constructing H-Group conventions. It's also a small enough sample that no outcome would be very significant anyway.

I will note that I definitely wouldn't characterize my results as showing no difference in outcomes - Dawn seemed very clearly like the better conventional agreement, based on these deals.

I favored Dawn in half of them (plus one that I marked 'controversial'), and the only two that ended up working out better for the non-Dawn maneuver could still have been chosen even while playing with Dawn. As I said in the doc, "this means that Dawn seems to be succeeding in its goal of increasing flexibility."

(And a note for anyone curious: the one instance of an unavailable-at-Dawn clue that came up was a UDD; purple t1 here - the difference either way ended up being trivial though.)

Once again I'll urge anyone who wants to start digging deeper into these differences to check out the Human Limits project - I personally really look forward to having a larger dataset where we can compare outcomes. Once we have the replays it should be easy to run an analysis of the first 10ish turns between - well, all the systems actually sound interesting!

Zamiell commented 2 years ago

what is the status of a pr on this?

Zamiell commented 2 years ago

i guess i'll close this since no one is responding