jkomoros / sudoku

A sudoku puzzle solver, generator, and difficulty-rater built in Go
Apache License 2.0

Investigate why generated puzzles have a lower difficulty #253

Closed jkomoros closed 8 years ago

jkomoros commented 8 years ago

When generating a new batch of puzzles to deploy it became clear that the distribution of difficulties is skewed much farther left than before (or the mean shifted quite a bit). Presumably the ACTUAL difficulty hasn't changed, it's just the rating. It's unclear if it's actually better ratings now or not. If it is right, we'll need some way to nudge the generator to create harder puzzles for higher target difficulties.

jkomoros commented 8 years ago

Hmmm, turns out that the distribution is totally odd.

Currently on trunk: [screenshot: histogram on trunk, 2016-04-08]

Goes from roughly 0 to ~0.7; bimodal distribution.

With a retrained model from analysis-pipeline based on current solve data: [screenshot: histogram with retrained model, 2016-04-08]

A normalish distribution... from 0.45 to 0.75.

So the distributions of difficulties can vary widely. Perhaps when generating a model we also need a step that prints a histogram of difficulties?
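A histogram-printing step could be a minimal sketch like the following, which just bins difficulties in [0,1] and prints ASCII bars (the function names are hypothetical, not from the repo):

```go
package main

import (
	"fmt"
	"strings"
)

// bucketCounts bins difficulties (expected in [0,1]) into equal-width buckets.
func bucketCounts(difficulties []float64, buckets int) []int {
	counts := make([]int, buckets)
	for _, d := range difficulties {
		i := int(d * float64(buckets))
		if i >= buckets { // d == 1.0 lands in the top bucket
			i = buckets - 1
		}
		if i < 0 {
			i = 0
		}
		counts[i]++
	}
	return counts
}

// printHistogram renders a quick ASCII histogram, the kind of sanity check
// a model-generation step could print.
func printHistogram(difficulties []float64, buckets int) {
	for i, c := range bucketCounts(difficulties, buckets) {
		fmt.Printf("%.2f-%.2f %s (%d)\n",
			float64(i)/float64(buckets), float64(i+1)/float64(buckets),
			strings.Repeat("#", c), c)
	}
}

func main() {
	printHistogram([]float64{0.12, 0.15, 0.55, 0.58, 0.61, 0.64, 0.92}, 10)
}
```

Printing this right after training would make a shifted or bimodal distribution obvious immediately instead of requiring a manual spreadsheet pass.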

jkomoros commented 8 years ago

Is it partly that relative_difficulties is mostly scrunched in the middle?

jkomoros commented 8 years ago

When this issue was reported in #260 I first thought that just using a model from January would fix the problem. But when I generated a (manual) histogram using that old model, I got a result that looks like the first histogram image in this issue.

[screenshot: histogram generated with the January model, 2016-04-13]

That implies that it's not the model so much as perhaps the twiddlers. It could be that twiddleCommonNumbers is wreaking havoc. It could also be that when I flipped it from chain_dis_similarity to chainSimilarity and made it twiddleChainSteps that I messed up the logic in a big way.

Currently testing with all twiddlers turned off to see if that at least generates a model with reasonable coverage from 0.0 to 1.0.

jkomoros commented 8 years ago

Hmm, the quick and dirty model with both twiddlers off also has the same problem, implying it's not the twiddlers or chain dissimilarity.

Next step: checking out the whole repo from ~January and seeing if puzzles generated at that point have the same problem.

jkomoros commented 8 years ago

Here's the model trained on the exact 758540e snapshot: [screenshot: histogram from the 758540e snapshot model, 2016-04-13]

Comparing to the second histogram in this thread (eyeballing), it has the same mean and a slightly looser std dev.

This implies that perhaps with more data the model is tightening the std dev naturally? I still don't understand why there seem to be two shapes: a low-ended bimodal distribution (histograms 1 and 3 in this thread) and a tighter normal distribution (2 and 4).

Next step: look at the relative_difficulties for the normal/bi-modal distributions and make sure those are distributed right.

jkomoros commented 8 years ago

Just thinking a bit more:

It's possible that there are two types of distributions: the spread/bi-modal ones (which I expect have weird/non-convergent relative_difficulties.csv files underlying them), and a normal distribution centered around 0.6.

Focusing on the latter case: if we're randomly generating puzzles that just naturally have a difficulty of ~0.6, we'd expect that as we get better at rating them with a more robust model, the std dev will tighten naturally. If this is true, then we need to figure out a way to get the generator to generate harder/easier puzzles, and likely the only knob we have for that is how many cells to try to leave unfilled. (One way to test this: how many unfilled cells are in the puzzles dokugen is creating? Are they all roughly the same?)

Next steps to investigate this hypothesis: look at the relative_difficulties.csv underlying the bimodal distributions to see if they look weird. If they do, file an issue to figure out why that non-convergent behavior is happening. Separately, try a change to GeneratePuzzle that takes a factor for how many cells it should try to leave unfilled (or sets it randomly for now) and see if we get a wider distribution.
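As a quick check of the "how many unfilled cells" question, something like this could count empties across a batch of serialized grids (assuming '.' marks an empty cell; dokugen's actual serialization may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// unfilledCells counts empty cells in a serialized grid, assuming '.'
// marks an empty cell (dokugen's actual format may differ).
func unfilledCells(grid string) int {
	return strings.Count(grid, ".")
}

func main() {
	grids := []string{"1.3|..6|7.9", "..3|4.6|78."}
	for _, g := range grids {
		fmt.Println(unfilledCells(g))
	}
}
```

Running this over a generated batch would show whether the unfilled-cell counts are all clustered or actually vary.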

jkomoros commented 8 years ago

Thinking about GenerateGrid more closely, there's no obvious place to cause it to fill or unfill more cells. Trying dokugen with a few different symmetry values and looking at number of filled cells, there doesn't appear to be much way to nudge that strongly.

However, looking carefully at the relative_difficulties that generated the last histogram:

[screenshot: relative_difficulties distribution, 2016-04-14]

the distribution of generated puzzles fits pretty evenly into the effective range of difficulties in the relative set.

This implies that the problem is occurring in the relative_difficulties generation: instead of being smoothed out properly they're being all scrunched in the middle.

... This implies that printing a histogram of relative difficulties in that phase would also be useful to help diagnose.

jkomoros commented 8 years ago

[screenshot: alternative difficulty visualization, 2016-04-14] This is actually a better way to visualize the puzzle difficulties. It's not exactly the same as the generated histogram two pictures above, but it is pretty similar.

jkomoros commented 8 years ago

Conceptually what I want to do is figure out where the "bulk" of the difficulties distribution is (so in the above histogram, ~-0.3 to ~0.85) and then squash and stretch that distribution to be somewhat flatter and go from 0.0 to 1.0.

This will definitely squash the low and high ends to 0.0 and 1.0, but with the benefit of a much smoother distribution.

jkomoros commented 8 years ago

Just thinking out loud: I wonder if that expx set of puzzles (the very low difficulty ones), which are all accumulating at 0, is what squashed the relative difficulties distribution and made the main body have a smaller effective range? When did those get deployed?

jkomoros commented 8 years ago

In dokugen-analysis, when we save relative_difficulties we take the log and clamp to 0.1–0.9. We should probably revisit that transformation (for example, what does it look like if we just don't do it?), and treat it as our chance to squash and stretch the distribution in a better way.

jkomoros commented 8 years ago

Currently, once we have the unnormalized relative_difficulties, we multiply by 10^8 and take the log of that. We're trying to take an unbelievably right-skewed distribution and make it normally distributed. But for too-high powers, we actually push it rightward too much. Experimentally it looks like 6 is a better power to raise 10 to than 8... But why is that? Does it differ when provided with different incoming distributions?

jkomoros commented 8 years ago

Looks like the right way to do this might be: first, normalize the data, where each row = Log(value * 10^x + 1) and x is iteratively bisected to find the value that gives a distribution with skew closest to 0. (It might make sense to trim off the top and bottom ~5% before this step, since the expx puzzles in practice mess with the skew quite a bit.)

Then, calculate the mean and standard deviation. Squash and stretch the numbers so the mean is 0.5 and anything beyond 2 (3?) standard deviations from the original distribution is squashed to the tails.

jkomoros commented 8 years ago

Actually, if we do the thing proposed in the comment immediately above, we don't need to do anything other than very basic squashing/stretching ((val - min) / (max - min)); no need to do anything about means/stddev.

If we're trimming off the top and bottom n% (experimentally looks like 3% is reasonable), doesn't that mean that we're feeding 6% less data into the machine learning model? Isn't it better to just set all of those to the min or max?

jkomoros commented 8 years ago

For future reference: sheet where I'm playing around with the logic: https://docs.google.com/spreadsheets/d/1yPTnmYztGCiFUg0ohyzESxXo0UL0JcNgUtnNNiTPl30/edit#gid=1842249963

jkomoros commented 8 years ago

Random thought: most of the same pipeline, but then fit the puzzles (including the ones on the edge) into a forced normal distribution. Figure out what the mean/stddev should be, then line each difficulty up with where it's expected to fall along that distribution. Perhaps with NORMDIST (the Sheets equivalent).

jkomoros commented 8 years ago

This page describes how spreadsheets' SKEW function is implemented.

jkomoros commented 8 years ago

Gonum's stats package has a Skew calculation.

jkomoros commented 8 years ago

Once the main machinery has landed, we should use analysis-pipeline to compare keeping vs. trimming the upper tail, the amount of tail to remove, etc.

jkomoros commented 8 years ago

After implementing the tweak to the pipeline described here, the good news is that the distribution is far smoother (although not dramatically so). The bad news is that the R² of the model is 64 (compared to ~80) :-(