resampling-stats / resampling-with

Source repository for third edition of "Resampling: The New Statistics"
https://resampling-stats.github.io
Other
11 stars 9 forks source link

MRG: Develop Boolean arrays #114

Closed matthew-brett closed 1 year ago

matthew-brett commented 1 year ago

Work through more explanation of Boolean arrays.

Will continue through Monty Hall problem. After which, I think we have decent tutorial background for everything we need. Of course, might consider e.g. functions later.

matthew-brett commented 1 year ago

Feedback on setdiff implementation of Monty Hall?

matthew-brett commented 1 year ago

This one is now complete and ready for (any more) review.

stefanv commented 1 year ago

I'll review this afternoon.

stefanv commented 1 year ago

See https://github.com/matthew-brett/resampling-with/pull/1

stefanv commented 1 year ago

I often think that we should bite the bullet and introduce the difference between lists and arrays early on. Lists are just so convenient for building up results incrementally, and can easily be converted to arrays for further processing.

We now avoid that complexity, but at the cost of having to reason about fixed size containers only.

If we introduce lists, we can naturally also move on to sets (more elegant than NumPy set operations) and dics, which are the data structures a reader will likely be using in practice.

I would very much prefer if we did not introduce programming patterns that we would not follow ourselves, even if they seemingly come at an initial reduction in explanation complexity.

stefanv commented 1 year ago

I have not yet read your setdiff1d version of the code, but I wrote down my initial implementation of the trial, in case that's helpful for comparison:

import numpy as np

rng = np.random.default_rng()

trial_stay = []
trial_switch = []

for i in range(1000):
    # There is a prize behind one of the doors
    # We shuffle the doors
    doors = rng.permutation([0, 0, 1])
    door_nr = np.arange(len(doors))

    # We now pick a door at random
    my_door = rng.integers(0, 3)

    # The host picks a door that is not our door and that has a donkey behind it
    host_door = my_door
    while (host_door == my_door) or (doors[host_door] == 1):
        host_door = rng.integers(0, 3)

    remaining_door_value = doors[(door_nr != host_door) & (door_nr != my_door)]

    trial_stay.append(doors[my_door] == 1)
    trial_switch.append(remaining_door_value == 1)

print(np.sum(trial_stay) / 1000)
print(np.sum(trial_switch) / 1000)

It would be nice to have a more elegant way to say "index the array with all other than x, y".

When I first met the Monty Hall problem, I found it useful to extend it to a variant with 100 doors. There, the solution becomes more intuitive: you choose a door, the host opens 98 doors without prizes, and you are asked whether you want to stick to your original choice or switch.

stefanv commented 1 year ago

Another gotcha I picked up with this example and the ships: you have to be very careful to replicate the problem exactly. E.g., in the ship example, if you fail to shuffle the ships, you're sunk. In Monty Hall, if you try to make the host door selection by choosing a door at random, and filtering out trials where that door has a prize behind it, you fail.

I think this is worth a sentence or two in each problem?

stefanv commented 1 year ago

I quite like your set approach. Here's how I put it together, avoiding the if statement:

import numpy as np

rng = np.random.default_rng()

win_if_door_kept = np.zeros(1000, dtype=bool)
win_if_door_changed = np.zeros(1000, dtype=bool)

for i in range(1000):
    doors = np.array([0, 1, 2])

    # The door with the car
    car_door = rng.integers(0, len(doors))

    # Our chosen door
    my_door = rng.integers(0, len(doors))

    # The host has to choose from the other doors:
    # those that are not ours, nor contain the car behind them
    other_doors = np.setdiff1d(doors, [my_door, car_door])
    host_door = rng.choice(other_doors)

    # We now have the option to switch doors.
    # The only door that remains is the one that neither the host nor us chose.
    remaining_doors = np.setdiff1d(doors, [my_door, host_door])
    switch_door = rng.choice(remaining_doors)

    # Now, we can record whether we would have won had
    # we kept our original choice, or had we changed it
    win_if_door_kept[i] = (my_door == car_door)
    win_if_door_changed[i] = (switch_door == car_door)

# Tally wins for each scenario
print(np.sum(win_if_door_kept) / 1000)
print(np.sum(win_if_door_changed) / 1000)
matthew-brett commented 1 year ago

On introducing lists - I agree we need those - I do that initially in https://github.com/resampling-stats/resampling-with/blob/main/source/sampling_tools.Rmd#L163 .

For sets - I did have a go at using those, but I found myself having to introduce the curly bracket notation, and pop, and we really don't have much use for them elsewhere, which is why I went for the current code.

For not using e.g. np.setdiff1d ourselves - that's true - but it's pretty easy to explain, and I think it does give the most idiomatic code for this problem. Here's the code from a previous commit where I did a bit more cleaning:

# Code more compact, array comparison at end.
my_doors = np.zeros(10000)
car_doors = np.zeros(10000)
switch_doors = np.zeros(10000)

doors = [1, 2, 3]

for i in range(10000):
    my_door = rnd.choice(doors)
    car_door = rnd.choice(doors)
    # Which door will Monty open?
    montys_choices  = np.setdiff1d(doors, [my_door, car_door])
    # There could be one or two doors left, rnd.choice will
    # select the only door, or randomly from two doors.
    montys_door = rnd.choice(montys_choices)
    # Now find the door we'll open if we switch.
    switch_doors[i] = np.setdiff1d(doors, [my_door, montys_door])
    my_doors[i] = my_door
    car_doors[i] = car_door

I think that's about as idomatic as can be - no loops, no ifs - logic is clear - so yes, I would do it that way. It's just that, to get to the cleaner version I had to spend a little time explaining e.g. that rnd.choice([1]) will always give 1, and that arr[0] = arr2[:1] is OK - time that I thought would distract from this still fairly early piece of exposition.

matthew-brett commented 1 year ago

Another gotcha I picked up with this example and the ships: you have to be very careful to replicate the problem exactly. E.g., in the ship example, if you fail to shuffle the ships, you're sunk. In Monty Hall, if you try to make the host door selection by choosing a door at random, and filtering out trials where that door has a prize behind it, you fail.

I think this is worth a sentence or two in each problem?

Could you suggest those sentences? It is possible they will introduce more confusion by introducing problems they hadn't thought of?

stefanv commented 1 year ago

On introducing lists - I agree we need those - I do that initially in https://github.com/resampling-stats/resampling-with/blob/main/source/sampling_tools.Rmd#L163 .

For sets - I did have a go at using those, but I found myself having to introduce the curly bracket notation, and pop, and we really don't have much use for them elsewhere, which is why I went for the current code.

For not using e.g. np.setdiff1d ourselves - that's true - but it's pretty easy to explain, and I think it does give the most idiomatic code for this problem. Here's the code from a previous commit where I did a bit more cleaning:

# Code more compact, array comparison at end.
my_doors = np.zeros(10000)
car_doors = np.zeros(10000)
switch_doors = np.zeros(10000)

doors = [1, 2, 3]

for i in range(10000):
    my_door = rnd.choice(doors)
    car_door = rnd.choice(doors)
    # Which door will Monty open?
    montys_choices  = np.setdiff1d(doors, [my_door, car_door])
    # There could be one or two doors left, rnd.choice will
    # select the only door, or randomly from two doors.
    montys_door = rnd.choice(montys_choices)
    # Now find the door we'll open if we switch.
    switch_doors[i] = np.setdiff1d(doors, [my_door, montys_door])
    my_doors[i] = my_door
    car_doors[i] = car_door

That code reads much better to me, and I don't see where the slicing has to be explained.

Choice of one option being that option should be intuitive.

matthew-brett commented 1 year ago

That code reads much better to me, and I don't see where the slicing has to be explained.

Choice of one option being that option should be intuitive.

Well - I think we'd have to explain it - what I think would happen is that, once explained they would think - oh yes, of course that is what it would do, but without explanation they would look at it and think - wait, where did the length 1 case go?

matthew-brett commented 1 year ago

That code reads much better to me, and I don't see where the slicing has to be explained.

The code is effectively doing this:

switch_doors_array = np.array([the_door])  # Not a scalar, but a 1D one-element array.
switch_doors[i] = switch_doors_array

Now you and I know that it's OK to use a 1D one-element array there, but I think our students will think - wait - that's any array, it's not a number - why does that work?

matthew-brett commented 1 year ago

In general - I'm repeating myself - I think we're going to have to do a bit of lumbering along, early on - it isn't going to be elegant - because you have to have a good grounding to be elegant.

stefanv commented 1 year ago

That code reads much better to me, and I don't see where the slicing has to be explained.

The code is effectively doing this:

switch_doors_array = np.array([the_door])  # Not a scalar, but a 1D one-element array.
switch_doors[i] = switch_doors_array

Now you and I know that it's OK to use a 1D one-element array there, but I think our students will think - wait - that's any array, it's not a number - why does that work?

I missed that. You'll see my code didn't have that piece of magic.

matthew-brett commented 1 year ago

I missed that. You'll see my code didn't have that piece of magic.

Yes, but I think you have switch_door = rng.choice(remaining_doors) - which is surely going to confuse - because there should always be exactly one remaining door.

stefanv commented 1 year ago

Oh, I see what you're saying. Yes, so you'd have to grab the first element of that array and assign it to remaining door, probably, if you want to be explicit (which we should be).

matthew-brett commented 1 year ago

Could we compromise, explain rnd.choice([1]), and then:


for i in range(10000):
    my_door = rnd.choice(doors)
    car_door = rnd.choice(doors)
    # Which door will Monty open?
    montys_choices  = np.setdiff1d(doors, [my_door, car_door])
    # There could be one or two doors left, rnd.choice will
    # select the only door, or randomly from two doors.
    montys_door = rnd.choice(montys_choices)
    # Now find the door we'll open if we switch.
    switch_doors = np.setdiff1d(doors, [my_door, montys_door])
    switch_doors[i] = switch_doors[0]  # There is only one door remaining.
    my_doors[i] = my_door
    car_doors[i] = car_door

?

stefanv commented 1 year ago

That works, thanks!

matthew-brett commented 1 year ago
    # There could be one or two doors left, rnd.choice will
    # select the only door, or randomly from two doors.
    montys_door = rnd.choice(montys_choices)

Oh dear oh dear - R's sample can't distinguish between a scalar and an array with a single element, so it treats an array with a single element as being an integer n, and interprets this as 1:n. I mean that sample(c(3), size=1) will give a random sample from 1:3. Do you mind leaving the conditional in there for R?

matthew-brett commented 1 year ago

Have a look now? (Haven't addressed list of lists - suggestions welcome).

stefanv commented 1 year ago

Oh dear oh dear - R's sample can't distinguish between a scalar and an array with a single element, so it treats an array with a single element as being an integer n, and interprets this as 1:n. I mean that sample(c(3), size=1) will give a random sample from 1:3. Do you mind leaving the conditional in there for R?

I just read about the workarounds on S/O, and none of them are very satisfying. Pretty nasty :(

stefanv commented 1 year ago

Another comment on the ships problem: I'd make clear the relationship between "ship" and "bucket". It wasn't immediately clear to me why you would start with three buckets, I suppose since I immediately went to code in my head where I can call buckets ships :D

stefanv commented 1 year ago

I.e.,

  1. Get three buckets each with two pieces of paper: one with "Gold", "Gold", one with "Gold", "Silver", and one with "Silver", "Silver".

  2. Get three buckets, one to represent each ship. In each we place two pieces of paper to signify its treasure chests: "Gold", "Gold" for the first, "Gold", "Silver" for the second, and "Silver", "Silver" for the last.

stefanv commented 1 year ago

Trying to figure out the warning around "shuffle the buckets", it feels to me as if the abstraction introduced when translating the original problem (ships, chests) to a physical realization (buckets, pieces of paper) introduces a layer of confusion.

Now, when we model the buckets and pieces of paper, it is very easy to not think about shuffling, because we are trying to implement our imposed model of the problem.

If we wrote about the original, we could have said: "We approach the ship. We do not know whether from the bow or the stern, so when choosing a chest we pick one randomly."

Once we're in the "buckets" realm, the explanation now becomes "Remember that, when picking from a bucket, we do not know which piece of paper we pick; so we have to pick randomly." --- which really has nothing to do with the problem, it could easily be coincidence that randomly picking paper from a bucket is equivalent to randomly approaching a ship.

One place we could explain this is where we make the analogy to buckets, but there it's awkward: we have to explain why placing pieces of paper in a bucket is equivalent to approaching a ship randomly.

I suggest we forego the buckets interposer, and stick to the story given as directly as we can, having objects in the computer's memory represent each physical object as closely as we can. I.e., "A ship with two treasure chests" is an array with two values (and the array also has a front and a rear). Approaching the ship from bow or stern becomes part of the story; we don't know, so we may need to flip the array or not (can be achieved with a shuffle, or perhaps with np.fliplr if rnd.integers(2) == 1 or rng.choice(['bow', 'stern']) == 'stern'.

matthew-brett commented 1 year ago

The problem with the ships and the buckets is that this was a deliberate example where we first describe the physical procedure for simulation, then implement a version of the physical procedure in code. I like the physical procedure - I think people will find those easier to think about - and that will make it easier to see what the code is doing.

We can (I will) make the analogy more obvious in the text though.

stefanv commented 1 year ago

We can (I will) make the analogy more obvious in the text though.

Perhaps a general notion would be to distinguish between ordered and unordered objects. A bucket and a ship are both unordered.

matthew-brett commented 1 year ago

I've unpacked the ship / bucket thing a bit more. I've renamed "trunk" to "chest" - I think it's more idiomatic - at least in English English.

matthew-brett commented 1 year ago

Merged your pull-request, done a little more editing - I think it's ready now - what do you think?

matthew-brett commented 1 year ago

Now about the current "No gold in chest 1, chest 2 never opened"?