sdchasalow / Coursera-ReproducibleResearch

Stuff for Course 5 (Reproducible Research) in the Data Sciences Specialization at Coursera
0 stars 0 forks source link

Not fully random #1

Open maptracker opened 8 years ago

maptracker commented 8 years ago

AFAICT you're using the Latin square to help assure that a person does not hookup with themselves. But it also means you're not sampling the full space as defined by the constraints. The square constrains the set of hookups that each person gets assigned.

For example, if the column selection from a group of 12:

out <- out[ , sample.int(n - 1, size = p), drop = FALSE ]

... gives me out[, c(5,8,9)]:

      [,1] [,2] [,3]
 [1,]    6    9   10
 [2,]    7   10   11
 [3,]    8   11   12
 [4,]    9   12    1
 [5,]   10    1    2
 [6,]   11    2    3
 [7,]   12    3    4
 [8,]    1    4    5
 [9,]    2    5    6
[10,]    3    6    7
[11,]    4    7    8
[12,]    5    8    9

... Then the first row (to whomever it goes) will provide 6, 9 and 10. It should be possible for another person to also, by chance, be assigned those same 3 hookups; But they can't, due to the way the Latin square is constructed - it is enforcing a non-random relationship between the rows. This means you're not truly sampling the entire space.

I still have faith you can do better than this!

sdchasalow commented 8 years ago

I’m not so sure. I do agree that this might not sample the entire space. But, I think it is not too bad. Remember that I am assigning a random permutation of the names to the digits. ANY given person can be assigned “1”, and thus anybody might end up getting the row (6, 9, 10). In any case, for this particular application we don’t really care, do we? I know, from a purely intellectual standpoint it would be nice to have a truly uniform solution, in which all possible outcomes are equally likely.

It should be possible to enumerate all possible allocations for a smallish problem, and see if there are some which my algorithm cannot achieve. We even could do a few billion simulations to tabulate the frequencies. Don’t have time for that just now, though…

From: Charles Tilford [mailto:notifications@github.com] Sent: Tuesday, 16 February 2016 16:29 To: sdchasalow/Coursera-ReproducibleResearch Coursera-ReproducibleResearch@noreply.github.com Subject: [Coursera-ReproducibleResearch] Not fully random (#1)

AFAICT you're using the Latin square to help assure that a person does not hookup with themselves. But it also means you're not sampling the full space as defined by the constraints. The square constrains the set of hookups that each person gets assigned.

For example, if the column selection from a group of 12:

out <- out[ , sample.int(n - 1, size = p), drop = FALSE ]

... gives me out[, c(5,8,9)]:

  [,1] [,2] [,3]

[1,] 6 9 10

[2,] 7 10 11

[3,] 8 11 12

[4,] 9 12 1

[5,] 10 1 2

[6,] 11 2 3

[7,] 12 3 4

[8,] 1 4 5

[9,] 2 5 6

[10,] 3 6 7

[11,] 4 7 8

[12,] 5 8 9

... Then the first row (to whomever it goes) will provide 6, 9 and 10. It should be possible for another person to also, by chance, be assigned those same 3 hookups; But they can't, due to the way the Latin square is constructed - it is enforcing a non-random relationship between the rows. This means you're not truly sampling the entire space.

I still have faith you can do better than thishttps://gist.github.com/maptracker/f0ec01bed4d1c1583bf6!

— Reply to this email directly or view it on GitHubhttps://github.com/sdchasalow/Coursera-ReproducibleResearch/issues/1.


This message (including any attachments) may contain confidential, proprietary, privileged and/or private information. The information is intended to be for the use of the individual or entity designated above. If you are not the intended recipient of this message, please notify the sender immediately, and delete the message and any attachments. Any disclosure, reproduction, distribution or other use of this message or any attachments by an individual or entity other than the intended recipient is prohibited.

maptracker commented 8 years ago

I agree it's sufficient for our immediate needs. I was approaching this from the intellectual perspective, that I wanted a solution that preserved full randomization while also enforcing all row and column constraints. In situations where this involved experimental randomization I'd argue that you'd want to assure you sampled the full space as much as possible.

sdchasalow commented 8 years ago

Agreed. Prove I'm not. On Feb 17, 2016 8:13 AM, "Charles Tilford" notifications@github.com wrote:

I agree it's sufficient for our immediate needs. I was approaching this from the intellectual perspective, that I wanted a solution that preserved full randomization while also enforcing all row and column constraints. In situations where this involved experimental randomization I'd argue that you'd want to assure you sampled the full space as much as possible.

— Reply to this email directly or view it on GitHub https://github.com/sdchasalow/Coursera-ReproducibleResearch/issues/1#issuecomment-185198453 .

maptracker commented 8 years ago

Mm. My empirical toy counter-example just fell apart. I'm investigating bigger toys.

maptracker commented 8 years ago

Ok... Imagine a group of many (say 100) entries, and you want to assign 3 hookups to each. Say your first row (any row, actually) randomly selects c(i, i+1, i+2). It is then impossible for any other row to contain c(i, i+10, i+15), even though that's a valid outcome for most rows. Regardless of what i is, the fact that a row contains i+1 and i+2 means the columns were adjacent to one-another. Because of how you built your Latin square, that means that all other rows that have x will have the other values within x±2. So you can never produce a peer that has been assigned i and any value greater than 2 away from i. So you're not sampling the full space.

In fact, as the group size increases, I suspect you sample an increasingly smaller set of all possibilities, and end up excluding most valid random outcomes. I picked i+1 and i+2 as the reference row for mental simplicity, but the relationship holds for other possibilities; If you had started with c(i, i+10, i+30) then you wouldn't be able to find c(i, i+3, i+7), for example.

You can avoid this, I think, by randomizing the column order of your Latin square. But then you have the challenge of preventing self-assignment (easy) and multiple-assignment (harder).

sdchasalow commented 8 years ago

Not completely convinced by your argument; I think randomly allocating the names to the positive integers reduces the effect of the built-in constraint you describe. For example (with i = 1), I cannot have both (1, 2, 3) and (1, 11, 16). However, remember that order within a row doesn’t matter. And I could have both (1, 2, 3), and (4, 5, 1), which is the same as (1, 2, 3) and (1, 4, 5). Now, (1, 4, 5) might not LOOK like (1, 11, 16), but because of the random allocation, either (1, 4, 5) or (1, 11, 16) could end up mapping to (A, D, E).

However, I do notice one different constraint imposed by my algorithm but not required by the problem specification. My algorithm cannot allocate a given pair of names more than once. For example, with 5 names, I can never give both the 2nd and 3rd names (together) to more than one reviewer, even though that would be a valid allocation, e.g. (2, 3), (1, 5), (5, 4), (3, 2), (4, 1). In fact, this is a limitation of my particular latin-square algorithm; it cannot generate all possible latin squares. Here, for example, is a latin square it cannot generate:

1 2 3 4 5 2 1 5 3 4 3 5 4 1 2 4 3 2 5 1 5 4 1 2 3

I did know about this limitation. What I do NOT know: could a better latin square algorithm, which could generate all possible latin squares with equal probability, solve our current problem?

From: Charles Tilford [mailto:notifications@github.com] Sent: Thursday, 18 February 2016 09:43 To: sdchasalow/Coursera-ReproducibleResearch Coursera-ReproducibleResearch@noreply.github.com Cc: Scott D. Chasalow scott.chasalow@gmail.com Subject: Re: [Coursera-ReproducibleResearch] Not fully random (#1)

Ok... Imagine a group of many (say 100) entries, and you want to assign 3 hookups to each. Say your first row (any row, actually) randomly selects c(i, i+1, i+2). It is then impossible for any other row to contain c(i, i+10, i+15), even though that's a valid outcome for most rows. Regardless of what i is, the fact that a row contains i+1 and i+2 means the columns were adjacent to one-another. Because of how you built your Latin square, that means that all other rows that have x will have the other values within x±2. So you can never produce a peer that has been assigned i and any value greater than 2 away from i. So you're not sampling the full space.

In fact, as the group size increases, I suspect you sample an increasingly smaller set of all possibilities, and end up excluding most valid random outcomes. I picked i+1 and i+2 as the reference row for mental simplicity, but the relationship holds for other possibilities; If you had started with c(i, i+10, i+30) then you wouldn't be able to find c(i, i+3, i+7), for example.

You can avoid this, I think, by randomizing the column order of your Latin square. But then you have the challenge of preventing self-assignment (easy) and multiple-assignment (harder).

— Reply to this email directly or view it on GitHubhttps://github.com/sdchasalow/Coursera-ReproducibleResearch/issues/1#issuecomment-185750482.


This message (including any attachments) may contain confidential, proprietary, privileged and/or private information. The information is intended to be for the use of the individual or entity designated above. If you are not the intended recipient of this message, please notify the sender immediately, and delete the message and any attachments. Any disclosure, reproduction, distribution or other use of this message or any attachments by an individual or entity other than the intended recipient is prohibited.

maptracker commented 8 years ago

"I cannot have both (1, 2, 3) and (1, 11, 16). However, remember that order within a row doesn’t matter."

Still is a problem. Pick any row from your selected columns. Represent the row as c(x, x+delta1, x+delta2). For any other row, pick an arbitrary member y. The other two values then must be one of y±delta1, y±delta2 or y±(delta1-delta2), and not all combinations will be possible. Column-order-agnosticism does not help here; the fixed offsets between the columns constrains the possible row combinatorics. The rows have a strong correlation between them.

"could a better latin square algorithm, which could generate all possible latin squares with equal probability, solve our current problem?"

Yes, probably. The Latin square means you don't have to worry about multiple-assignment in a row, and assures that all people get p hookups. A random Latin square would still benefit from those "automatically-met" constraints. But you'd now need to do a bit more work, I think, to prevent self-assignment. Do you have a concise trick to then distribute row labels in such a case?