sbstusa / spilloverdesign

A repository with code to produce and evaluate experimental designs to estimate causal effects in the presence of spillovers.
http://sbstusa.github.io/spilloverdesign

Improve choice of pairs #3

Closed jwbowers closed 8 years ago

jwbowers commented 8 years ago

This is the plan to improve our design.

Avoid overfull states

Try putting no more than 1/4 of the counties in each state in the treatment group, to avoid overfilling states.

Avoid cross-pair adjacency of treated to control

Choose the best pair as the one closest in size and largest in number of farmers. Then choose the next pair that is not adjacent but is closest in size and largest in farmers, and so on.
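
A minimal sketch of the greedy, adjacency-avoiding selection described above, in R. Everything here is hypothetical toy data: the function name `choosePairsGreedy`, the `pairs` and `adj` objects, the `score` column (any "higher is better" ranking), and `nCounties` are assumptions for illustration, not the code in this repository. It also assumes the adjacency list is symmetric.

```r
## Hypothetical candidate pairs within one state, already scored.
pairs <- data.frame(
  a     = c("c1", "c3", "c5", "c2"),
  b     = c("c2", "c4", "c6", "c6"),
  score = c(4, 3, 2, 1),              # placeholder ranking, higher = better
  stringsAsFactors = FALSE
)

## Hypothetical symmetric adjacency list: county -> neighboring counties.
adj <- list(c1 = "c3", c2 = character(0), c3 = "c1",
            c4 = character(0), c5 = character(0), c6 = character(0))

choosePairsGreedy <- function(pairs, adj, maxPairs) {
  pairs   <- pairs[order(-pairs$score), ]  # best candidates first
  chosen  <- pairs[0, ]
  used    <- character(0)                  # counties already in a chosen pair
  blocked <- character(0)                  # neighbors of chosen counties
  for (i in seq_len(nrow(pairs))) {
    if (nrow(chosen) >= maxPairs) break    # cap reached
    cand <- c(pairs$a[i], pairs$b[i])
    if (any(cand %in% c(used, blocked))) next  # overlaps or touches a chosen pair
    chosen  <- rbind(chosen, pairs[i, ])
    used    <- c(used, cand)
    blocked <- c(blocked, unlist(adj[cand], use.names = FALSE))
  }
  chosen
}

## One treated county per pair, so a 1/4 cap on treated counties
## translates to at most floor(nCounties / 4) pairs in the state.
nCounties <- 8                             # hypothetical state size
chosen <- choosePairsGreedy(pairs, adj, maxPairs = floor(nCounties / 4))
print(chosen)
```

In this toy example the second-ranked pair (c3, c4) is skipped because c3 borders c1, which is already in a chosen pair.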

nahiggins commented 8 years ago

...just an additional thought here:

could throw in a little check at each stage in the algorithm that looks at how big our total sample is, so we know when to stop.

could do something like:

(1) generate a "round" variable each time a treatment-control-pair is identified (so the first treatment-control-pair in AK, AL, etc. would all have a "round" = 1)

(2) continue the algorithm for some number of rounds, generating a map each time

(3) calculate the total number of expected letters to be sent out after each round (call it EN)

(4) stop the algorithm when EN > 150,000

That way we don't end up playing around too much at the end. And if we wanted to look at alternative samples, we would have an easy set of alternatives, where the first alternative is the total sample when EN first becomes > 150,000, the second alternative is the total sample minus the last round, etc.
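
A rough sketch of steps (1) to (4) in R, assuming we already have one row per chosen pair, ordered by rank within state. The data frame `chosen`, the `letters` column (expected letters per pair), and all the numbers are made up for illustration.

```r
## Hypothetical chosen pairs, in rank order within each state.
chosen <- data.frame(
  state   = c("AK", "AK", "AL", "AL", "AL"),
  letters = c(60000, 30000, 50000, 45000, 20000),  # made-up expected letters
  stringsAsFactors = FALSE
)

## (1) round = position of the pair within its state
chosen$round <- ave(seq_along(chosen$state), chosen$state, FUN = seq_along)

## (3) EN after each round = cumulative expected letters across rounds
rounds   <- sort(unique(chosen$round))
perRound <- sapply(rounds, function(r) sum(chosen$letters[chosen$round == r]))
EN       <- cumsum(perRound)

## (4) stop at the first round where EN exceeds the target
## (which(...)[1] is NA if the target is never reached; not handled here)
target    <- 150000
stopRound <- rounds[which(EN > target)[1]]

## Alternative samples: everything up to the stopping round, or one round fewer
sample1 <- chosen[chosen$round <= stopRound, ]
sample2 <- chosen[chosen$round <= stopRound - 1, ]
```

Dropping or adding a round is then just a change to the cutoff on `round`.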

Does that make sense?

N

On Mon, Jan 4, 2016 at 2:10 PM, Jake Bowers notifications@github.com wrote:

  • Pair choice to control cross-pair adjacency will be iterative and greedy.
    • Pair choice: 1000 versus 900 is better than 8 versus 7, so we need some way to favor pairs of larger counties. A proportional difference is better than an absolute one.
    • Pair choice: avoid treated-to-control adjacency both across state lines and within states.

— Reply to this email directly or view it on GitHub https://github.com/sbstusa/spilloverdesign/issues/3.

Nathaniel Higgins, Ph.D. Fellow, White House Social & Behavioral Sciences Team nathaniel.higgins@gsa.gov (202) 302-9146

jwbowers commented 8 years ago

I'll keep working on this tomorrow. I'm working to include this idea about keeping track of n. Right now, the algorithm is state specific under the idea that we do not want to have a particularly small number of counties in any one state. The total n process would probably require a more global approach: we can still pair within state, but we would do the sorting and choosing of pairs at a national level.

nahiggins commented 8 years ago

thanks Jake! What time do you think you'll be working? Just want to know if we can mesh up on this.

jakesbst commented 8 years ago

I'm working now. The recursive algorithm is a bit tricky.

Jake Bowers

Social and Behavioral Sciences Team

Office of Evaluation Sciences | General Services Administration jacob.bowers@gsa.gov amira.choueiki@gsa.gov | (202) 322-6714 | sbst.gov

nahiggins commented 8 years ago

ok. lmk how/if I can help. available all day to consult on this.

jwbowers commented 8 years ago

One thing to work on: deciding how best to rank pairs. Right now I'm ranking on sizeDiffs/avgN and breaking ties on avgN (larger first). Before, I was ranking on sizeDiffs and then breaking ties on avgN. Here is the code from the chooseBestPairs function (the commented-out line is the earlier ranking):

## pairsInOrder<-cbind(sizeDiffs,avgN)[order(sizeDiffs,-1*avgN,decreasing=FALSE),] ## sort by diff in size and then by avgN
pairsInOrder<-cbind(sizeDiffs,avgN,sizeDiffs/avgN)[order(sizeDiffs/avgN,-1*avgN,decreasing=FALSE),] ## sort by diff in size/avgN and then by avgN

Here are the pairs for AK.

            sizeDiffs  avgN sizeDiffs/avgN
02100-02180         0  23.0      0.0000000
02060-02188         0   6.0      0.0000000
02164-02270         0   4.0      0.0000000
02013-02275         1   8.5      0.1176471
02050-02280         2  16.0      0.1250000
02068-02198         4  27.0      0.1481481
02110-02150        21  96.5      0.2176166
02130-02220        11  49.5      0.2222222
02016-02185         2   6.0      0.3333333
02090-02122       256 550.0      0.4654545
02070-02105         9  16.5      0.5454545
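
To compare the two rankings side by side, here is a small self-contained sketch that re-enters the AK numbers above (in scrambled order) and sorts them both ways. The data frame `ak` and the column name `ratio` are just names chosen for this example.

```r
## AK pairs from the table above, entered in arbitrary order.
ak <- data.frame(
  pair = c("02090-02122", "02016-02185", "02100-02180", "02070-02105",
           "02060-02188", "02110-02150", "02164-02270", "02130-02220",
           "02013-02275", "02068-02198", "02050-02280"),
  sizeDiffs = c(256, 2, 0, 9, 0, 21, 0, 11, 1, 4, 2),
  avgN      = c(550, 6, 23, 16.5, 6, 96.5, 4, 49.5, 8.5, 27, 16),
  stringsAsFactors = FALSE
)
ak$ratio <- ak$sizeDiffs / ak$avgN

## Current rule: proportional difference, ties broken by larger avgN
byRatio <- ak[order(ak$ratio, -ak$avgN), ]

## Earlier rule: absolute difference, ties broken by larger avgN
byDiff <- ak[order(ak$sizeDiffs, -ak$avgN), ]

print(byRatio$pair)
print(byDiff$pair)
```

Under the absolute-difference rule the largest pair (02090-02122) ranks last of the 11. Under the proportional rule it moves up one place, and 02110-02150 moves from 10th to 7th.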

I'm attaching a csv file with all of the pairs (named .txt to force github to upload it) so you can play around if you have time. Or play with just the AK data above.

pairchars.txt https://github.com/sbstusa/spilloverdesign/files/78482/pairchars.txt

nahiggins commented 8 years ago

Two ideas for you to react to:

Idea 1:

Penalty function:

penalty <- function(x1, x2) {
  if ((x1 - x2) != 0) {
    log(x1) + log(x2) - log(abs(x1 - x2))
  } else {
    log(x1) + log(x2)
  }
}

Using this function, higher scores are better. As you can see, this favors large counties. Under this penalty function, a (500,1000) pairing scores slightly better (6.91) than a (300,500) pairing (6.62), but a (400,500) pairing scores better than both (7.60). A (25,25) pairing scores 6.44, i.e. worse than all of these, simply because the counties are so small. Given that it's hard to imagine detecting spillovers caused by a mailing to 25 people, this seems reasonable. Small counties become competitive pretty quickly if they are well matched, however: a (50,50) pairing scores 7.82, i.e. better than a (400,500) pairing. But a (400,500) pairing outscores a (45,50) pairing (6.11).

So we could fiddle with scale, but this seems to have the properties we are looking for.
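
For anyone who wants to check the numbers, here is a vectorized variant of the same function (using `ifelse` instead of `if`/`else`, which is my substitution so that it works on whole columns) applied to the example pairings quoted above.

```r
## Vectorized variant of the penalty function: same formula, but works on
## vectors so it can be applied to a whole column of pairs at once.
penalty <- function(x1, x2) {
  d <- abs(x1 - x2)
  ifelse(d != 0,
         log(x1) + log(x2) - log(d),
         log(x1) + log(x2))
}

## The example pairings from the discussion
x1 <- c(500, 300, 400, 25, 50, 45)
x2 <- c(1000, 500, 500, 25, 50, 50)
scores <- round(penalty(x1, x2), 2)
print(data.frame(x1, x2, scores))
# scores: 6.91 6.62 7.60 6.44 7.82 6.11
```

One property worth keeping in mind when fiddling with scale: because log(1) = 0, a pair that differs by exactly 1 gets the same score it would get if the difference term were simply dropped, so the score only starts to fall once the difference exceeds 1.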

Idea 2:

Quantiles:

Create size quantiles (could use deciles to keep the quantiles small). Make pairs randomly from within quantiles. Sample first from larger quantiles, then work down.
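
A possible sketch of the quantile idea in R, on simulated county sizes. The data, the seed, and the names `counties`, `decile`, and `quantilePairs` are all invented for illustration. Leftover counties in a decile with an odd count are simply dropped here, which is one choice among several.

```r
set.seed(20160105)

## Hypothetical counties with simulated numbers of farmers
counties <- data.frame(
  id = sprintf("c%02d", 1:40),
  n  = round(exp(rnorm(40, mean = 5, sd = 1))),
  stringsAsFactors = FALSE
)

## Size deciles; unique() guards against tied break points
breaks <- quantile(counties$n, probs = seq(0, 1, by = 0.1))
counties$decile <- cut(counties$n, breaks = unique(breaks),
                       include.lowest = TRUE, labels = FALSE)

## Pair randomly within each decile, working down from the largest decile
pairsByDecile <- lapply(sort(unique(counties$decile), decreasing = TRUE),
  function(d) {
    ids <- counties$id[counties$decile == d]
    ids <- ids[sample.int(length(ids))]   # random shuffle within the decile
    k   <- length(ids) %/% 2              # number of complete pairs
    data.frame(decile = rep(d, k),
               a = ids[seq_len(k)],
               b = ids[k + seq_len(k)],
               stringsAsFactors = FALSE)
  })
quantilePairs <- do.call(rbind, pairsByDecile)
print(head(quantilePairs))
```

Note that this puts randomness into the pairing step itself, unlike a deterministic ranking on fixed county characteristics.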

What do you think about these ideas?

N

jwbowers commented 8 years ago

I like the first idea. I'll implement it and we can see how the map looks.

As of now you can see the new assignment mechanism at https://sbstusa.github.io/spilloverdesign/saturationDesign.html

I think it looks pretty good. I ended up restricting any adjacency between pairs rather than just controls because I'd like to keep pair choice as a function of fixed characteristics of counties rather than add any new randomness into the design phase --- such randomness would make standard errors and tests more difficult later.

The only thing I haven't done yet is to restrict the cross border treated-to-control adjacency.

nahiggins commented 8 years ago

Way cool. Let me know if I can help any other way. Otherwise I'll just be ready to take the output and run w/ it as soon as you're done.

(What about the idea of using "rounds" so that we can easily add/eliminate observations? Just wondering if that was easy / if you thought it made sense)

jwbowers commented 8 years ago

I'm not sure that the rounds idea works with the code as written. I'm closing in on a solution to the cross-state problem now.

nahiggins commented 8 years ago

ok!

jwbowers commented 8 years ago

What do you think? Should we close this issue? Seems like the pairs at least have no more adjacency across or within states. Some of the pairs may differ a lot in size (in absolute terms). But we chose the ones that perform best on the penalty() function until we ran out of them (because of adjacency problems or because we hit the budget of no more than 1/4 of counties assigned to treatment).

You can see the final map here: https://sbstusa.github.io/spilloverdesign/saturationDesign.html#final-map

With 50% assigned to treatment, we get about 137363 farmers. If we assign at about .55, then we get about 150000.

nahiggins commented 8 years ago

Wow! That certainly looks quite good!

I can't see a better way, can you? I think we've done it! (well, you've done it, anyway!)

jwbowers commented 8 years ago

I think this was a joint effort even if I did more typing. The experimentDat.csv file is in the main github repository and I also just updated it on googleDrive (it is not a sheet yet, so not easy to get straight from R, but a double click would fix that depending on your workflow).

In general I don't put data files and binary files (pdf, png, jpg, doc, xls) into github because github has some size limits and basically just makes copies of those files rather than nicely just maintaining differences. This time, however, our files are small and csv is a text format, so I'm comfortable having them on github.

Ok. I'll close this issue now.

nahiggins commented 8 years ago

Thanks!

I'll just go ahead and download the new experimentDat.csv file from the "master" branch!

Quick question in case you know of a fast way to do this: how to limit a character variable to a certain number of characters? Seems like it should be really straightforward, but I haven't gotten it to work yet. I need to cut off some of the longer names to fit using a 30 character limit.

Regular expressions was my first thought on how to do this. Ideas?

N

nahiggins commented 8 years ago

doh! substr.
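
For the record, a minimal example of the `substr` approach with a 30-character limit. The names in `countyNames` are invented.

```r
## Hypothetical names; the second is longer than 30 characters
countyNames <- c("Short County",
                 "A Very Long Hypothetical County Name Indeed")

## Keep characters 1 through 30; shorter strings are returned unchanged
shortNames <- substr(countyNames, 1, 30)

print(shortNames)
print(nchar(shortNames))
```

No regular expression is needed, since `substr` is vectorized over the character vector.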
