I'll use your names for each checkbox and answer these here. Then, I'll send another comment with some structural changes.
[ ] Better understanding why we do not subset to `mothers == TRUE` when predicting in `make_subgroup_results`
That was an error; thanks for catching it. I substantially revised the code to avoid problems like this. In the revision, I make one function that estimates both the aggregate and the age-specific results, so that we cannot have a problem like this again. In that call, we restrict to mothers.
[ ] Explicitly filtering out missing wage from analytic sample
I changed the sample restriction process. Now, the `READ_RAW_DATA` part ends by producing two files: `to_fit` and `to_predict`. I think this reduces the risk of confusion.
[ ] Normalizing weights in main estimation versus cross-validation
We should normalize everywhere. I believe this only affects the GAM, so we are normalizing by `weights / mean(weights)`, per the mgcv documentation (sketch after this list).
[ ] Normalizing weights in main estimation versus cross-validation
Thanks. My inclination is not to normalize within each fold, because that seems a bit more complex. The weights total nearly the same value in every fold because we assign folds systematically: first sorting by weight and then randomly permuting the fold assignments within each sequential set of 5 (see the sketch after this list).
[ ] Reasons to retain multiple persons per serial?
Since our aim is to draw inference about persons rather than households, we are keeping all persons in each household. This is also consistent with our use of `ASECWT`, which is a person-level weight rather than a household-level weight. This does not undermine inference because we aren't assuming a simple random sample; instead, we use the replicate weights to take account of the multistage clustered sampling design (see the variance sketch after this list). So I think multiple units per household is ok.
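On the GAM weights (from the normalization answer above): a minimal sketch, with placeholder variable names that are not necessarily the repo's:

```r
# Dividing by the mean gives weights that average 1 and sum to the
# sample size, which is the scale the mgcv documentation assumes for
# survey-type weights. Variable names here are placeholders.
library(mgcv)
d$w_norm <- d$ASECWT / mean(d$ASECWT)
gam_fit  <- gam(log_wage ~ mother + s(age), data = d, weights = w_norm)
```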
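On the fold assignment: a minimal sketch (not the repo's code) of the systematic scheme described above:

```r
# Sort by weight, then randomly permute fold labels within each
# consecutive block of 5, so every fold draws one unit from each block
# of similarly-weighted units; fold weight totals are then nearly equal.
assign_folds <- function(d, weight_name = "ASECWT", n_folds = 5) {
  d <- d[order(d[[weight_name]]), ]          # rows sorted by weight
  n_blocks <- ceiling(nrow(d) / n_folds)
  labels <- unlist(lapply(seq_len(n_blocks), function(b) sample(n_folds)))
  d$fold <- labels[seq_len(nrow(d))]         # truncate last partial block
  d
}
```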
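On the replicate weights: a hypothetical helper showing the variance formula I believe applies, per the Census/IPUMS documentation for the ASEC replicate weights (successive difference replication with 160 replicates):

```r
# V(theta_hat) = (4 / 160) * sum_r (theta_hat_r - theta_hat)^2,
# where theta_hat_r re-computes the estimate under replicate weight r.
repwt_se <- function(estimate, replicate_estimates) {
  sqrt((4 / length(replicate_estimates)) *
         sum((replicate_estimates - estimate)^2))
}
```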
Those are the small tweaks. I'll write again with more structural things.
A more structural thing:
I think there is a tension between two goals:

1. a replication package that is easy to read and verify for this particular paper, and
2. general-purpose software designed for reuse in other applications.
In some ways, your revisions pushed toward (2). For example, you create objects to hold the names of the outcome and covariates, and then build model formulas from those objects, like this:
```r
formula(sprintf("%s ~ %s + %s",
                outcome_varname,
                treatment_varname,
                paste(control_vec, collapse = "+")))
```
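For instance, with hypothetical values (these particular names are illustrative, not from the repo):

```r
outcome_varname   <- "log_wage"
treatment_varname <- "mother"
control_vec       <- c("age", "race")

formula(sprintf("%s ~ %s + %s",
                outcome_varname,
                treatment_varname,
                paste(control_vec, collapse = "+")))
#> log_wage ~ mother + age + race
```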
Those choices seem appropriate for a software package (goal 2), but I think they get in the way of a readable replication package (goal 1). And I don't think we're going for the second goal anyhow, because there are still hard-coded things like a model object called `ols_ageFactor_fit`.
In my revision, I went back to the structure of my original code. In this approach, we only make something an argument to a function (e.g. `weight_name`) when it is something that we want to hand to the function with several different values in our application. I think this is preferable for goal (1).
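To illustrate the idea (variable and function names here are hypothetical, not the repo's): the formula is written out literally, and only the quantity that actually varies across calls in our application, the weight column, becomes an argument.

```r
# Replication-package style (goal 1): hard-code what never varies,
# parameterize only what does. All names below are placeholders.
fit_main_model <- function(d, weight_name) {
  lm(log_wage ~ mother + age + race,
     data = d, weights = d[[weight_name]])
}

# e.g. called once with the main weight and once per replicate weight:
# fit_main_model(to_fit, "ASECWT")
```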
Sounds good on all of the above --- that makes sense re: the person versus household distinction, and I agree that there isn't a strong need to normalize weights within a given fold of train/test. Closing this, and feel free to delete!
Bigger picture:
[ ] Better understanding why we do not subset to `mothers == TRUE` when predicting in `make_subgroup_results`
In `make_aggregate_results`, after the estimation and in line with the procedure as described in the text, the code filters to `mothers == TRUE` so that predictions are at observed covariate values for mothers when estimating the difference. In the equivalent process for `make_subgroup_results`, the code doesn't seem to filter to `mothers == TRUE`, so I think it generates predictions for each distinct age (so `nrow(df) == length(unique(age))`), but I think (and it could be an R-versions thing) it might end up predicting at covariate values for race etc. for a mix of mothers and non-mothers? In general, I was less sure about how to think about the age-subgroup predictions, so this might also be intentional. (A sketch of how I read the aggregate step is below.)
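To make concrete how I read the aggregate step (object and column names here are hypothetical):

```r
# Restrict to mothers, then predict at their observed covariate values
# under each value of the key predictor, and take the weighted
# difference. Names are placeholders, not the repo's.
mothers_df <- subset(d, mother == TRUE)
yhat_1 <- predict(fit, newdata = transform(mothers_df, mother = TRUE))
yhat_0 <- predict(fit, newdata = transform(mothers_df, mother = FALSE))
weighted.mean(yhat_1 - yhat_0, w = mothers_df$ASECWT)
```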
Smaller clarifications:
[ ] Explicitly filtering out missing wage from analytic sample
Right now, the first part does checks on inclusion, like non-missing wage, to take the sample down from 25160 to 18075. But then the flow is: `d_all` to `d` by removing those with missing logged wages, and then `d_all` rather than `d` in the chunk below re-includes those folks in order to check for all groups. I think the missing-wage people end up getting implicitly excluded through the `na.rm = TRUE` in the weighted mean and the listwise deletion in `lm`, but I thought it might be good in the updated code to include a missing-wage flag, in addition to the `has_support` flag, in the `filter_flags` vector and to use it in the initial restriction (haven't updated yet; a sketch of what I mean is after this list).
[ ] Normalizing weights in main estimation versus cross-validation
This is just clarifying, since I'm less clear on the rationale for normalizing weights other than wanting them to sum to the full sample size; I also wasn't sure whether there's a clear reason not to normalize weights within train and test in the `make_cv_results` call?
[ ] Reasons to retain multiple persons per serial?
I'm less familiar with the structure of IPUMS, but it seems like for the `support_data` there are fewer unique identifiers (`serial`, since all are the same month/year) than rows, and this indicates that there are observations with multiple persons in the household, as indicated via the `pernum` indicator. Is there a reason to retain multiple persons per household (what the code currently does), or is `pernum == 1` special in some way that justifies filtering to `pernum == 1`? (The quick check I ran is sketched after this list.)
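For the missing-wage flag, something like this is what I have in mind (flag and column names are assumptions, not necessarily the repo's):

```r
# Add a missing-wage flag next to has_support and apply every flag in
# one initial restriction, instead of relying on na.rm/listwise deletion.
d_all$has_wage <- !is.na(d_all$log_wage)
filter_flags   <- c("has_support", "has_wage")
d <- d_all[Reduce(`&`, d_all[filter_flags]), ]
```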
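And the quick check behind the `serial`/`pernum` question (hypothetical, base R):

```r
# Count rows per household identifier; values > 1 mean several persons
# from the same household are retained in support_data.
persons_per_hh <- table(support_data$serial)
sum(persons_per_hh > 1)   # households contributing multiple persons
max(support_data$pernum)  # pernum indexes persons within a serial
```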