ilundberg / replication

Replication files for working papers

a_estimation_example #3

Closed rebeccajohnson88 closed 4 years ago

rebeccajohnson88 commented 4 years ago

Bigger picture:

In make_aggregate_results, after estimation (in line with the procedure described in the text), the code filters to mother == TRUE so that, when estimating the difference, predictions are at the observed covariate values for mothers:

to_predict <- d_case %>%
    group_by() %>%
    filter(mother)

In the equivalent process in make_subgroup_results, the code doesn't seem to filter to mother == TRUE. I think it generates predictions for each distinct age (so nrow(df) == length(unique(age))), but I think (and this could be an R-versions thing) it might end up predicting at covariate values for race etc. for a mix of mothers and non-mothers:

to_predict <- d_case %>%
    group_by(age) %>%
    filter(1:n() == 1) %>%
    group_by()


In general, I was less sure how to think about the age-subgroup predictions, so this might also be intentional.
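If the intent is predictions at mothers' observed covariate values within each age, one minimal fix would be to filter before deduplicating. A toy sketch (hypothetical columns standing in for d_case):

```r
library(dplyr)

# Toy stand-in for d_case: two ages, with mothers and non-mothers (hypothetical data)
d_case <- tibble(
  age    = c(25, 25, 30, 30),
  mother = c(TRUE, FALSE, TRUE, FALSE),
  race   = c("a", "b", "c", "d")
)

# Filtering to mothers first guarantees the one row kept per age is a mother
to_predict <- d_case %>%
  filter(mother) %>%
  group_by(age) %>%
  filter(1:n() == 1) %>%
  ungroup()
```

With filter(mother) omitted, the row kept per age depends only on row order, so it can belong to a non-mother.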

Smaller clarifications:

Right now, the first part runs inclusion checks (e.g., non-missing wage) that take the sample from 25,160 down to 18,075. But then the flow is:

  1. Subset from d_all to d by removing those with missing logged wages:
d <- d_all %>%
  ## Remove those with missing hourly wages. We estimate models without them
  filter(!is.na(ln_wage)) %>%
  mutate(num_with_wage = n())
  2. The right_join to d_all (rather than d) in the chunk below re-includes those folks in order to check support for all groups:
support_data <- d %>%
  group_by(age, educ, race, married) %>%
  mutate(has_support = n_distinct(mother) == 2) %>%
  filter(1:n() == 1) %>%
  select(age, educ, race, married, has_support) %>%
  right_join(d_all, by = c("age", "educ","race","married")) %>%
  mutate(has_support = ifelse(is.na(has_support),F,has_support))
  3. Then, the function uses support_data and filters only on has_support == TRUE.
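To make the re-inclusion point concrete, here's the same pattern on toy data (hypothetical values; simplified to a single grouping variable):

```r
library(dplyr)

# Toy d_all: one row has a missing wage and is dropped from d
d_all <- tibble(
  age     = c(25, 25, 30),
  mother  = c(TRUE, FALSE, TRUE),
  ln_wage = c(2.1, 2.5, NA)
)
d <- filter(d_all, !is.na(ln_wage))

support_data <- d %>%
  group_by(age) %>%
  mutate(has_support = n_distinct(mother) == 2) %>%
  filter(1:n() == 1) %>%
  select(age, has_support) %>%
  # right_join back to d_all re-includes the missing-wage row...
  right_join(d_all, by = "age") %>%
  # ...which gets has_support = NA, recoded to FALSE here
  mutate(has_support = ifelse(is.na(has_support), FALSE, has_support))
```

Note the missing-wage row (age 30) comes back with has_support = FALSE only because age 30 lacks both treatment values here; a missing-wage row in a supported cell would come back with has_support = TRUE, which is the filtering gap described below.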

I think the missing-wage people end up implicitly excluded through na.rm = TRUE in the weighted mean and listwise deletion in lm, but it might be cleaner in the updated code to include a missing-wage flag, in addition to the has_support flag, in the filter_flags vector and use it in the initial restriction (haven't updated yet):

d_case <- full_data[rowSums(full_data[ , filter_flags]) == length(filter_flags), ]

This is just a clarification: I'm less clear on the rationale for normalizing weights other than wanting them to sum to the full sample size, but is there a clear reason not to normalize weights within train and test in the make_cv_results call?
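For concreteness, within-fold normalization would look something like this (a hypothetical sketch, not the repo's code), rescaling weights so they sum to the number of rows separately in each fold:

```r
library(dplyr)

# Toy data with a train/test fold indicator and raw weights (hypothetical)
d <- tibble(
  fold   = c("train", "train", "test"),
  weight = c(2, 6, 5)
)

# Rescale so weights sum to n() within each fold
d_norm <- d %>%
  group_by(fold) %>%
  mutate(weight = weight / sum(weight) * n()) %>%
  ungroup()
```

Normalizing over the full sample instead leaves each fold's weights summing to something other than that fold's size, which is the ambiguity being asked about.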

I'm less familiar with the structure of IPUMS, but for support_data there seem to be fewer unique identifiers (serial, since all observations share the same month/year) than rows, which suggests there are households contributing multiple persons, as indicated by the pernum indicator. Is there a reason to retain multiple persons per household (what the code currently does), or is pernum == 1 special in some way that justifies filtering to pernum == 1?
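The diagnostic behind this question can be sketched as a quick check (hypothetical toy data standing in for support_data):

```r
library(dplyr)

# Toy stand-in: serial identifies the household, pernum the person within it
support_data <- tibble(
  serial = c(1, 1, 2),
  pernum = c(1, 2, 1)
)

# Fewer unique serials than rows implies multi-person households
n_households <- n_distinct(support_data$serial)
multi_person <- nrow(support_data) > n_households
```

If multi_person is TRUE, the choice between keeping all persons and filtering to pernum == 1 changes the effective sample.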

ilundberg commented 4 years ago

I'll use your names for each checkbox and answer these here. Then, I'll send another comment with some structural changes.

ilundberg commented 4 years ago

Some small tweaks. Then, I'll write again with more structural things.

ilundberg commented 4 years ago

A more structural thing:

I think there is a tension between

  1. Producing code that clearly reproduces our paper and which people can easily read to understand.
  2. Producing code that people can pick up like a software package and use for other problems.

In some ways, your revisions pushed toward (2). For example, you create objects to hold the names of the outcome and covariates, and build model formulas from those objects, like this: formula(sprintf("%s ~ %s + %s", outcome_varname, treatment_varname, paste(control_vec, collapse = "+"))). Those choices seem appropriate for a software package (goal 2), but I think they get in the way of a readable replication package (goal 1). I think we're not going for the second goal anyhow, because there are still hard-coded things like a model object called ols_ageFactor_fit.

In my revision, I went back to the structure of my original code. In this approach, we only make things arguments to a function (e.g. weight_name) when it is something that we want to hand to the function with several different values in our application. I think this seems preferable for goal (1).
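The two styles under discussion, sketched side by side (using the object names mentioned above; ln_wage/mother and the controls are illustrative):

```r
# Goal-2 style: formula assembled programmatically from name objects
outcome_varname   <- "ln_wage"
treatment_varname <- "mother"
control_vec       <- c("age", "educ", "race")

f_programmatic <- formula(
  sprintf("%s ~ %s + %s",
          outcome_varname, treatment_varname,
          paste(control_vec, collapse = " + "))
)

# Goal-1 style: the same model, hard-coded and readable at a glance
f_hardcoded <- ln_wage ~ mother + age + educ + race
```

Both produce the same model specification; the difference is purely whether a reader can see the model by glancing at the line, which is the readability trade-off described above.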

rebeccajohnson88 commented 4 years ago

Sounds good on all of the above --- that makes sense re: the person versus household distinction, and I agree that there isn't a strong need to normalize weights within a given fold of train/test. Closing this; feel free to delete!