I'll use your names for each checkbox and answer these here. Then, I'll send another comment with some structural changes.
[ ] Better understanding why we do not subset to `mothers == TRUE` when predicting in `make_subgroup_results`
That was an error; thanks for catching it. I substantially revised the code to avoid problems like this. In the revision, I make one function that estimates both the aggregate and the age-specific results, so that we cannot have a problem like this again. In that call, we restrict to mothers.
[ ] Explicitly filtering out missing wage from analytic sample
I changed the sample restriction process. Now, the `READ_RAW_DATA` part ends by producing two files: `to_fit` and `to_predict`. I think this reduces the risk of confusion.
[ ] Normalizing weights in main estimation versus cross-validation
We should normalize everywhere. I believe this only affects the GAM, so we are normalizing by `weights / mean(weights)`, per the mgcv documentation (sketch after this list).
[ ] Normalizing weights in main estimation versus cross-validation
Thanks. My inclination is not to normalize within each fold, because that seems a bit more complex. The weights total nearly the same value in every fold because we assign folds systematically: first sorting by weight and then randomly permuting the fold assignments within each sequential set of 5 (see the sketch after this list).
[ ] Reasons to retain multiple persons per serial?
Since our aim is to draw inference about persons rather than households, we are keeping all persons in each household. This is also consistent with our use of `ASECWT`, which is a person-level weight rather than a household-level weight. This does not undermine inference because we aren't assuming a simple random sample; instead, we use the replicate weights to take account of the multistage clustered sampling design (see the variance sketch after this list). So I think multiple units per household is ok.
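On the GAM weights (from the normalization answer above): a minimal sketch, with placeholder variable names that are not necessarily the repo's:

```r
# Dividing by the mean gives weights that average 1 and sum to the
# sample size, which is the scale the mgcv documentation assumes for
# survey-type weights. Variable names here are placeholders.
library(mgcv)
d$w_norm <- d$ASECWT / mean(d$ASECWT)
gam_fit  <- gam(log_wage ~ mother + s(age), data = d, weights = w_norm)
```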
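On the fold assignment: a minimal sketch (not the repo's code) of the systematic scheme described above:

```r
# Sort by weight, then randomly permute fold labels within each
# consecutive block of 5, so every fold draws one unit from each block
# of similarly-weighted units; fold weight totals are then nearly equal.
assign_folds <- function(d, weight_name = "ASECWT", n_folds = 5) {
  d <- d[order(d[[weight_name]]), ]          # rows sorted by weight
  n_blocks <- ceiling(nrow(d) / n_folds)
  labels <- unlist(lapply(seq_len(n_blocks), function(b) sample(n_folds)))
  d$fold <- labels[seq_len(nrow(d))]         # truncate last partial block
  d
}
```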
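On the replicate weights: a hypothetical helper showing the variance formula I believe applies, per the Census/IPUMS documentation for the ASEC replicate weights (successive difference replication with 160 replicates):

```r
# V(theta_hat) = (4 / 160) * sum_r (theta_hat_r - theta_hat)^2,
# where theta_hat_r re-computes the estimate under replicate weight r.
repwt_se <- function(estimate, replicate_estimates) {
  sqrt((4 / length(replicate_estimates)) *
         sum((replicate_estimates - estimate)^2))
}
```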
Those are the small tweaks. I'll write again with more structural things.
A more structural thing:
I think there is a tension between two goals:

1. a replication package that is easy to read and verify for this particular paper, and
2. general-purpose software designed for reuse in other applications.
In some ways, your revisions pushed toward (2). For example, you create objects to hold the names of the outcome and covariates, and then build model formulas from those objects, like this:
```r
formula(sprintf("%s ~ %s + %s",
                outcome_varname,
                treatment_varname,
                paste(control_vec, collapse = "+")))
```
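For instance, with hypothetical values (these particular names are illustrative, not from the repo):

```r
outcome_varname   <- "log_wage"
treatment_varname <- "mother"
control_vec       <- c("age", "race")

formula(sprintf("%s ~ %s + %s",
                outcome_varname,
                treatment_varname,
                paste(control_vec, collapse = "+")))
#> log_wage ~ mother + age + race
```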
Those choices seem appropriate for a software package (goal 2), but I think they get in the way of a readable replication package (goal 1). And I don't think we're going for the second goal anyhow, because there are still hard-coded things like a model object called `ols_ageFactor_fit`.
In my revision, I went back to the structure of my original code. In this approach, we only make something an argument to a function (e.g. `weight_name`) when it is something that we want to hand to the function with several different values in our application. I think this is preferable for goal (1).
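To illustrate the idea (variable and function names here are hypothetical, not the repo's): the formula is written out literally, and only the quantity that actually varies across calls in our application, the weight column, becomes an argument.

```r
# Replication-package style (goal 1): hard-code what never varies,
# parameterize only what does. All names below are placeholders.
fit_main_model <- function(d, weight_name) {
  lm(log_wage ~ mother + age + race,
     data = d, weights = d[[weight_name]])
}

# e.g. called once with the main weight and once per replicate weight:
# fit_main_model(to_fit, "ASECWT")
```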
Sounds good on all of the above --- that makes sense re: the person versus household distinction, and I agree that there isn't a strong need to normalize weights within a given fold of train/test. Closing this, and feel free to delete!
Bigger picture:
[ ] Better understanding why we do not subset to `mothers == TRUE` when predicting in `make_subgroup_results`
In `make_aggregate_results`, after the estimation and in line with the procedure as described in the text, the code filters to `mothers == TRUE` so that predictions are at observed covariate values for mothers when estimating the difference. In the equivalent process for `make_subgroup_results`, the code doesn't seem to filter to `mothers == TRUE`, so I think it generates predictions for each distinct age (so `nrow(df) == length(unique(age))`), but I think (and it could be an R-versions thing) it might end up predicting at covariate values for race etc. for a mix of mothers and non-mothers? In general, I was less sure about how to think about the age-subgroup predictions, so this might also be intentional. (A sketch of how I read the aggregate step is below.)
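To make concrete how I read the aggregate step (object and column names here are hypothetical):

```r
# Restrict to mothers, then predict at their observed covariate values
# under each value of the key predictor, and take the weighted
# difference. Names are placeholders, not the repo's.
mothers_df <- subset(d, mother == TRUE)
yhat_1 <- predict(fit, newdata = transform(mothers_df, mother = TRUE))
yhat_0 <- predict(fit, newdata = transform(mothers_df, mother = FALSE))
weighted.mean(yhat_1 - yhat_0, w = mothers_df$ASECWT)
```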
Smaller clarifications:
[ ] Explicitly filtering out missing wage from analytic sample
Right now, the first part does checks on inclusion, like non-missing wage, to take the sample down from 25160 to 18075. But then the flow is: `d_all` to `d` by removing those with missing logged wages, and then `d_all` rather than `d` in the chunk below re-includes those folks in order to check for all groups. I think the missing-wage people end up getting implicitly excluded through the `na.rm = TRUE` in the weighted mean and the listwise deletion in `lm`, but I thought it might be good in the updated code to include a missing-wage flag, in addition to the `has_support` flag, in the `filter_flags` vector and to use it in the initial restriction (haven't updated yet; a sketch of what I mean is after this list).
[ ] Normalizing weights in main estimation versus cross-validation
This is just clarifying, since I'm less clear on the rationale for normalizing weights other than wanting them to sum to the full sample size; I also wasn't sure whether there's a clear reason not to normalize weights within train and test in the `make_cv_results` call?
[ ] Reasons to retain multiple persons per serial?
I'm less familiar with the structure of IPUMS, but it seems like for the `support_data` there are fewer unique identifiers (`serial`, since all are the same month/year) than rows, and this indicates that there are observations with multiple persons in the household, as indicated via the `pernum` indicator. Is there a reason to retain multiple persons per household (what the code currently does), or is `pernum == 1` special in some way that justifies filtering to `pernum == 1`? (The quick check I ran is sketched after this list.)
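For the missing-wage flag, something like this is what I have in mind (flag and column names are assumptions, not necessarily the repo's):

```r
# Add a missing-wage flag next to has_support and apply every flag in
# one initial restriction, instead of relying on na.rm/listwise deletion.
d_all$has_wage <- !is.na(d_all$log_wage)
filter_flags   <- c("has_support", "has_wage")
d <- d_all[Reduce(`&`, d_all[filter_flags]), ]
```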
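And the quick check behind the `serial`/`pernum` question (hypothetical, base R):

```r
# Count rows per household identifier; values > 1 mean several persons
# from the same household are retained in support_data.
persons_per_hh <- table(support_data$serial)
sum(persons_per_hh > 1)   # households contributing multiple persons
max(support_data$pernum)  # pernum indexes persons within a serial
```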