ngreifer / cobalt

Covariate Balance Tables and Plots - An R package for assessing covariate balance
https://ngreifer.github.io/cobalt/

SMDs not showing for all levels #84

Open cegepi opened 6 months ago

cegepi commented 6 months ago

Thank you for the incredible package that is cobalt! This may be more of a question or enhancement request than a "bug" per se:

I have a dataset with a categorical grouping variable for the year of starting a drug, called index_year_cat. This factor variable has 3 possible levels: 2010-2011, 2012-2015, and 2016-2017. In the whole dataframe each level is populated and the bal.tab function works great, providing SMDs in the weighted population at all three levels. 👍

However, I am running into an issue when I do a stratified analysis where I begin by filtering/taking a subset of the original population. In this sub-population, only two levels of index_year_cat are populated: by chance, nobody is in the 2010-2011 level. When I run bal.tab it only shows the SMD in the weighted sub-population for 2016-2017, but not also for 2012-2015. It's perhaps seeing it as a dichotomous variable and treating it like other yes/no variables where we only want to report the yes responses (like presence or absence of a medication). Is there a way to override this so that SMDs are calculated for all present levels?

Thanks for your assistance!

Charley

ngreifer commented 6 months ago

Hi Charley,

Thank you for the kind words!

When you are doing the "stratified analysis", are you using the cluster or subset arguments to bal.tab() with the full dataset supplied, or are you just subsetting your dataset and supplying that to bal.tab()? If you do the former, all factor levels in the original dataset should appear. If you are subsetting your dataset and supplying it to bal.tab(), then it removes unobserved factor levels.

cegepi commented 6 months ago

Thank you for the quick reply, Noah! My workflow:

  1. Subset to the group of interest (i.e., groups of a comorbidity score)
  2. Calculate the propensity scores and IPTWs in the subset (comparing drug A and drug B)
  3. Supply the subset to bal.tab() to generate SMDs

To clarify, removing unobserved levels of a covariate, such as the first group (2010-2011), which has no individuals in the subset cohort, makes sense; the issue is that there are two remaining levels (2012-2015 and 2016-2017) which are both observed in the subset, yet bal.tab() is only showing the SMD for 2016-2017, not for the 2012-2015 level too.

Sounds like I should consider trying a cluster or subset workflow, but I am calculating propensity scores and weights after subsetting, so weights generated from the whole cohort dataset but applied to a subset would be incorrect to use... any suggestions?

Thanks again!

ngreifer commented 6 months ago

Ah, I understand, thank you for the clarification. You're saying because removing the unobserved level reduces the number of levels to 2, the functionality whereby a 2-level factor is reduced to just one balance statistic is causing the issue. I don't know if there is a straightforward way to address this without completely disabling the functionality of reporting a single balance statistic for a 2-level factor. I think the way the variables are processed involves losing the information about the third level, which would be necessary to distinguish between 2 levels of a 3-level factor with one level unobserved, which should produce 2 statistics, and 2 levels of a 2-level factor, which should be simplified into a single statistic.
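
To illustrate the ambiguity, here is a minimal base R sketch with made-up data. It is only an analogy and not cobalt's internal code: it shows that once the empty level is dropped, the factor looks the same as a genuine 2-level factor, for which a single indicator column carries all the information.

```r
# Hypothetical subset: a 3-level factor whose first level has no observations
index_year_cat <- factor(
  c("2012-2015", "2016-2017", "2016-2017", "2012-2015"),
  levels = c("2010-2011", "2012-2015", "2016-2017")
)

nlevels(index_year_cat)                # the empty level is still recorded here
dropped <- droplevels(index_year_cat)  # remove levels with no observations
nlevels(dropped)                       # now it looks like a true 2-level factor

# For a 2-level factor, one indicator column is enough because the other
# level is its complement; drop the intercept column to see what remains
dummies <- model.matrix(~ dropped)[, -1, drop = FALSE]
colnames(dummies)
```

After `droplevels()`, nothing in `dropped` records that a third level ever existed, which is the information that would be needed to treat the two cases differently.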

I would recommend using a subset or cluster approach. If you are using WeightIt to estimate the weights, you can just use the by argument, which automatically estimates the weights separately within each level of the supplied subgrouping variable. For a method that uses M-estimation where you want to retain the ability to use M-estimation standard errors (which are disabled when using by currently), you can interact the subgrouping variable with all variables in the model formula, e.g., treat ~ g * (x1 + x2 + x3), which does the same thing when using these methods (e.g., GLM, CBPS, entropy balancing, IPT). Then you can supply the grouping variable to cluster in bal.tab(), which should retain all factor levels as long as they are present in at least one subgroup.
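
A rough sketch of that workflow, assuming WeightIt and cobalt are installed. The simulated data and the names `d`, `g`, `x1`, and `treat` are made up for illustration; only `index_year_cat` comes from the question above. Treat it as a starting point, not a tested recipe for your data.

```r
library(WeightIt)
library(cobalt)

# Simulated stand-in data (hypothetical): g is the subgrouping variable
set.seed(123)
n <- 400
d <- data.frame(
  g  = factor(sample(c("low", "high"), n, replace = TRUE)),
  x1 = rnorm(n),
  index_year_cat = factor(sample(c("2010-2011", "2012-2015", "2016-2017"),
                                 n, replace = TRUE))
)
# Make the 2010-2011 level unobserved within the "high" subgroup
d$index_year_cat[d$g == "high" & d$index_year_cat == "2010-2011"] <- "2012-2015"
d$treat <- rbinom(n, 1, plogis(0.3 * d$x1))

# Estimate the weights separately within each level of g using `by`
W <- weightit(treat ~ x1 + index_year_cat, data = d,
              method = "glm", estimand = "ATE", by = "g")

# Assess balance within each subgroup; the full dataset's factor levels are
# available because the whole dataset, not a pre-subsetted one, is supplied
bal.tab(W, cluster = d$g)
```

If you need M-estimation standard errors, the same idea applies with the interacted formula in place of `by`, e.g., `treat ~ g * (x1 + index_year_cat)`, still supplying the grouping variable to `cluster`.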