ngreifer / cobalt

Covariate Balance Tables and Plots - An R package for assessing covariate balance
https://ngreifer.github.io/cobalt/

error with user defined distance #66

Closed adamssv closed 1 year ago

adamssv commented 1 year ago

Hi, I have been given a very large cohort matched on a previously estimated propensity score (PS) and cannot feasibly re-estimate and rematch. I have the estimated PS and the matched pairs identified.

I would like to be able to examine the balance of various variables according to other categorical variables ("cluster") in the data. (For reasons related to stratifying analysis by a non-treatment variable.)

So, I was hoping to make use of the cobalt tools after calling -matchit-, using the distance parameter to force a user-defined distance.

However, I receive an error with cobalt if I try to look at the balance according to any factor variable (i.e., a "cluster"). I can, however, get -matchit- and -bal.tab- to work fine with a cluster if I generate a dummy propensity score on the same data.

Note that the cluster variable is not included in the distance formula in either case, and including it makes no difference to the error.

The error message is, "Error: The argument to 'cluster' must be a vector of cluster membership or the (quoted) name of a variable in 'data' that contains cluster membership."

Here is example code that generates the error to show what I mean.

library(tibble)
library(MatchIt)
library(cobalt)

df <- tibble(x1=runif(n=100), x2=2*runif(n=100), ps=runif(n=100), trt = (runif(n=100)>0.5), c=(runif(n=100)>0.33))

#default distance matching -- this works 
mo <- matchit(formula= trt~ x1+x2, data= df)
bal.tab(mo) #works
bal.tab(mo, cluster="c") #works
love.plot(mo, cluster="c") #works

#user distance values
mo_user <-  matchit(formula= trt~ x1+x2, data=df,  distance=df$ps)
bal.tab(mo_user) #works
bal.tab(mo_user, cluster="c") #error

I would greatly appreciate any help, sorry if I am missing an obvious way to use -cobalt- in this situation.

Thanks

ngreifer commented 1 year ago

Hi Scott,

Very interesting observation! There are subtle reasons for this, due to how bal.tab() finds the original dataset in the matchit output object. When a propensity score is estimated, the dataset is stored in the propensity score model fit, which is stored in the matchit object. Otherwise, the dataset is not stored in the matchit object; only the variables that were used in the matching are stored, so they are all bal.tab() has access to. Basically, it doesn't know c lives in df because df is not contained in the matchit object anywhere. To tell bal.tab() where c is, you need to supply it with the original dataset using the data argument, i.e.,

bal.tab(mo_user, cluster="c", data = df)
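A rough way to see the difference described above is to inspect the two matchit objects side by side. This is only a sketch: it assumes the fitted propensity score model is exposed as the model component of the matchit object and that this component is absent when distance values are supplied directly, which may vary across MatchIt versions. The seed and the use of data.frame() instead of tibble() are choices made for this illustration.

```r
library(MatchIt)
library(cobalt)

set.seed(66)  # arbitrary seed, only for reproducibility of the toy data
df <- data.frame(x1 = runif(100), x2 = 2 * runif(100), ps = runif(100),
                 trt = runif(100) > 0.5, c = runif(100) > 0.33)

# Propensity score estimated by matchit(): a model fit is kept in the object
mo <- matchit(trt ~ x1 + x2, data = df)

# Propensity score supplied by the user: no model is fitted
mo_user <- matchit(trt ~ x1 + x2, data = df, distance = df$ps)

# Assumption: the fit lives in $model; compare the two objects
print(is.null(mo$model))
print(is.null(mo_user$model))

# Passing the original dataset lets bal.tab() locate the cluster variable
bt <- bal.tab(mo_user, cluster = "c", data = df)
print(bt)
```

Supplying data = df in both cases is harmless and removes the dependence on what the matchit object happens to store.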

You may wonder how match.data() knows where the original dataset is. It uses a hack that is not always accurate and is less likely to be accurate when using cobalt rather than MatchIt alone. Because the hack can fail, we recommend using the data argument with match.data(), too.
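The same recommendation for match.data() might look like the following sketch. The toy data mirror the example earlier in the thread; the seed and the use of data.frame() are assumptions made here for illustration.

```r
library(MatchIt)

set.seed(66)  # arbitrary seed, only for reproducibility of the toy data
df <- data.frame(x1 = runif(100), x2 = 2 * runif(100), ps = runif(100),
                 trt = runif(100) > 0.5, c = runif(100) > 0.33)

# Match on the user-supplied propensity score
mo_user <- matchit(trt ~ x1 + x2, data = df, distance = df$ps)

# Supply the original dataset explicitly rather than relying on
# match.data() to find it on its own
md <- match.data(mo_user, data = df)

# Variables that were not part of the matching, such as c, are carried along
print("c" %in% names(md))
print(table(cluster = md$c, treated = md$trt))
```

With the matched dataset in hand, analyses stratified by c can be run on md directly.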

Noah

adamssv commented 1 year ago

Hi Noah, thanks for your quick and kind response. That makes sense. Sorry if "use the data argument" was kind of obvious. Indeed, using the data argument explicitly works, with the toy example at least.
Hopefully this might help another user as well. Thanks again, Scott