ngreifer / cobalt

Covariate Balance Tables and Plots - An R package for assessing covariate balance
https://ngreifer.github.io/cobalt/
73 stars 11 forks source link

love.plot with aggregation for multiple matched samples from the same data #34

Closed black-benjamin closed 4 years ago

black-benjamin commented 4 years ago

Hello, I want to produce a love plot of the mean covariate balance adjustments from 4 matched samples of the same data (the number of observations are too large to match in their entirety)

I have tried this two ways so far, first by creating a single match object (using the matching package) that has all four samples included and matched seperately as clusters (matchby)

However when I try to use this object in a bal.tab or love.plot call I get an error: Error in names(object) <- nm : 'names' attribute [400000] must be the same length as the vector [1] In addition: Warning messages: 1: Deprecated 2: Deprecated

This is my script:

Matching call with exact matching on sample no.

SP_2010_GLM <- SP_2010_samples_combined %>% glm(formula = FCL_out ~ SFCL + ELC_Dist + Pop + Slope + Precip + Elevation + Cap_dist + Border_dist + Road_dist + Soil, family = binomial()) SP_2010_covs <- subset(SP_2010_samples_combined, select = -c(T_C, Temp, FCL_out, sample_no.))

X1 <- SP_2010_GLM$fitted #the propensity score Y1 <- SP_2010_samples_combined$FCL_out #the outcome Tr1 <- SP_2010_samples_combined$T_C #a vector of the treatment

SP_2010_combined_match <- Matchby(Y=Y1, Tr=Tr1, X=X1, by= SP_2010_samples_combined$sample_no., M=1, replace= TRUE, caliper = 0.5, Weight=1, ties = FALSE) summary(SP_2010_combined_match)

Call to bal.tab

SP_2010_combined_balance <- bal.tab(SP_2010_combined_match, treat = SP_2010_samples_combined$T_C, cluster = "sample_no.", distance = X1, covs = SP_2010_covs, un = TRUE, stats= c("mean.diffs", "ks.statistics")) `

Alternatively I have tried creating a vector of the sperate match objects after performing matching for each of the samples seperately and then introducing this through the 'weights' specification in love.plot as you suggested in another issue:

library(cobalt) library(purrr)

match_objects <- vector(mode = "list") match_objects$sample1 <- Sample1_match match_objects$sample2 <- Sample2_match match_objects$sample3 <- Sample3_match match_objects$sample4 <- Sample4_match

match_formula <- SP_2010_sample1 %>%formula(FCL_out ~ SFCL + ELC_Dist + Pop + Slope + Precip + Elevation + Cap_dist + Border_dist + Road_dist + Soil)

love.plot(match_formula, data = SP_2010_sample1, weights = map(match_objects, get.w))

However this call runs a long time without producing a result, where am I going wrong?

Apologies if my explanation is unclear I am still relatively new to R. Many thanks, Ben.

ngreifer commented 4 years ago

Hi Ben,

Thanks for letting me know about this and sorry you're having trouble. The released version of cobalt is not compatible with the output of Matchby(). You can "cheat" this by reassigning the output of a call to Matchby() the class "Match", using the following code:

class(SP_2010_combined_match) <- "Match"

That should work for now. I've made it so that future versions of cobalt are automatically compatible with Matchby objects, so thank you for pointing that out.

Regarding your second method: if you want one set of matching weights to apply to your entire combined sample, you need to supply the combined dataset to data and a single vector of the matching weights to weights. It looks like you gave weights a list of weights, one for each dataset, but supplied one dataset to data. Supplying a list of weights to weights is for when you want to apply multiple sets of weights to the same dataset, e.g., when comparing multiple weighting or matching methods to the same data. My understanding is that you wanted to supply one set of weights that represents all the matching weights for the combined sample. To do so, you would need weights = unlist(match_objects, get.w) and you would supply SP_2010_samples_combined to data. If you want to check balance just on SP_2010_sample1, which I assume is just one of the four samples split up by sample_no., you would supply SP_2010_sample1 to data and get.w(match_objects$sample1) to weights.

If love.plot() is taking a long time with the output, first try supplying it to bal.tab() using the same code in order to isolate the problem, since love.plot() calls bal.tab() on its inputs. With your sample size (400000) and small number of covariates, bal.tab() shouldn't face speed issues, so I'm not exactly sure what's going on there. Hopefully the other fixes work.

Also, something I noticed is that in your propensity score formula, you used the outcome variable as the response, when you should have the treatment variable there.

black-benjamin commented 4 years ago

Hi Noah, Thanks for your response and let me also say that what you've achieved with the cobalt package is fantastic and I love how extensive the vignettes are (especially for a relative beginner).

I tried the two different solutions you suggested and unfortunately I'm still not able to achieve what I want but thats probably because I didn't express what that is clearly enough in my first post. What I would ultimately like to produce is a love plot showing the absolute standardised mean differences and KS statistic values for each covariate that are the averages of those values obtained across my four samples. The samples being matching objects using the same matching approach but on four different random samples of the data. If it was possible I would simply create love plots for each of the samples and then add them together as ggplot objects and summarising as mean values, but unless I'm missing something this isn't possible.

Ideally I want to avoid the first approach of using 'matchby' and then calculating the balance statistics using the 'cluster' option in bal.tab as whilst this might work in the case of 4 samples (n= 400,000) my next step is to do this for 12 samples (n = 1,200,000) which becomes too slow for me to perform the matching.

So the second option using 'weights' is preferable but this is where I ran into more problems. Firstly I'm confused about which object to enter for the 'data=' argument because all of the matching results come from different samples of the data? I tried just entering the one of the matching objects for this with each sample included seperately within the weights argument as follows:

match_objects <- vector(mode = "list") match_objects$sample1 <- Sample1_match match_objects$sample2 <- Sample2_match match_objects$sample3 <- Sample3_match match_objects$sample4 <- Sample4_match

average_balance <- bal.tab(T_C ~ SFCL + ELC_Dist + Pop + Slope + Precip + Elevation + Cap_dist + Border_dist + Road_dist + Soil, data = Primary_2010_sample1, weights = get.w(match_objects$sample1, match_objects$sample2, match_objects$sample3, match_objects$sample4))

However the bal.tab object that is returned from this only contains the post-matching covariates values for sample1 and none of the others? After this I also tried it with the data argument equal to the combined dataset of the samples and the weights argument set as you suggested: = unlist(match.objects, get.w)

balance <- bal.tab(T_C ~ SFCL + ELC_Dist + Pop + Slope + Precip + Elevation + Cap_dist + Border_dist + Road_dist + Soil, data = SP_2010_samples_combined, weights = unlist(match_objects, get.w))

But I got the following message: ‘Error: All weights must be numeric.’

Apologies if there is in fact a very simple answer to this problem that I’m missing and I would appreciate any other advice you could give. Thanks, Ben.

ngreifer commented 4 years ago

Hi Ben,

I made a mistake in the code snippet I sent you, so sorry about that. Instead of unlist(match_objects, get.w), it should be unlist(lapply(match_objects, get.w)).

To get the statistics you want, you need to provide all the data and use the cluster option, where the cluster variable corresponds to which sample each unit belongs to. You also need to supply the weights for each of the four datasets. So, ideally, what you want is

It's important that these are all in the same order. The easiest way to do this would be to include the cluster variable and weights variable as variables in the dataset and then supply their names as strings to bal.tab() or love.plot(). Here's how you might be able to do this:

for (i in 1:4) {
    SP_2010_samples_combined$m.weights[SP_2010_samples_combined$sample_no. == i] <- get.w(match_objects[[i]])
}

balance <- bal.tab(T_C ~ SFCL + ELC_Dist + Pop + Slope + Precip + 
                      Elevation + Cap_dist + Border_dist + Road_dist + Soil, 
        data = SP_2010_samples_combined, weights = "m.weights", cluster = "sample_no.",
        stats = c("m", "ks"), un = TRUE)

love.plot(balance, stats = c("m", "ks"), which.cluster = .none, agg.fun = "mean")

Setting which.cluster = .none and agg.fun = "mean" will give you the mean balance statistics across the samples. This will be straightforward to change when using your 12 samples.

black-benjamin commented 4 years ago

Hi Noah,

Thanks again for another comprehensive response. Your second solution worked for me and I'm very grateful, you'll certainly feature in the acknowledgements of my thesis.