kaz-yos / tableone

R package to create "Table 1", description of baseline characteristics with or without propensity score weighting
https://cran.r-project.org/web/packages/tableone/index.html
214 stars 40 forks source link

SMD calculation for multinomial variable #87

Closed ChristelSwift closed 2 years ago

ChristelSwift commented 2 years ago

hi there, i'm trying to understand how the reported SMD is calculated in the case of a multinomial variable.

The only detailed example i could find is here: http://support.sas.com/resources/papers/proceedings12/335-2012.pdf
It uses a Mahalanobis distance method so if we have a variable with k levels, and a treated vs control group, then:

image

where C and T are the vectors of proportions for the levels 2:k for control and for treated, and S is a (k-1) covariance matrix, where each element is calculated in this way:

image

so it seems to suggest we should remove the first level of the variable, but then wouldn't the result be different if we removed a different level?

The calculations can be done in matrix form using the probability vectors for the treated / control groups, so to illustrate the point, let's consider the lalonde dataset:

library(tidyverse)
library(tableone)
library(MatchIt) # to get the Lalonde dataset

table1 <- CreateTableOne(
  vars = "race",
  strata = "treat", 
  data = lalonde, 
  test = FALSE
  )

## include standardized mean difference (SMD)
print(table1, smd=TRUE)
image

=> using tableOne we get an SMD of 1.7 on race which has 3 levels.

If i implement the solution given in the SAS paper:

T_mat <- matrix(
  data = lalonde %>% 
    filter(treat == 1) %>% 
    group_by(race) %>% 
    summarise(n = n()) %>% 
    mutate( pc = n / sum(n)) %>% 
    # drop the first level as we only want categories in 2:k
    filter(race != levels(lalonde$race)[1]) %>% 
    pull(pc),
  nrow = 1,
  dimnames = list(NULL, levels(lalonde$race)[-1])
) 

C_mat <- matrix(
  data = lalonde %>% 
    filter(treat == 0) %>% 
    group_by(race) %>% 
    summarise(n = n()) %>% 
    mutate( pc = n / sum(n)) %>% 
    # drop the first level as we only want categories in 2:k
    filter(race != levels(lalonde$race)[1]) %>% 
    pull(pc),
  nrow = 1,
  dimnames = list(NULL, levels(lalonde$race)[-1])
) 

T_mat_prod <- t(T_mat)  %*% T_mat
C_mat_prod <- t(C_mat)  %*% C_mat

S_mat <- ( T_mat_prod + C_mat_prod ) / 2

# Then replace the diagonal elements:
S_diag <- diag((t(1 - T_mat) %*% T_mat +  t(1 - C_mat) %*% C_mat) / 2)
diag(S_mat) <- S_diag

smd <- sqrt((T_mat - C_mat) %*% solve(S_mat) %*% t(T_mat - C_mat) )
smd

This gives a smd of 1.45 which is different from the tableOne result. Also, the result changes depending on the level that is dropped. So i also tried not dropping any level, but this time i got a result of 4.03:


# using all levels of the variable: 

T_mat <- matrix(
  data = lalonde %>% 
    filter(treat == 1) %>% 
    group_by(race) %>% 
    summarise(n = n()) %>% 
    mutate( pc = n / sum(n)) %>% 
    pull(pc),
  nrow = 1,
  dimnames = list(NULL, levels(lalonde$race))
) 

C_mat <- matrix(
  data = lalonde %>% 
    filter(treat == 0) %>% 
    group_by(race) %>% 
    summarise(n = n()) %>% 
    mutate( pc = n / sum(n)) %>% 
    pull(pc),
  nrow = 1,
  dimnames = list(NULL, levels(lalonde$race))
) 

T_mat_prod <- t(T_mat)  %*% T_mat
C_mat_prod <- t(C_mat)  %*% C_mat

S_mat <- ( T_mat_prod + C_mat_prod ) / 2
S_diag <- diag(( t(1 - T_mat) %*% T_mat +  t(1 - C_mat) %*% C_mat ) / 2)
diag(S_mat) <- S_diag

smd <- sqrt((T_mat - C_mat) %*% solve(S_mat) %*% t(T_mat - C_mat) )
smd # = 4.03

which again is different from the result given by tableOne...

Would you be able to explain what formula is implemented in tableOne?

many thanks!

kaz-yos commented 2 years ago

The code is in https://github.com/kaz-yos/tableone/blob/master/R/modules-smd.R

The implementation is based on Yang and Dalton. You may be right about this could be ordering sensitive.

ChristelSwift commented 2 years ago

i did see the code but found it hard to follow... also there is a section with a double for loop, with the comment "room for improvement". I'm wondering if it wouldn't be faster using a matrix multiplication approach as described above instead of a double for loop?

kaz-yos commented 2 years ago

If I remember correctly the technical paper has an error in the off-diagonal part: https://github.com/kaz-yos/tableone/blob/master/R/modules-smd.R#L28

ChristelSwift commented 2 years ago

ha yes!! that's what it was. Now i'm matching the tableOne result. Thank you! Also, with this correction, i get the same result whichever level of the multinomial i drop which is reassuring :)