s3alfisc / summclust

R module for cluster specific information (as in the Stata summclust module)
https://s3alfisc.github.io/summclust/
Other
6 stars 2 forks source link

Add extractor methods #8

Open s3alfisc opened 2 years ago

s3alfisc commented 2 years ago

For

SergeyVAlexeev commented 2 months ago

Title: Request for Cluster-by-Cluster Information Similar to Stata's Default Output

Body:

I'm currently working on a paper recommending the use of summclust for cluster analysis in clinical trials. While using your package, I've noticed a discrepancy between the cluster-by-cluster information output in R versus Stata. I would greatly appreciate assistance in aligning the R output with Stata's default presentation.

Current Situation

Using the following R code with summclust on scrambled_cherish.csv:

# Set a wider display
options(width = 150)

# Read the data
data <- read.csv("scrambled_cherish.csv")

# Create dummy variables
data$treat <- ifelse(data$intervention_factor == "Intervention", 1,
                     ifelse(data$intervention_factor == "Control", 0, NA))
data$elect <- as.numeric(data$adm_cat == "Elective")
data$female <- as.numeric(data$gender == "Female")
data$cog_dec <- ifelse(data$cognitive_impairment_admission == "1", 1,
                       ifelse(data$cognitive_impairment_admission == "0", 0, NA))
data$fun_dec <- ifelse(data$hospital_associated_functional_decline == "1", 1,
                       ifelse(data$hospital_associated_functional_decline == "0", 0, NA))

# Create factor variables
data$hospital <- as.factor(data$hospital)
data$ward <- as.factor(data$ward)

# Load necessary libraries
library(sandwich)
library(summclust)

# Specify the model formula
mf <- formula(
  delirium_new ~ treat + 
  age + 
  female + 
  cognitive_impairment_admission + 
  hospital_associated_functional_decline +
  cci_unadjusted_for_age +
  elect +
  factor(hospital)
)

# Testing clusters
fit <- lm(mf, data = data)
summ <- summclust(fit, params = ~ treat, cluster = ~ ward)
summary(summ)

# Create cluster by cluster information
cluster_by_cluster <- data.frame(
  Cluster = seq_along(summ$N_G),
  Ng = as.vector(summ$N_G),
  Leverage = summ$leverage_g,
  Partial_L = summ$partial_leverage[1,],
  Beta_no_g = summ$beta_jack[1,]
)
print(cluster_by_cluster, row.names = FALSE)

I get this output:

 Cluster Ng  Leverage  Partial_L    Beta_no_g
       1 65 1.5615371 0.13941055 -0.030189263
       2 37 1.2127338 0.10579421 -0.040323642
       3 38 1.0415643 0.09938913  0.021841058
       4 39 1.8487431 0.16080947  0.025654521
       5 59 1.0121208 0.09115682 -0.001107840
       6 65 0.9095753 0.09388891 -0.004918237
       7 44 1.7757687 0.16309955  0.005450797
       8 60 1.6379568 0.14645136  0.008303581

Desired Output

When running a similar analysis in Stata, I get:

  ward_cat |       Ng      Leverage     Partial L.  beta no g    
-----------+-----------------------------------------------------
         1 |       65      1.775769       0.136151   0.005451    
         2 |       37      1.012121       0.070870  -0.001108    
         3 |       38      0.909575       0.083975  -0.004918    
         4 |       39      1.041564       0.114481   0.021841    
         5 |       59      1.561537       0.144157  -0.030189    
         6 |       65      1.848743       0.178644   0.025655    
         7 |       44      1.212734       0.124528  -0.040324    
         8 |       60      1.637957       0.147194   0.008304    

Issue

The values are correct, but the order differs. For example, cluster 1's leverage in Stata is 1.775769, while in R it's 1.5615371 (which corresponds to cluster 5 in Stata).

Request

Could you please provide guidance or adjust the summclust package to allow extraction of cluster-by-cluster information in an order consistent with Stata's output? This would greatly enhance the package's utility for researchers transitioning between R and Stata or comparing results across platforms.

Thank you for your attention to this matter. Your assistance will significantly contribute to the adoption of summclust in clinical trial analysis.

Stata code for reference

// Saving output
log using Mudge_output.log, replace text

// Retrieving and preparing data
import delimited "scrambled_cherish.csv"

// Create the dummy variable for treated 
gen treat = .
replace treat = 1 if intervention_factor == "Intervention"
replace treat = 0 if intervention_factor == "Control"

// Create the dummy variable for elective admissions 
gen elect = (adm_cat == "Elective")

// Female dummy
gen female = (gender == "Female")

// Cognitive decline dummy
gen cog_dec = .
replace cog_dec = 1 if cognitive_impairment_admission == "1"
replace cog_dec = 0 if cognitive_impairment_admission == "0"

// Functional decline dummy
gen fun_dec = .
replace fun_dec = 1 if hospital_associated_functional_d == "1"
replace fun_dec = 0 if hospital_associated_functional_d == "0"

// Hospital categories
encode hospital, generate(hospital_cat)

// Ward categories
encode ward, generate(ward_cat)

// Testing clusters
summclust ///
delirium_new /// primary outcome 
treat /// treatment indicator
age ///
female ///
cog_dec ///
fun_dec ///
cci_unadjusted_for_age ///
elect ///
, fevar(hospital_cat) ///
cluster(ward_cat) ///
table addmeans jack rho(0.5) regtable // choice of tests

s3alfisc commented 2 months ago

Hi @SergeyVAlexeev, thanks for raising this - I'm travelling at the moment so haven't really had time to look into it; I'll do so once I'm back home by the end of the week!