Unexpected parameter estimates for some data generating processes

Hello, I have a data generating process where it seems like lmtp gives an incorrect effect estimate. For $t = 1, \dots, \tau$, generate exposure variables as $$A_t \sim \mathrm{Categorical}((0, 1, 2, 3), (0.25, 0.25, 0.25, 0.25)).$$ That is, at each timepoint $A_t$ takes one of the values $(0, 1, 2, 3)$ with equal probability. The outcome is given by $$Y \sim \mathrm{Normal}(A_1, 0.01^2).$$

The modified treatment policy is to shift the exposure variable up by 1, within the support of the data: $$d(a) = I[a < 3] * (a + 1) + 3 \cdot I[a = 3].$$

I would expect the true effect estimate to be $0.25 \cdot 1 + 0.25 \cdot 2 + 0.5 \cdot 3 = 2.25.$ However, lmtp gives an estimate of $3$ (reprex included below).

I suspect this is linked to how lmtp estimates the sequential regressions. The relevant code in the tmle.R file is:

m_shifted_train[jt & rt, t] <- bound(SL_predict(fit, shifted$train[jt & rt, vars]), 1e-05)
m_shifted_valid[jv & rv, t] <- bound(SL_predict(fit, shifted$valid[jv & rv, vars]), 1e-05)

Here, the shifted predictions are being generated using a dataset in which every exposure variable has had the modified treatment policy applied to it.

In the LMTP paper, Equation (2) gives the following for the identification result: $$m_t : (a_t, ht) \mapsto E[m{t+1}(A{t+1}^d, H{t+1}) | A_t = a_t, H_t = h_t]$$ and then the final parameter value is given by $$\theta = E[m_1(A_1^d, L_1)].$$ Taken together, this suggests that the shifted predictions should be based on a datasets in which only one exposure variable is shifted at a time (the conditional expectation is conditioned on $A_t = a_t^d, H_t = h_t$, where $h_t$ includes all prior exposures without any modified policy assigned). For the purposes of the reprex given below, this can be achieved in the code in an ad-hoc way by replacing those lines in tmle.R with

# Create a new shifted dataset where only the exposure at time t is intervened on
new_shifted <- natural
new_shifted$train[[paste0("A_", t)]] <- shifted$train[[paste0("A_", t)]]
new_shifted$valid[[paste0("A_", t)]] <- shifted$valid[[paste0("A_", t)]]

m_shifted_train[jt & rt, t] <- bound(SL_predict(fit, new_shifted$train[jt & rt, vars]), 1e-05)
m_shifted_valid[jv & rv, t] <- bound(SL_predict(fit, new_shifted$valid[jv & rv, vars]), 1e-05)

With this modification, in the example below lmtp gives the expected effect estimate.

Thanks!

Herb

reprex

library(lmtp)
#> Warning: package 'lmtp' was built under R version 4.2.3

simulate_data <- function(seed, N, tau, tstar = 1) {
  set.seed(seed)

  data <- data.frame(id = 1:N)

  for(t in 1:tau) {
    # Time-varying cofounder
    Lt <- paste0("L_", t)
    data[[Lt]] <- rbinom(N, 1, 0.5)

    # Treatment
    At <- paste0("A_", t)
    data[[At]] <- sample(0:3, size = N, replace = TRUE)
  }

  # Outcome
  mu <- data[[paste0("A_", tstar)]]
  data$Y <- rnorm(N, mu, 0.01)

  data
}

# Expected true effect
mean(c(1, 2, 3, 3))
#> [1] 2.25

tau <- 3
N <- 1e3
tstar <- 1

dat <- simulate_data(153, N, tau, tstar)

a <- paste0("A_", 1:tau)
w <- Map(\(t) paste0("L_", t), 1:tau)
y <- "Y"

learners <- c("SL.mean", "SL.glm", "SL.ranger")

policy <- \(data, trt) {
  ifelse(data[[trt]] == 3, 3, data[[trt]] + 1)
}

fit <- lmtp_tmle(
  data = dat,
  trt = a,
  time_vary = w,
  outcome = y,
  shift = policy,
  mtp = FALSE,
  learners_outcome = learners,
  learners_trt = learners,
  outcome_type = "continuous",
  folds = 5
)
#> Loading required package: nnls
#> Warning: package 'nnls' was built under R version 4.2.3
#> Loading required namespace: ranger

fit
#> LMTP Estimator: TMLE

#> Trt. Policy: (policy)
#> Population intervention estimate

#> Estimate: 3.0024

#> Std. error: 2e-04

#> 95% CI: (3.0019, 3.0028)

^{Created on 2024-02-28 with reprex v2.1.0}

Please include your R session info:

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] nnls_1.5   lmtp_1.3.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.12         pillar_1.9.0       
 [3] compiler_4.2.1      progressr_0.14.0   
 [5] iterators_1.0.14    tools_4.2.1        
 [7] digest_0.6.34       tibble_3.2.1       
 [9] evaluate_0.23       lifecycle_1.0.4    
[11] checkmate_2.3.1     lattice_0.20-45    
[13] pkgconfig_2.0.3     rlang_1.1.3        
[15] reprex_2.1.0        Matrix_1.4-1       
[17] foreach_1.5.2       origami_1.0.7      
[19] cli_3.6.2           rstudioapi_0.15.0  
[21] yaml_2.3.8          parallel_4.2.1     
[23] xfun_0.42           fastmap_1.1.1      
[25] ranger_0.16.0       SuperLearner_2.0-29
[27] withr_3.0.0         knitr_1.45         
[29] vctrs_0.6.5         generics_0.1.3     
[31] fs_1.6.3            globals_0.16.2     
[33] grid_4.2.1          glue_1.7.0         
[35] data.table_1.15.0   listenv_0.9.1      
[37] R6_2.5.1            processx_3.8.3     
[39] fansi_1.0.6         future.apply_1.11.1
[41] parallelly_1.37.0   rmarkdown_2.25     
[43] clipr_0.8.0         callr_3.7.5        
[45] magrittr_2.0.3      ps_1.7.6           
[47] htmltools_0.5.7     backports_1.4.1    
[49] codetools_0.2-18    splines_4.2.1      
[51] assertthat_0.2.1    abind_1.4-5        
[53] future_1.33.1       utf8_1.2.4         
[55] gam_1.22-3

nt-williams / lmtp

Unexpected parameter estimates for some data generating processes #130