joshuaschwab / ltmle

Longitudinal Targeted Maximum Likelihood Estimation package
http://joshuaschwab.github.io/ltmle/

Issues with binary outcomes using SuperLearner #20

Open philipclare opened 5 years ago

philipclare commented 5 years ago

Hi,

I'm having an issue with binary outcomes. From the source, and from discussions elsewhere, it appears that in order to handle continuous outcomes, the program transforms them into the range 0 < x < 1 and then uses a quasibinomial analysis. However, this appears to cause issues with some SuperLearner wrappers, even when the variables involved are truly binary. For example, when using the 'SL.ranger' wrapper to classify using random forests, an error is returned. I may be misunderstanding something, but I think this could perhaps be fixed by using a quasibinomial family for transformed continuous variables, but a binomial family for binary variables?

A replicable (hopefully) example:

cloudstor <- "C:/Users/z3312911/Cloudstor/"
.libPaths(paste0(cloudstor,"R Library"))

library("tmle")
library("ltmle")
library("haven")
library("SuperLearner")
library("simcausal")
library("gam")
library("ranger")

# Data creation using simcausal
D <- DAG.empty() + 
  node("ba", distr="rnorm", mean=0, sd = 1) +
  node("bb", distr="rnorm", mean=0, sd = 1) +
  node("bc", distr="rnorm", mean=0, sd = 1) +
  node("u", t=0, distr="rnorm", mean=0, sd = 1) +
  node("l", t=0, distr="rbern", prob=plogis(-2 + 1.5*u[t] + 0.1*ba - 0.1*bb + 0.1*bc)) + 
  node("a", t=0, distr="rbern", prob=plogis(-2 + 1.0*l[t] + 0.2*ba - 0.2*bb + 0.2*bc)) +
  node("u", t=1:4, distr="rnorm", mean=0.7*u[t-1], sd = 1) +
  node("l", t=1:4, distr="rbern", prob=plogis(-2 + 1.0*a[t-1] + 2.0*l[t-1] + 1.5*u[t] + 0.1*ba - 0.1*bb + 0.1*bc)) +
  node("a", t=1:4, distr="rbern", prob=plogis(-2 + 2.0*a[t-1] + 1.0*l[t] + 0.2*ba - 0.2*bb + 0.2*bc)) +
  node("y", t=4, distr="rbern", prob=plogis(-3 + 0.20*a[t-4] + 0.40*a[t-3] + 0.60*a[t-2] + 0.80*a[t-1] + 1.00*a[t]
                                        + 0.50*l[t-4] + 0.50*l[t-3] + 0.50*l[t-2] + 0.50*l[t-1] + 0.50*l[t]
                                        + 0.50*u[t-4] + 0.50*u[t-3] + 0.50*u[t-2] + 0.50*u[t-1] + 0.50*u[t]
                                        + 0.2*ba - 0.2*bb + 0.2*bc))
  # Set this causal structure, defining all 'u' variables as latent (so they will not be included in the data)
  D <- set.DAG(D, latent.v = c("u_0","u_1","u_2","u_3","u_4"))
  # Create simulated dataset
  data <- sim(D,n=500)

  # LTMLE estimation with repeated observations of exposure but a single outcome comparing exposure at all time points to no exposure
  rltmle1 <- ltmle(data,Anodes=c("a_0","a_1","a_2","a_3","a_4"),
             Lnodes=c("l_1","l_2","l_3","l_4"),
             Ynodes="y_4",
             survivalOutcome=FALSE,
             abar=list(c(1,1,1,1,1),c(0,0,0,0,0)),
             SL.library=c("SL.glm","SL.glm.interaction","SL.gam","SL.ranger"))

Thanks!

joshuaschwab commented 5 years ago

Hi philipclare, I believe ltmle is working as expected and you can use the result.

> summary(rltmle1)
Estimator:  tmle 
Call:
ltmle(data = data, Anodes = c("a_0", "a_1", "a_2", "a_3", "a_4"), 
    Lnodes = c("l_1", "l_2", "l_3", "l_4"), Ynodes = "y_4", survivalOutcome = FALSE, 
    abar = list(c(1, 1, 1, 1, 1), c(0, 0, 0, 0, 0)), SL.library = c("SL.glm", 
        "SL.glm.interaction", "SL.gam", "SL.ranger"))

Treatment Estimate:
   Parameter Estimate:  0.75259 
    Estimated Std Err:  0.15614 
              p-value:  1.4352e-06 
    95% Conf Interval: (0.44657, 1) 
...

The issue is:

1. The initial SuperLearner Q regression is done on the binary Y - this works fine.
2. The subsequent Q regressions are done on Q_{k+1}, which is continuous in [0, 1] even though Y is binary - this causes an error in SL.ranger.

You end up seeing a lot of messages, but these are errors in SL.ranger that cause SuperLearner to give SL.ranger weight 0. SuperLearner still finishes because the other wrappers don't error, and ltmle also finishes.

You can check that SL.ranger has Coef 0 and Risk NA at all of the L nodes, but non-NA Risk and usually positive Coef at the A and Y nodes (where the regressand is binary):

> rltmle1$fit$Q
[[1]]
[[1]]$l_1
                             Risk      Coef
SL.glm_All             0.01384787 0.8163311
SL.glm.interaction_All 0.01424013 0.1836689
SL.gam_All             0.01391151 0.0000000
SL.ranger_All                  NA 0.0000000

[[1]]$l_2
                             Risk      Coef
SL.glm_All             0.01602560 0.7128688
SL.glm.interaction_All 0.01662685 0.2871312
SL.gam_All             0.01614384 0.0000000
SL.ranger_All                  NA 0.0000000

[[1]]$l_3
                             Risk Coef
SL.glm_All             0.01242406    1
SL.glm.interaction_All 0.01405787    0
SL.gam_All             0.01255452    0
SL.ranger_All                  NA    0

[[1]]$l_4
                              Risk      Coef
SL.glm_All             0.007104126 0.8113366
SL.glm.interaction_All 0.008230640 0.1886634
SL.gam_All             0.007162058 0.0000000
SL.ranger_All                   NA 0.0000000

[[1]]$y_4
                            Risk       Coef
SL.glm_All             0.1248136 0.79882078
SL.glm.interaction_All 0.2314382 0.01448145
SL.gam_All             0.1252425 0.14037659
SL.ranger_All          0.1334619 0.04632119

[[2]]
[[2]]$l_1
                             Risk      Coef
SL.glm_All             0.01983224 0.7186961
SL.glm.interaction_All 0.02036409 0.2813039
SL.gam_All             0.02003470 0.0000000
SL.ranger_All                  NA 0.0000000

[[2]]$l_2
                             Risk      Coef
SL.glm_All             0.02443332 0.2579116
SL.glm.interaction_All 0.02609210 0.1566819
SL.gam_All             0.02442018 0.5854065
SL.ranger_All                  NA 0.0000000

[[2]]$l_3
                             Risk Coef
SL.glm_All             0.01451075    1
SL.glm.interaction_All 0.01778840    0
SL.gam_All             0.01467757    0
SL.ranger_All                  NA    0

[[2]]$l_4
                              Risk      Coef
SL.glm_All             0.006327121 0.0000000
SL.glm.interaction_All 0.007821409 0.1613392
SL.gam_All             0.006265457 0.8386608
SL.ranger_All                   NA 0.0000000

[[2]]$y_4
                            Risk       Coef
SL.glm_All             0.1248136 0.79882078
SL.glm.interaction_All 0.2314382 0.01448145
SL.gam_All             0.1252425 0.14037659
SL.ranger_All          0.1334619 0.04632119

> rltmle1$fit$g
[[1]]
[[1]]$a_0
                            Risk       Coef
SL.glm_All             0.1150444 0.90374677
SL.glm.interaction_All 0.1191908 0.00000000
SL.gam_All             0.1154734 0.09625323
SL.ranger_All          0.1216147 0.00000000

[[1]]$a_1
                            Risk      Coef
SL.glm_All             0.1352237 0.4424547
SL.glm.interaction_All 0.1487119 0.0000000
SL.gam_All             0.1350058 0.2728304
SL.ranger_All          0.1379507 0.2847149

[[1]]$a_2
                            Risk      Coef
SL.glm_All             0.1481390 0.8445075
SL.glm.interaction_All 0.1717063 0.0111711
SL.gam_All             0.1498611 0.0000000
SL.ranger_All          0.1548507 0.1443214

[[1]]$a_3
                            Risk      Coef
SL.glm_All             0.1961866 0.8642027
SL.glm.interaction_All 0.2271313 0.1357973
SL.gam_All             0.1981329 0.0000000
SL.ranger_All          0.2118192 0.0000000

[[1]]$a_4
                            Risk      Coef
SL.glm_All             0.1764165 0.6134281
SL.glm.interaction_All 0.2333325 0.0000000
SL.gam_All             0.1787627 0.0000000
SL.ranger_All          0.1790368 0.3865719

[[2]]
[[2]]$a_0
                            Risk       Coef
SL.glm_All             0.1150444 0.90374677
SL.glm.interaction_All 0.1191908 0.00000000
SL.gam_All             0.1154734 0.09625323
SL.ranger_All          0.1216147 0.00000000

[[2]]$a_1
                            Risk      Coef
SL.glm_All             0.1352237 0.4424547
SL.glm.interaction_All 0.1487119 0.0000000
SL.gam_All             0.1350058 0.2728304
SL.ranger_All          0.1379507 0.2847149

[[2]]$a_2
                            Risk      Coef
SL.glm_All             0.1481390 0.8445075
SL.glm.interaction_All 0.1717063 0.0111711
SL.gam_All             0.1498611 0.0000000
SL.ranger_All          0.1548507 0.1443214

[[2]]$a_3
                            Risk      Coef
SL.glm_All             0.1961866 0.8642027
SL.glm.interaction_All 0.2271313 0.1357973
SL.gam_All             0.1981329 0.0000000
SL.ranger_All          0.2118192 0.0000000

[[2]]$a_4
                            Risk      Coef
SL.glm_All             0.1764165 0.6134281
SL.glm.interaction_All 0.2333325 0.0000000
SL.gam_All             0.1787627 0.0000000
SL.ranger_All          0.1790368 0.3865719

We could consider adding functionality to ltmle to allow the user to specify a separate SL.library for each regression or one for the initial Q regression and another for all subsequent Q regressions. Or we could try to detect which wrappers fail with non-binary regressands, but that might be tricky. Or we could just try to improve the messages (also tricky because these are mostly coming from SuperLearner) and/or documentation.

Josh

ck37 commented 5 years ago

Hi Josh,

Just a quick note that if the outcome variable is non-binary, e.g. a continuous range in [0, 1], then SuperLearner should be used with family = gaussian(). This will be necessary for almost any machine learning algorithm to work correctly.
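A minimal base-R sketch of the problem (hypothetical data, not from the example above): fitting a binomial-family model to a continuous [0, 1] outcome triggers a warning, while gaussian() fits the same regression without complaint:

```r
set.seed(1)
x <- rnorm(50)
y <- plogis(x)  # continuous in [0, 1], like the iterated Q regressands

# binomial() warns about non-integer successes on a continuous [0, 1] outcome
warn <- tryCatch({ glm(y ~ x, family = binomial()); NULL },
                 warning = function(w) conditionMessage(w))

# gaussian() fits without any warning
fit <- glm(y ~ x, family = gaussian())
```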

Thanks, Chris

joshuaschwab commented 5 years ago

Thanks Chris. How about we add an option SL.family with default NULL, where NULL means:

- SuperLearner(..., family = gaussian()) if the outcome is continuous in [0, 1]
- SuperLearner(..., family = binomial()) if the outcome is binary

but the user can also specify SL.family = gaussian() or SL.family = binomial() to force one or the other?
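The automatic default could look something like this (a sketch; `pick_family` is a hypothetical helper name, not part of ltmle or SuperLearner):

```r
# hypothetical helper: choose the SuperLearner family from the regressand
pick_family <- function(Y) {
  if (all(Y %in% c(0, 1, NA))) {
    stats::binomial()   # truly binary outcome
  } else {
    stats::gaussian()   # continuous in [0, 1], e.g. iterated Q regressands
  }
}

pick_family(c(0, 1, 1, 0))$family     # "binomial"
pick_family(c(0.2, 0.7, 0.5))$family  # "gaussian"
```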

Lauren-EylerDang commented 3 years ago

Hi Josh,

I was hoping to use the ltmle package with the highly adaptive lasso, but I ran into this same issue: HAL doesn't support family="quasibinomial", so it is dropped from the SL library, because the initial outcome variable is binary while the subsequent Q regressions are for a [0, 1] outcome. I ended up hard-coding the ltmle estimator using HAL with family="binomial" for the initial Q regression and HAL with family="gaussian" for the subsequent Q regressions. Do you think that is a reasonable way of dealing with this issue? I didn't see documentation for the SL.family = NULL option you described above (sorry if I missed it), but is there a way to do what you described with the ltmle package?

Thanks, Lauren

joshuaschwab commented 3 years ago

Hi Lauren, Sorry, I never followed up on implementing the SL.family option. Yes, I think your approach sounds reasonable. I don't know if HAL ever returns predictions less than 0 or greater than 1 when all Y are continuous in [0, 1]; if it can, you might need to bound the predictions. You could also write your own HAL wrapper that automatically selects the family, something like:

n <- 100
rexpit <- function(x) rbinom(n=length(x), size=1, prob=plogis(x))
W <- rnorm(n)
A1 <- rexpit(W)
L <- 0.3 * W + 0.2 * A1 + rnorm(n)
A2 <- rexpit(W + A1 + L)
Y <- rexpit(W - 0.6 * A1 + L - 0.8 * A2)
data <- data.frame(W, A1, L, A2, Y)

SL.hal9001.flexible <- function (Y, X, newX = NULL, max_degree = 3, 
                                 fit_type = c("glmnet", "lassi"), 
                                 n_folds = 10, use_min = TRUE, 
                                 family, # not actually used, but needed for compatibility with SuperLearner
                                 obsWeights = rep(1, length(Y)), ...) {
  # pick the family from the regressand: binomial for binary, gaussian otherwise
  if (all(Y %in% c(0, 1, NA))) {
    family <- stats::binomial()
  } else {
    family <- stats::gaussian()
  }
  print(family) # temporary, to check which family is selected
  SL.hal9001(Y, X, newX, max_degree, fit_type, n_folds, use_min, family, obsWeights, ...)
  # may need to bound predictions to [0, 1]?
}

ltmle(data, Anodes=c("A1", "A2"), Lnodes="L", Ynodes="Y", abar=c(0, 0), SL.library = "SL.hal9001.flexible", estimate.time = F)

If you try it, let me know how it works. This would be the same idea as the SL.family=NULL option. If it works well, I could add it to the next release. Josh

Lauren-EylerDang commented 3 years ago

Hi Josh,

Thank you for your advice! I tried the code you suggested and didn't get it to work for HAL, but maybe it would work with other learners. I appreciate your time and look forward to the next release!

-Lauren


joshuaschwab commented 3 years ago

I just pushed a fix to the SLfamily branch (I didn't end up making a SL.family option - it just happens internally). Will you try it and see if it works for you?
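For reference, a development branch like this can typically be installed straight from GitHub (a sketch, assuming the `remotes` package is installed; `ref` selects the branch):

```r
# install the SLfamily branch of ltmle from GitHub
remotes::install_github("joshuaschwab/ltmle", ref = "SLfamily")
```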

Lauren-EylerDang commented 3 years ago

Hi Josh,

I installed the SLfamily branch and confirmed that ltmle now works with ranger. HAL is throwing a different error, but since ltmle is functional with ranger, I suspect that is separate from the family issue. Thank you for working on this issue and happy holidays! -Lauren
