goranbrostrom / eha

eha
6 stars 1 forks source link

Tentative changes to plot.aftreg #4

Closed mclements closed 5 years ago

mclements commented 5 years ago

Göran,

Thank you for the very nice package - I particularly like that eha::aftreg can handle left truncated data.

I found some puzzling predictions from plot.aftreg when I compared estimates with rms::psm. I hope that the attached pull request addresses those issues. I have described this as tentative, as it seems to be a marked change and I did not follow all of the code logic.

I hope that this is helpful. If these changes are correct, then I have also written a predict.aftreg function that may be useful.

Sincerely, Mark Clements.

goranbrostrom commented 5 years ago

Mark,

thanks for your comment and suggested changes. In fact, the aftreg collection is in need of a general update, and plot.aftreg is especially neglected. I will look at your suggestions and see how they fit into my general plan.

mclements commented 5 years ago

Göran,

As mentioned earlier, I have also drafted a predict.aftreg function; this accepts a data-frame with covariate values and constructs the design matrix and the predictions (see below). The predictions include the gradient of the hazard, which could be useful for the MLE optimisation. I'm using these gradients for variance estimation for Markov models in the function rstpm2::markov_msm.

Possibly as a stupid question: why do you need to cluster on id for calculating the likelihood? I would have thought that the likelihood components would be independent.

Sincerely, Mark.

## License: GPL (>= 2)
## Author: Mark Clements
predict.aftreg <- function (object, type = c("haz", "cumhaz", "density", "surv", "gradh"), 
                            newdata, t=NULL, tmvar=NULL, na.action=na.pass, ...) 
{
    ## utility function
    mref <- function(m,i,j) i+(j-1)*nrow(m)
    if (!inherits(object, "aftreg")) 
        stop("Works only with 'aftreg' objects.")
    type <- match.arg(type)
    if (is.null(t)) {
        if (is.null(tmvar)) {
            lhs <- object$call$formula[[2]]
            expr <- if (length(lhs)==4) lhs[[3]] else lhs[[2]]
            t <- eval(expr, newdata, parent.frame())
            rm(lhs,expr)
        } else
            t <- newdata[[tmvar]]
    }
    if (length(t)<nrow(newdata)) t <- rep(t,length=nrow(newdata))
    ## BUG: add names to object$levels
    if (any(object$isF))
        names(object$levels) <- object$covars[object$isF]
    Terms <- object$terms
    if (!inherits(Terms, "terms")) 
        stop("invalid terms component of  object")
    strats <- attr(Terms, "specials")$strata
    intercept <- attr(Terms, "intercept")
    Terms <- delete.response(Terms)
    newframe <- stats::model.frame(Terms, data = newdata, 
                                   na.action = na.action,
                                   xlev = object$levels)
    if (length(strats)) {
        if (!("strata" %in% names(object)))
            stop("predictions with strata requires 'aftreg(..., x=TRUE)'")
        temp <- untangle.specials(Terms, "strata", 1)
        if (length(temp$vars) == 1) 
            strata.keep <- newframe[[temp$vars]]
        else strata.keep <- survival::strata(newframe[, temp$vars], shortlabel = TRUE)
        strata.keep <- factor(as.character(strata.keep),levels=levels(object$strata))
        strata <- as.numeric(strata.keep)
        ## remove strats from object$terms
        newTerms <- Terms[-temp$terms]
        attr(newTerms, "intercept") <- attr(Terms, "intercept")
        Terms <- newTerms
    } else {
        stopifnot(object$n.stata == 1)
        strata <- rep(1, nrow(newframe))
    }
    "%mv%" <- function(a,b) as.vector(a %*% b)
    newframe <- stats::model.frame(Terms, data = newdata, 
                                   na.action = na.action,
                                   xlev = object$levels)
    object$terms <- Terms
    ncov <- length(object$means)
    if (object$pfixed) {
        p <- object$shape[strata]
        lambda <- exp(object$coefficients[ncov + strata])
    } else  {
        p <- exp(object$coefficients[ncov + strata * 2])
        lambda <- exp(object$coefficients[ncov + strata * 2 - 1])
    }
    if (ncov) {
        x <- model.matrix(object, newframe)[,-1,drop=FALSE] # is the intercept always first?
        x <- sweep(x, 2, object$means)
        param.scale <- if (object$param=="lifeAcc") -1 else 1
        lambda <- lambda *
            exp(param.scale*(x %mv% object$coefficients[1:ncov]))
    } else {
        x <- NULL
    }
    xx <- t
    if (object$dist == "weibull") {
        dist <- "Weibull"
        haza <- hweibull
        Haza <- Hweibull
        Surviv <- stats::pweibull
        Dens <- stats::dweibull
    }
    else if (object$dist == "loglogistic") {
        dist <- "Loglogistic"
        haza <- hllogis
        Haza <- Hllogis
        Surviv <- pllogis
        Dens <- dllogis
    }
    else if (object$dist == "lognormal") {
        dist = "Lognormal"
        haza <- hlnorm
        Haza <- Hlnorm
        Surviv <- stats::plnorm
        Dens <- stats::dlnorm
    }
    else if (object$dist == "ev") {
        dist = "Extreme value"
        haza <- hEV
        Haza <- HEV
        Surviv <- pEV
        Dens <- dEV
    }
    else if (object$dist == "gompertz") {
        dist = "Gompertz"
        haza <- hgompertz
        Haza <- Hgompertz
        Surviv <- pgompertz
        Dens <- dgompertz
        ## canonical parameterisation
        p <- p/lambda
    }
    if (type=="haz")
        return(haza(xx, scale = lambda, shape = p))
    if (type=="cumhaz")
        return(Haza(xx, scale = lambda, shape = p))
    if (type=="density") {
        if (object$dist == "lognormal") {
            sdlog <- 1/p
            meanlog <- log(lambda)
            den <- stats::dlnorm(xx, meanlog, sdlog)
        }
        else 
            den <- Dens(xx, scale = lambda, shape = p)
        return(den)
    }
    if (type=="surv") {
        if (object$dist == "lognormal") {
            sdlog <- 1/p
            meanlog <- log(lambda)
            sur <- stats::plnorm(xx, meanlog, sdlog, lower.tail = FALSE)
        }
        else {
            sur <- Surviv(xx, scale = lambda, shape = p, 
                          lower.tail = FALSE)
        }
        return(sur)
    }
    if (type=="gradh") {
        mapper <- list(ev="extreme",
                       weibull="weibull",
                       loglogistic="loglogistic",
                       Lognormal="lognormal")
        if (object$dist %in% names(mapper)) {
            distname <- mapper[[object$dist]]
            dd <- survival::survreg.distributions[[distname]]
            if (is.null(dd$itrans)) {
                trans <- function(x) x
                itrans <- function(x) x
                dtrans <- function(x) 1
            }
            else {
                trans <- dd$trans
                itrans <- dd$itrans
                dtrans <- dd$dtrans
            }
            if (!is.null(dd$dist)) 
                dd <- survival::survreg.distributions[[dd$dist]]
            pred <- log(lambda)
            scale <- 1/p
            u <- (trans(t)-pred) / scale # check dimensions
            density <- dd$density(u, list()) # F, 1-F, f, f'/f, f''/f
            S <- density[,2]
            f <- density[,3]
            f.prime <- density[,4]*f
            grad.beta <- cbind(x,-1)*(f.prime*S+f^2)*dtrans(t)/(S*scale)^2
            grad.logscale <- (-f.prime*u*S/dtrans(t) -
                              f^2*u/dtrans(t) -
                                f*S/dtrans(t)) / (S*scale/dtrans(t))^2*scale
            out <- matrix(0,nrow(newdata),length(coef(object)))
            nc <- ncol(grad.beta)
            if (!is.null(x))
                out[,1:(nc-1)] <- grad.beta[,1:(nc-1)]
            out[mref(out,1:nrow(newdata),nc+strata*2-2)] <- grad.beta[,nc]
            out[mref(out,1:nrow(newdata),nc+strata*2-1)] <- -grad.logscale
            return(out)
        } else {
            ## gompertz
            h <- haza(xx, scale = lambda, shape = p)
            grad.beta <- cbind(x,-1)*h*xx/p
            grad.logscale <- h
            out <- matrix(0,nrow(newdata),length(coef(object)))
            nc <- ncol(grad.beta)
            if (!is.null(x))
                out[,1:(nc-1)] <- grad.beta*x
            out[mref(out,1:nrow(newdata),nc+strata*2-2)] <- grad.beta[,nc]
            out[mref(out,1:nrow(newdata),nc+strata*2-1)] <- -grad.logscale
            return(out)
        }
    }
}
goranbrostrom commented 5 years ago

A quick answer to your question about the necessity of the id variable: The likelihood contribution of an individual depends on her full covariate history, from time zero. For left truncated contributions, a piece from zero to the left truncation point is missing, so we have to make the assumption that the early covariate history is time-constant and equal to the value first observed. For time-varying covariate histories, it is necessary to keep pieces together so we can calculate the cumulative hazard function at any time value.

And thanks for the predict function! As soon as time permits, I will dive into all this.

mclements commented 5 years ago

Perhaps I am misinterpreting you: the log-likelihood for a left truncated and potentially right censored value between time $s$ and time $t$ for individual $i$ and exposure episode $j$ is $\delta{ij} \log(h{ij}(t))-\ints^t h{ij}(u) du$ for event indicator $delta{ij}$ and hazard $h{ij}$. For this formulation, there is no dependence on the exposure history prior to time $s$. I realise that the common formulation for the integral is $H_i(t)-H_i(s)$, which will depend on the exposure history, but this could be reformulated as $H{ij}(t)-H{ij}(s)$ with constant exposure prior to $s$ to get the same integral. Am I misinterpreting you?

Sincerely, Mark.

goranbrostrom commented 5 years ago

Mark, this is my first pull request, and I am not familiar with the correct procedure to handle it. So, after a quick scan of your suggestions, I take my chances and "Merge pull request" and see what happens!

I'll look for a suitable reference to the "left truncation business".

Thanks for your contributions!

Göran

goranbrostrom commented 5 years ago

By example: Look at the data set 'mort' in eha: id = 3 looks like this:

id enter exit event ses 3 0 13 0 upper 3 13 20 0 lower

He has ses = upper in (0, 13] and ses = lower in (13, 20] if id is given.

Without id, this is two individuals, the first with ses = upper in (0, 13], the second with ses = lower in (0, 20].

This makes a difference:

aftreg(Surv(enter, exit, event) ~ ses, data = mort, id = id)$coef sesupper log(scale) log(shape) -0.3648732 3.5867687 0.3385744

aftreg(Surv(enter, exit, event) ~ ses, data = mort)$coef sesupper log(scale) log(shape) -0.3478530 3.6027349 0.3309767

mclements commented 5 years ago

Göran,

But aftreg.c has the following code for the first of each id:

    a_time = ex->time0[rec] * exp(-alphambz); /* Without */
    b_time = ex->time[rec] * exp(-alphambz); /* shape!  */
    if (ex->ind[rec]){ 
        res1 += log(gamma) - alphambz + 
        (gamma - 1) * (log(ex->time[rec]) - alphambz) +
        log(h0(R_pow(b_time, gamma)));
    }
    res2 += S0(R_pow(a_time, gamma), log_p) - 
        S0(R_pow(b_time, gamma), log_p);

which seems to use (0,13] and (13,20] if there are no ids specified. For subsequent values with ids, you have:

        a_time = b_time;
        b_time = a_time + 
            (ex->time[rec] - ex->time0[rec]) * exp(-alphambz);

which assumes that the intervals are continuous/adjacent. However, for the mort dataset, there are two values that are not adjacent:

> mort[19:20,]
   id enter  exit event birthdate   ses
19 15 0.000  0.11     0  1800.385 lower
20 15 6.109 20.00     0  1800.385 lower
> mort[415:416,]
     id enter   exit event birthdate   ses
415 341 0.000  0.392     0  1808.494 lower
416 341 1.392 20.000     0  1808.494 lower

I would have thought that you could use the C code without using ids (or even use matrix code in R).

Sincerely, Mark.