mclements closed this 5 years ago
Mark,
thanks for your comment and suggested changes. In fact, the aftreg collection is in need of a general update, and plot.aftreg is especially neglected. I will look at your suggestions and see how they fit into my general plan.
Göran,
As mentioned earlier, I have also drafted a predict.aftreg function; this accepts a data frame with covariate values and constructs the design matrix and the predictions (see below). The predictions include the gradient of the hazard, which could be useful for the MLE optimisation. I'm using these gradients for variance estimation for Markov models in the function rstpm2::markov_msm.
Possibly a stupid question: why do you need to cluster on id for calculating the likelihood? I would have thought that the likelihood components would be independent.
Sincerely, Mark.
## License: GPL (>= 2)
## Author: Mark Clements
predict.aftreg <- function(object, type = c("haz", "cumhaz", "density", "surv", "gradh"),
                           newdata, t = NULL, tmvar = NULL, na.action = na.pass, ...)
{
    ## utility function: linear index of element (i, j) in matrix m
    mref <- function(m, i, j) i + (j - 1) * nrow(m)
    if (!inherits(object, "aftreg"))
        stop("Works only with 'aftreg' objects.")
    type <- match.arg(type)
    if (is.null(t)) {
        if (is.null(tmvar)) {
            ## take the exit time from the Surv() response in the model formula
            lhs <- object$call$formula[[2]]
            expr <- if (length(lhs) == 4) lhs[[3]] else lhs[[2]]
            t <- eval(expr, newdata, parent.frame())
            rm(lhs, expr)
        } else
            t <- newdata[[tmvar]]
    }
    if (length(t) < nrow(newdata)) t <- rep(t, length = nrow(newdata))
    ## BUG: add names to object$levels
    if (any(object$isF))
        names(object$levels) <- object$covars[object$isF]
    Terms <- object$terms
    if (!inherits(Terms, "terms"))
        stop("invalid terms component of object")
    strats <- attr(Terms, "specials")$strata
    intercept <- attr(Terms, "intercept")
    Terms <- delete.response(Terms)
    newframe <- stats::model.frame(Terms, data = newdata,
                                   na.action = na.action,
                                   xlev = object$levels)
    if (length(strats)) {
        if (!("strata" %in% names(object)))
            stop("predictions with strata requires 'aftreg(..., x=TRUE)'")
        temp <- untangle.specials(Terms, "strata", 1)
        if (length(temp$vars) == 1)
            strata.keep <- newframe[[temp$vars]]
        else strata.keep <- survival::strata(newframe[, temp$vars], shortlabel = TRUE)
        strata.keep <- factor(as.character(strata.keep), levels = levels(object$strata))
        strata <- as.numeric(strata.keep)
        ## remove strats from object$terms
        newTerms <- Terms[-temp$terms]
        attr(newTerms, "intercept") <- attr(Terms, "intercept")
        Terms <- newTerms
    } else {
        stopifnot(object$n.strata == 1)
        strata <- rep(1, nrow(newframe))
    }
"%mv%" <- function(a,b) as.vector(a %*% b)
newframe <- stats::model.frame(Terms, data = newdata,
na.action = na.action,
xlev = object$levels)
object$terms <- Terms
ncov <- length(object$means)
if (object$pfixed) {
p <- object$shape[strata]
lambda <- exp(object$coefficients[ncov + strata])
} else {
p <- exp(object$coefficients[ncov + strata * 2])
lambda <- exp(object$coefficients[ncov + strata * 2 - 1])
}
if (ncov) {
x <- model.matrix(object, newframe)[,-1,drop=FALSE] # is the intercept always first?
x <- sweep(x, 2, object$means)
param.scale <- if (object$param=="lifeAcc") -1 else 1
lambda <- lambda *
exp(param.scale*(x %mv% object$coefficients[1:ncov]))
} else {
x <- NULL
}
    xx <- t
    if (object$dist == "weibull") {
        dist <- "Weibull"
        haza <- hweibull
        Haza <- Hweibull
        Surviv <- stats::pweibull
        Dens <- stats::dweibull
    }
    else if (object$dist == "loglogistic") {
        dist <- "Loglogistic"
        haza <- hllogis
        Haza <- Hllogis
        Surviv <- pllogis
        Dens <- dllogis
    }
    else if (object$dist == "lognormal") {
        dist <- "Lognormal"
        haza <- hlnorm
        Haza <- Hlnorm
        Surviv <- stats::plnorm
        Dens <- stats::dlnorm
    }
    else if (object$dist == "ev") {
        dist <- "Extreme value"
        haza <- hEV
        Haza <- HEV
        Surviv <- pEV
        Dens <- dEV
    }
    else if (object$dist == "gompertz") {
        dist <- "Gompertz"
        haza <- hgompertz
        Haza <- Hgompertz
        Surviv <- pgompertz
        Dens <- dgompertz
        ## canonical parameterisation
        p <- p / lambda
    }
if (type=="haz")
return(haza(xx, scale = lambda, shape = p))
if (type=="cumhaz")
return(Haza(xx, scale = lambda, shape = p))
if (type=="density") {
if (object$dist == "lognormal") {
sdlog <- 1/p
meanlog <- log(lambda)
den <- stats::dlnorm(xx, meanlog, sdlog)
}
else
den <- Dens(xx, scale = lambda, shape = p)
return(den)
}
if (type=="surv") {
if (object$dist == "lognormal") {
sdlog <- 1/p
meanlog <- log(lambda)
sur <- stats::plnorm(xx, meanlog, sdlog, lower.tail = FALSE)
}
else {
sur <- Surviv(xx, scale = lambda, shape = p,
lower.tail = FALSE)
}
return(sur)
}
if (type=="gradh") {
mapper <- list(ev="extreme",
weibull="weibull",
loglogistic="loglogistic",
Lognormal="lognormal")
if (object$dist %in% names(mapper)) {
distname <- mapper[[object$dist]]
dd <- survival::survreg.distributions[[distname]]
if (is.null(dd$itrans)) {
trans <- function(x) x
itrans <- function(x) x
dtrans <- function(x) 1
}
else {
trans <- dd$trans
itrans <- dd$itrans
dtrans <- dd$dtrans
}
if (!is.null(dd$dist))
dd <- survival::survreg.distributions[[dd$dist]]
pred <- log(lambda)
scale <- 1/p
u <- (trans(t)-pred) / scale # check dimensions
density <- dd$density(u, list()) # F, 1-F, f, f'/f, f''/f
S <- density[,2]
f <- density[,3]
f.prime <- density[,4]*f
grad.beta <- cbind(x,-1)*(f.prime*S+f^2)*dtrans(t)/(S*scale)^2
grad.logscale <- (-f.prime*u*S/dtrans(t) -
f^2*u/dtrans(t) -
f*S/dtrans(t)) / (S*scale/dtrans(t))^2*scale
out <- matrix(0,nrow(newdata),length(coef(object)))
nc <- ncol(grad.beta)
if (!is.null(x))
out[,1:(nc-1)] <- grad.beta[,1:(nc-1)]
out[mref(out,1:nrow(newdata),nc+strata*2-2)] <- grad.beta[,nc]
out[mref(out,1:nrow(newdata),nc+strata*2-1)] <- -grad.logscale
return(out)
} else {
## gompertz
h <- haza(xx, scale = lambda, shape = p)
grad.beta <- cbind(x,-1)*h*xx/p
grad.logscale <- h
out <- matrix(0,nrow(newdata),length(coef(object)))
nc <- ncol(grad.beta)
if (!is.null(x))
out[,1:(nc-1)] <- grad.beta*x
out[mref(out,1:nrow(newdata),nc+strata*2-2)] <- grad.beta[,nc]
out[mref(out,1:nrow(newdata),nc+strata*2-1)] <- -grad.logscale
return(out)
}
}
}
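As a minimal usage sketch (assuming the drafted function above behaves as intended; the model call uses the mort data set discussed further down in this thread):

library(eha)      # provides aftreg() and the 'mort' data set
library(survival) # provides Surv() and survreg.distributions

## fit an AFT model with a time-varying covariate, as in the 'mort' example below
fit <- aftreg(Surv(enter, exit, event) ~ ses, data = mort, id = id)

## hazard and survival predictions at t = 10 for the two ses levels
nd <- data.frame(ses = c("lower", "upper"))
predict.aftreg(fit, type = "haz",  newdata = nd, t = 10)
predict.aftreg(fit, type = "surv", newdata = nd, t = 10)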
A quick answer to your question about the necessity of the id variable: The likelihood contribution of an individual depends on her full covariate history, from time zero. For left truncated contributions, a piece from zero to the left truncation point is missing, so we have to make the assumption that the early covariate history is time-constant and equal to the value first observed. For time-varying covariate histories, it is necessary to keep pieces together so we can calculate the cumulative hazard function at any time value.
And thanks for the predict function! As soon as time permits, I will dive into all this.
Perhaps I am misinterpreting you: the log-likelihood for a left truncated and potentially right censored value between time $s$ and time $t$ for individual $i$ and exposure episode $j$ is $\delta_{ij} \log(h_{ij}(t)) - \int_s^t h_{ij}(u)\, du$ for event indicator $\delta_{ij}$ and hazard $h_{ij}$. For this formulation, there is no dependence on the exposure history prior to time $s$. I realise that the common formulation for the integral is $H_i(t) - H_i(s)$, which will depend on the exposure history, but this could be reformulated as $H_{ij}(t) - H_{ij}(s)$ with constant exposure prior to $s$ to get the same integral. Am I misinterpreting you?
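Spelled out, with one exposure episode $j$ nested within individual $i$, entering at $s_{ij}$ and exiting at $t_{ij}$, the total log-likelihood I have in mind is just the sum of these episode-wise terms:

$$\ell = \sum_i \sum_j \Bigl[ \delta_{ij} \log h_{ij}(t_{ij}) - \bigl( H_{ij}(t_{ij}) - H_{ij}(s_{ij}) \bigr) \Bigr], \qquad H_{ij}(t) = \int_0^t h_{ij}(u)\, du.$$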
Sincerely, Mark.
Mark, this is my first pull request, and I am not familiar with the correct procedure to handle it. So, after a quick scan of your suggestions, I will take my chances, press "Merge pull request", and see what happens!
I'll look for a suitable reference to the "left truncation business".
Thanks for your contributions!
Göran
For example, look at the data set 'mort' in eha: id = 3 looks like this:
id enter exit event   ses
 3     0   13     0 upper
 3    13   20     0 lower
He has ses = upper in (0, 13] and ses = lower in (13, 20] if id is given.
Without id, this is two individuals, the first with ses = upper in (0, 13], the second with ses = lower in (0, 20].
This makes a difference:
> aftreg(Surv(enter, exit, event) ~ ses, data = mort, id = id)$coef
  sesupper log(scale) log(shape) 
-0.3648732  3.5867687  0.3385744 

> aftreg(Surv(enter, exit, event) ~ ses, data = mort)$coef
  sesupper log(scale) log(shape) 
-0.3478530  3.6027349  0.3309767 
Göran,
But aftreg.c has the following code for the first record of each id:
a_time = ex->time0[rec] * exp(-alphambz); /* Without */
b_time = ex->time[rec] * exp(-alphambz);  /* shape! */
if (ex->ind[rec]){
    res1 += log(gamma) - alphambz +
        (gamma - 1) * (log(ex->time[rec]) - alphambz) +
        log(h0(R_pow(b_time, gamma)));
}
res2 += S0(R_pow(a_time, gamma), log_p) -
    S0(R_pow(b_time, gamma), log_p);
which seems to use (0, 13] and (13, 20] if no ids are specified. For subsequent records when ids are given, you have:
a_time = b_time;
b_time = a_time +
    (ex->time[rec] - ex->time0[rec]) * exp(-alphambz);
which assumes that the intervals are continuous/adjacent. However, in the mort dataset there are two individuals whose records are not adjacent:
> mort[19:20,]
id enter exit event birthdate ses
19 15 0.000 0.11 0 1800.385 lower
20 15 6.109 20.00 0 1800.385 lower
> mort[415:416,]
id enter exit event birthdate ses
415 341 0.000 0.392 0 1808.494 lower
416 341 1.392 20.000 0 1808.494 lower
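For what it is worth, a small check along these lines (a sketch only, assuming the enter and exit times of adjacent records match exactly) lists every record that does not start where the previous record for the same id ended:

library(eha)  # for the 'mort' data set

mort2 <- mort[order(mort$id, mort$enter), ]
## exit time of the preceding record within each id (NA for an id's first record)
mort2$prev.exit <- ave(mort2$exit, mort2$id,
                       FUN = function(x) c(NA, head(x, -1)))
## records that do not start where the previous record for the same id ended
subset(mort2, !is.na(prev.exit) & enter != prev.exit)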
I would have thought that you could use the C code without using ids (or even use matrix code in R).
Sincerely, Mark.
Göran,
Thank you for the very nice package - I particularly like that eha::aftreg can handle left truncated data. I found some puzzling predictions from plot.aftreg when I compared estimates with rms::psm. I hope that the attached pull request addresses those issues. I have described this as tentative, as it seems to be a marked change and I did not follow all of the code logic.
I hope that this is helpful. If these changes are correct, then I have also written a predict.aftreg function that may be useful.
Sincerely, Mark Clements.