Knot placement for non-time spline terms

LasseHjort commented 4 years ago

Hi Mark

I discovered that if you use spline terms for covariate effects without specifying knot placement (i.e., only specifying df = x), the knot placement will be based on the distribution of the covariate only among those with events. I think the issue exists both in stpm2 and pstpm2. A quick example:

library(rstpm2)
data(colon)
colon$follow_up <- as.numeric(colon$exit - colon$dx)
colon$status <- as.numeric(colon$status %in% c("Dead: cancer", "Dead:other"))
fit <- stpm2(Surv(follow_up, status) ~ ns(age, df = 4), data = colon, df = 6)
fit@terms # 0% = 20, 25% = 63, 50% = 72, 75% = 79, 100% = 96 knot placement for age spline

#Compare with age quantiles
quantile(colon$age, probs = c(0, 0.25, 0.5, 0.75, 1))
#Compare with age quantiles of those with `events`
quantile(colon$age[colon$status == 1], probs = c(0, 0.25, 0.5, 0.75, 1))

I think this is due to the way the initial values are calculated. Of cause, users can just provide knots manually for spline terms other than the time spline, but it is definitely convenient to simply type df = x. Is this at all possible to fix?

mclements commented 4 years ago

[Lightly edited copy of my email reply]

Lasse,

Good to hear from you.

There are good reasons to specify the spline terms for the time effects using data for the events: for a rare event with administrative censoring, if all of the data were used to specify the splines, then most of the knots would be close to the end of follow-up rather than the events. Both Stata's and rstpm2's implementations determine the knots for times based on individuals with events.

Stata's stpm2 specifies non-time-based covariates from the full dataset, so that most of those spline-based effects will be based on the full dataset. Surprisingly, Stata's stpm2 also specifies the orthogonalisation matrix for the time-based splines using the full dataset, which I think is inconsistent with the knot determination.

The current implementations in rstpm2 specifies all of the design information based on the individuals with events -- that is, there is one model frame based on stats::glm or mgcv::gam for stpm2 or pstpm2, respectively, which is used to generate the design matrix for the full dataset. The alternative would require that two model frames would be need and then checks would be required whether the combined model matrix is full rank. Moreover, predictions would need to deal with the two model frames.

As a thought experiment, if those without events have very different covariate patterns to those with events, then we could argue for an average covariate pattern across the full dataset. However, for a rare event, the average covariate spline support may be poor for the events, which would lead to issues with estimation. This suggests that specifying the splines using the dataset restricted to the events may be more suitable for rare events.

Do you have a strong argument for using the two model frames?

Kindly, Mark.

LasseHjort commented 4 years ago

Hi Mark

Thanks for the explanation. Seems like a lot of additional work to add the extra model frames, when the user can simply define the knots beforehand. However, if for instance the mortality is much higher among elderly individuals in a dataset, the knots would primarily be located at higher ages. But we are still interested in also modelling the age effect correctly among the younger individuals, which could be problematic if there aren’t a sufficient number of knots in that part of the age scale. I think for most application, there will not be little difference between the results (for the colon cancer dataset the knots are fairly close), but it may be difficult to explain in a paper why the knots for non-time effects were only based on those will events (with not reference to cite).

Best, Lasse

Fra: mclements [mailto:notifications@github.com] Sendt: 16. april 2020 10:19 Til: mclements/rstpm2 rstpm2@noreply.github.com Cc: Lasse Hjort lahja@dcm.aau.dk; Author author@noreply.github.com Emne: Re: [mclements/rstpm2] Knot placement for non-time spline terms (#16)

[Lightly edited copy of my email reply]

Lasse,

Good to hear from you.

There are good reasons to specify the spline terms for the /time effects/ using data for the events: for a rare event with administrative censoring, if all of the data were used to specify the splines, then most of the knots would be close to the end of follow-up rather than the events. Both Stata's and rstpm2's implementations determine the knots for times based on individuals with events.

Stata's stpm2 specifies non-time-based covariates from the full dataset, so that most of those spline-based effects will be based on the full dataset. Surprisingly, Stata's stpm2 also specifies the orthogonalisation matrix for the time-based splines using the full dataset, which I think is inconsistent with the knot determination.

The current implementations in rstpm2 specifies all of the design information based on the individuals with events -- that is, there is one model frame based on stats::glm or mgcv::gam for stpm2 or pstpm2, respectively, which is used to generate the design matrix for the full dataset. The alternative would require that two model frames would be need and then checks would be required whether the combined model matrix is full rank. Moreover, predictions would need to deal with the two model frames.

As a thought experiment, if those without events have very different covariate patterns to those with events, then we could argue for an average covariate pattern across the full dataset. However, for a rare event, the average covariate spline support may be poor for the events, which would lead to issues with estimation. This suggests that specifying the splines using the dataset restricted to the events may be more suitable for rare events.

Do you have a strong argument for using the two model frames?

Kindly, Mark.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmclements%2Frstpm2%2Fissues%2F16%23issuecomment-614492039&data=02%7C01%7C%7Cc38056677ff045e77dca08d7e1def6de%7C5968b90c51a64f088b4750ffffbe2e4f%7C0%7C0%7C637226220056082392&sdata=3D44r8gFj%2BvI4%2BqF8b3TIGtZvc0cSPF3dNq607LYgag%3D&reserved=0, or unsubscribehttps://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FADGLHRMFYQWMC6LRDQMT5EDRM25QZANCNFSM4MJJFEGQ&data=02%7C01%7C%7Cc38056677ff045e77dca08d7e1def6de%7C5968b90c51a64f088b4750ffffbe2e4f%7C0%7C0%7C637226220056093180&sdata=fV8bTJS%2F7VBKARgZ9j50QbtZL14I%2FO3LgHNTrYszNmc%3D&reserved=0.

mclements commented 3 months ago

@LasseHjort : I am going to close this old issue. Sincerely, Mark.

mclements / rstpm2

Knot placement for non-time spline terms #16