imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Handling left-truncated or left-censored data? #328

Open buttrey opened 6 years ago

buttrey commented 6 years ago

Hi. I have left-truncated data. Normally I would express this by calling Surv(time1, time2, event, type = "counting"), where time1 is the time at which I first saw this observation, time2 is the event or censoring time, and event is the event indicator. (I'm hoping that type = "counting" specifies that time1 is a truncation time, rather than a left-censoring time.) But although ranger runs with this setup, it produces predictions that are uniformly 1. Consider the example on the last line of the help page for the pbcseq data set in the survival package. With a little modification, like explicitly adding a column named log.bili, we see this:

rf <- ranger(Surv(time1, time2, event) ~ age + sex + log.bili, data = pbcseq)

but then

all(predict(rf, data = pbcseq)$preds == 1) # produces TRUE

Every prediction is 1. (I get the same result with type = "counting".) That seems off. Is the story that ranger is not (yet) set up to handle this flavor of Surv() object?
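As a side note on why adding type = "counting" makes no difference here: a minimal sketch, using made-up toy values rather than pbcseq, of how to inspect the type of Surv object that the survival package builds. With three arguments, Surv() already defaults to the counting-process form.

```r
library(survival)

# Toy values for illustration only (not taken from pbcseq)
d <- data.frame(
  time1 = c(0, 2, 5),  # entry (left-truncation) time
  time2 = c(4, 7, 9),  # event or censoring time
  event = c(1, 0, 1)   # 1 = event, 0 = censored
)

y_counting <- with(d, Surv(time1, time2, event))  # three arguments
y_right    <- with(d, Surv(time2, event))         # two arguments

attr(y_counting, "type")  # "counting"
attr(y_right, "type")     # "right"
ncol(y_counting)          # 3 (start, stop, status)
```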

Thanks, Sam Buttrey

mnwright commented 6 years ago

Is the story that ranger is not (yet) set up to handle this flavor of Surv() object?

Exactly. For now we have to check the Surv object and produce an error for your example.
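A check of this kind could look roughly like the sketch below. This is not ranger's actual code: check_right_censored is a hypothetical helper, and the error wording is an assumption. It only relies on the "type" attribute that the survival package attaches to Surv objects.

```r
library(survival)

# Hypothetical helper (not part of ranger): accept only right-censored responses
check_right_censored <- function(y) {
  if (!inherits(y, "Surv")) {
    stop("Response must be a Surv object.")
  }
  type <- attr(y, "type")
  if (type != "right") {
    stop(sprintf("Surv type '%s' is not supported; only 'right' is handled.", type))
  }
  invisible(TRUE)
}

# Right-censored response: passes silently
y_ok <- Surv(c(5, 8, 3), c(1, 0, 1))
check_right_censored(y_ok)

# Counting-process (start, stop, status) response: rejected with an error
y_lt <- Surv(c(0, 2, 1), c(5, 8, 3), c(1, 0, 1))
msg <- tryCatch(check_right_censored(y_lt), error = function(e) conditionMessage(e))
print(msg)
```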

Is there any literature on RF for left censoring or truncation?

buttrey commented 6 years ago

Hi Marvin, Thanks for your speedy and helpful reply.

There is literature on fitting single trees for left-truncation. This paper:

Fu and Simonoff (2017), “Survival trees for left-truncated and right-censored data, with application to time-varying covariate data,” Biostatistics 18 (2), 352-369

describes their approach, which is implemented in the LTRCtrees package on CRAN (see this vignette: https://cran.r-project.org/web/packages/LTRCtrees/vignettes/LTRCtrees.html). This package builds on rpart and partykit.

But I haven’t found anything on random forests for survival trees with left-truncation.

Have fun, Sam Buttrey


mnwright commented 6 years ago

Thanks. I probably won't have the time to implement it soon. However, I'm happy to help if someone wants to do so.

buttrey commented 6 years ago

Sure, I understand. Thanks for the replies.


osorochynskyi commented 3 weeks ago

Hello,

For anyone who wants to fit an RF to left-truncated data: the inability to account for left-truncation explicitly can be offset by modeling time-on-study instead and including the left-truncation time as a covariate.

With this approach, the original example (ranger(Surv(time1, time2, event) ~ age + sex + log.bili, data = pbcseq)) would become ranger(Surv(time2 - time1, event) ~ time1 + age + sex + log.bili, data = pbcseq).

Although this may seem irksome when coming from parametric or semi-parametric models, in practice it gives reasonable results. In fact, in the medical literature, most studies adopt this approach even when working with parametric models.
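The time-on-study reformulation described above can be sketched end to end as follows. The data are simulated purely for illustration (they are not pbcseq), the column study_time is a name introduced here, and num.trees = 100 is an arbitrary choice. The time-on-study response is computed as a column first, which should be equivalent to writing time2 - time1 inside Surv().

```r
library(survival)
library(ranger)

set.seed(1)
n <- 200

# Simulated data for illustration only
d <- data.frame(
  time1    = runif(n, 0, 2),  # entry (left-truncation) time
  age      = rnorm(n, 50, 10),
  sex      = factor(sample(c("f", "m"), n, replace = TRUE)),
  log.bili = rnorm(n)
)
d$time2 <- d$time1 + rexp(n, rate = 0.2)  # event or censoring time
d$event <- rbinom(n, 1, 0.7)              # 1 = event, 0 = censored

# Time-on-study response; the entry time enters the model as a covariate
d$study_time <- d$time2 - d$time1

rf <- ranger(Surv(study_time, event) ~ time1 + age + sex + log.bili,
             data = d, num.trees = 100)

pred <- predict(rf, data = d)

# One survival curve per observation, evaluated at the unique event times
dim(pred$survival)
length(pred$unique.death.times)
```

Note that the predicted curves are on the time-on-study scale, so they describe survival measured from entry into the study rather than from the original time origin.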

All in all, the inability to model left-truncation seems to be a minor hindrance, and that's the reason why no one bothers to implement it.

P.S. Thanks for the great package!