Closed bblodfon closed 1 month ago
Hey John! I think harmonizing is a good idea, and it's much easier to align `aorsf` with the other learners than aligning the other learners with `aorsf`. I think my rationale was that evaluating model predictions at the times when events occur should improve efficiency versus evaluating the predictions at times around those points or potentially missing event times in testing data that occur before or after the first or last event time in the training data, respectively. But in most cases I think the event times will be very similar in training versus testing data.
See https://github.com/mlr-org/mlr3extralearners/pull/385 for the time point harmonization.
In the code example I now have the 3 RSFs (`ranger`, `aorsf` and `rfsrc`) that provide the unique train event time points, while all the rest of the learners provide the unique train time points for the survival matrix during prediction.
- `penalized` behaves like the RSFs (unique train event times) and adds 0 and the largest time point if it belongs to a censored observation
- Some learners (`parametric`, `rfsrc`, `ranger`, `akritas`) have an argument to change the granularity (i.e. how many) of the time points used
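
To make the difference between these time grids concrete, here is a minimal base R sketch on made-up data. The vectors `train_time` and `train_status` and the name `penalized_like` are hypothetical and only illustrate the three behaviours described above; they are not taken from any learner's code.

```r
# Toy training data (hypothetical): observed times and event indicators
# (1 = event, 0 = censored).
train_time   <- c(2, 5, 5, 7, 9, 12)
train_status <- c(1, 0, 1, 1, 0, 0)

# Grid 1: all unique train time points (events and censorings).
all_times <- sort(unique(train_time))

# Grid 2: unique train event time points only (the RSF-style grid).
event_times <- sort(unique(train_time[train_status == 1]))

# Grid 3: event times plus 0 and the largest observed time.
# The largest time is only an extra point when it is censored,
# because otherwise it is already among the event times.
penalized_like <- sort(unique(c(0, event_times, max(train_time))))

print(all_times)
print(event_times)
print(penalized_like)
```

In this toy example the largest time (12) is censored, so the third grid has one more point at the end than the event-time grid, plus the leading 0.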
Investigation
I performed a small benchmark related to this PR - see reprex below: I wanted to know, across all survival `mlr3` learners that produce a survival matrix (`distr` predict type in `mlr3proba`), which time points are used as columns.
Results
Most survival learners use all the train time points (this plays a large role for computing metrics like e.g. IBS and making things fair). The different ones are the following:
- `surv.aorsf` uses the unique event time points from the test set (see the code) - maybe it's a good idea to change that and harmonize with the rest of the RSFs (for some reason these learners use the unique event time points from the train set). We can directly use the `learner$model$event_times` slot in `predict()`.
- `akritas` and `parametric` have an `ntime` argument (default 150), to "spread out" the time points of the train set. The reason for this was efficiency (to NOT have too many time points). We could change that so the default is to use the unique train time points from the `model$y[, "time"]` slot, and if users want to use `ntime` they can do that.
- `penalized`: uses the unique train event times but adds `0` and the largest time point (which is an extra time point if it belongs to a censored observation) - discussed with author.
- `rfsrc` also has an `ntime` argument (default value: `150`) that coerces the unique event times to `150` if more than `150` exist in the training data. In the example task below this doesn't happen (< 150 events), but it seems that having such a parameter, possibly for all learners, is a good thing.
- `surv.ranger` also has a `time.interest` argument which is like `ntime`; the default there is `NULL` (use all observed time points).

Created on 2024-09-26 with reprex v2.1.1
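
As a rough illustration of what an `ntime`-style argument does, the sketch below thins a grid of unique time points down to at most `ntime` values. The helper `thin_grid()` and its evenly-spaced-index rule are assumptions made for this example; the actual learners may choose their grid differently.

```r
# Hypothetical helper (not taken from any package): keep at most `ntime`
# points of a sorted grid by picking roughly evenly spaced positions.
thin_grid <- function(times, ntime = 150) {
  times <- sort(unique(times))
  # NULL means "use all time points"; small grids are returned unchanged.
  if (is.null(ntime) || length(times) <= ntime) {
    return(times)
  }
  idx <- unique(round(seq(1, length(times), length.out = ntime)))
  times[idx]
}

grid_small <- thin_grid(1:100)                 # fewer than 150 points
grid_large <- thin_grid(1:1000)                # more than 150 points
grid_all   <- thin_grid(1:1000, ntime = NULL)  # no thinning requested

print(length(grid_small))
print(length(grid_large))
print(length(grid_all))
```

With fewer than `ntime` unique times, as in the example task above, the grid is left as is, which is why the argument has no visible effect there.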