Critical bug fix - Affects almost all learners and discrimination measures

RaphaelS1 commented 3 years ago

Bug

The crank return type is defined such that a higher value implies a higher risk of event, however in the listed learners below this has not been the case, affecting all measures of discrimination.

This bug was introduced as several implemented packages do not clearly document how the predicted risk should be interpreted, in some cases a higher value is associated with higher risk of event, in others the reverse is true. In particular the packages {survival} and {mboost} reverse this within a single learner where in some instances the returned linear predictor will be interpreted as "high value = high risk" and in others "high value = low risk".

A secondary fault was also found internally where the default crank was set as the mean survival time, when it should have been the negative mean survival time. In particular surv.parametric was returning wrong crank and distr values, which resulted from a lack of documentation in the survival package indicating that the linear predictor should be interpreted differently from convention.

Fix

For all future work, please upgrade to the following package versions:
- mlr3proba v0.2.4
- mlr3learners v0.4.2
- mlr3extralearners v0.1.1
Fix for predictions made from earlier versions:
- All crank predictions from any of the above learners should be modified by subtracting the prediction from ‘1’.
- Any learners tuned with surv.cindex will be sub-optimal, this cannot be fixed post-hoc

Affected pipeops/pipelines

mlr3proba

PipeOpCrankCompositor
pipeline_crankcompositor

Affected learners

mlr3learners

ranger

mlr3extralearners:

akritas
blackboost
cforest
ctree
coxtime
deephit
deepsurv
dnnsurv
gamboost
glmboost
loghaz
mboost
obliqueRSF
parametric
pchazard
penalized
rfsrc
svm

Future Patches

In addition the following non-critical updates will be made in future patches related to these bugs:

The definition of crank will be reversed along with the cindex calculations. This will have no impact on current code and will not cause any breakages, assuming crank has only been used for measures of discrimination.
PipeOpDistrCompositor will be extended to take into account that the lp should occasionally instead be taken to be -lp.

(cc. @berndbischl @mllg @adibender @fkiraly)

mllg commented 3 years ago

Good summary. It sounds like this issue should be easy to detect with the now fixed sanity check?

fkiraly commented 3 years ago

From our discussion earlier:

I think higher crank should always, and consistently, be associated with higher average survival time
compatibly with that, concordance index in the sense of "fraction of concordant pairs" (concordant = same sign of difference) should be used as a metric

This would require:

changing the sign in some of the interfaced or implemented learners - but actually it would affect only the minority
changing documented convention to higher crank prediction meaning higher predicted average survival time and lower predicted event risk

I would prefer this above the alternative of leaving crank convention as-is, and changing the signs to the opposite. Why: in that alternative scenario, "concordant pairs" would be pairs with an opposite difference sign, which seems counterintuitive and prone to mis-interpretation (and mis-implementation).

In either case, there needs to be a clearly documented extension guideline which makes explicit reference to what - as mathematical objects - the inputs and returns should be. The devil is in the detail, in this case in the sign, and i.m.o. only use of clearly defined math prevents this kind of issue in the future...

(Github should really fix their missing support for TeX formulae in markdown)

fkiraly commented 3 years ago

@mllg, @RaphaelS1, indeed, I like the sanity check of computing concordance of crank predictions against response predictions.

However, note! If concordance index is defined as fraction of discordant pairs, as it is now, then the sanity check passes if the index is 0 and fails if it is 1. That is, in my opinion, another reason to change concordance index to fraction of concordant pairs.

Depends on the sign convention, of course - it should be consistent between sanity check and the actual predictions.

RaphaelS1 commented 3 years ago

@mllg

Good summary. It sounds like this issue should be easy to detect with the now fixed sanity check?

Should being the important word. ANNs still often fail the sanity test because untuned ANNs are not good... but I've added manual checks for these. Also sanity check will not pick up on distributions that have been composed from this sort of mistake, but again I've only found one instance of this and already fixed it. I'd be surprised if it happens again.

@fkiraly

I think higher crank should always, and consistently, be associated with higher average survival time

As said above, this will happen in a near-future patch. I've left it as-is for now just because I wanted to get the critical bug out the way. The actual user-interface won't change at all once we've made the change of sign.

If concordance index is defined as fraction of discordant pairs, as it is now, then the sanity check passes if the index is 0 and fails if it is 1.

No this is not the case, as explained before our concordance measure assumes that high rank = high risk

RaphaelS1 commented 3 years ago

Update:

The previous set of patches did not fully fix the bug as it transpires there were two separate bugs that had the effect of cancelling each other out. Both caused an incorrect calculation of crank, neither should have affected other results. These are broken down below:

Bug 1 - Internal computations

Many models return predicted survival probabilities as a matrix where rows are observations and columns are times, entries are predicted survival probabilities (or the transpose). Treated as distributions, these are degenerate as there are no predictions at t = 0 and more importantly no predictions where S(t) = 0. This means that computation of the mean of this distribution, will be wrong, which was used for crank. This is now fixed for all surv learners by implementing a helper function, .surv_returner which will automatically compute the distribution and 'fix it' by adding a column of 1s for survival probability at t = 0 and a column of 0s for survival probability at t = max(y)+1e3 where y is observed survival times. The bug meant that it was not the actual mean of the distribution but some random quantity, hence randomly giving rankings in the 'wrong direction'. This is now fixed and more robust sanity checks prevent this occurring again.

Bug 2 - External computations in lp

Different models define the linear predictor differently, and to make matters more complicated, some models actually change the definition within the same learner, e.g. in surv.mboost the c-index calculated on predict lp was previously less than 0.5 for PH models and greater than 0.5 for AFT models. I strongly suspect that what's happened here is that the negative linear predictor is returned for AFT models whereas the 'usual' positive lp is returned for PH models, without any warning in documentation. Similar problems are seen in surv.parametric where the ranked lp is reverse for PH and AFT/PO models. Again robust sanity checks, for PH and non-PH models prevent these errors occurring again.

mlr-org / mlr3proba