Closed RaphaelS1 closed 3 years ago
Good summary. It sounds like this issue should be easy to detect with the now fixed sanity check?
From our discussion earlier:
crank
should always, and consistently, be associated with higher average survival timeThis would require:
crank
prediction meaning higher predicted average survival time and lower predicted event riskI would prefer this above the alternative of leaving crank
convention as-is, and changing the signs to the opposite. Why: in that alternative scenario, "concordant pairs" would be pairs with an opposite difference sign, which seems counterintuitive and prone to mis-interpretation (and mis-implementation).
In either case, there needs to be a clearly documented extension guideline which makes explicit reference to what - as mathematical objects - the inputs and returns should be. The devil is in the detail, in this case in the sign, and i.m.o. only use of clearly defined math prevents this kind of issue in the future...
(Github should really fix their missing support for TeX formulae in markdown)
@mllg, @RaphaelS1, indeed, I like the sanity check of computing concordance of crank
predictions against response
predictions.
However, note! If concordance index is defined as fraction of discordant pairs, as it is now, then the sanity check passes if the index is 0 and fails if it is 1. That is, in my opinion, another reason to change concordance index to fraction of concordant pairs.
Depends on the sign convention, of course - it should be consistent between sanity check and the actual predictions.
@mllg
Good summary. It sounds like this issue should be easy to detect with the now fixed sanity check?
Should being the important word. ANNs still often fail the sanity test because untuned ANNs are not good... but I've added manual checks for these. Also sanity check will not pick up on distributions that have been composed from this sort of mistake, but again I've only found one instance of this and already fixed it. I'd be surprised if it happens again.
@fkiraly
I think higher crank should always, and consistently, be associated with higher average survival time
As said above, this will happen in a near-future patch. I've left it as-is for now just because I wanted to get the critical bug out the way. The actual user-interface won't change at all once we've made the change of sign.
If concordance index is defined as fraction of discordant pairs, as it is now, then the sanity check passes if the index is 0 and fails if it is 1.
No this is not the case, as explained before our concordance measure assumes that high rank = high risk
Update:
The previous set of patches did not fully fix the bug as it transpires there were two separate bugs that had the effect of cancelling each other out. Both caused an incorrect calculation of crank
, neither should have affected other results. These are broken down below:
Bug 1 - Internal computations
Many models return predicted survival probabilities as a matrix where rows are observations and columns are times, entries are predicted survival probabilities (or the transpose). Treated as distributions, these are degenerate as there are no predictions at t = 0 and more importantly no predictions where S(t) = 0. This means that computation of the mean of this distribution, will be wrong, which was used for crank
. This is now fixed for all surv learners by implementing a helper function, .surv_returner
which will automatically compute the distribution and 'fix it' by adding a column of 1s for survival probability at t = 0 and a column of 0s for survival probability at t = max(y)+1e3 where y is observed survival times. The bug meant that it was not the actual mean of the distribution but some random quantity, hence randomly giving rankings in the 'wrong direction'. This is now fixed and more robust sanity checks prevent this occurring again.
Bug 2 - External computations in lp
Different models define the linear predictor differently, and to make matters more complicated, some models actually change the definition within the same learner, e.g. in surv.mboost
the c-index calculated on predict lp
was previously less than 0.5
for PH models and greater than 0.5
for AFT models. I strongly suspect that what's happened here is that the negative linear predictor is returned for AFT models whereas the 'usual' positive lp is returned for PH models, without any warning in documentation. Similar problems are seen in surv.parametric
where the ranked lp
is reverse for PH and AFT/PO models. Again robust sanity checks, for PH and non-PH models prevent these errors occurring again.
Bug
The
crank
return type is defined such that a higher value implies a higher risk of event, however in the listed learners below this has not been the case, affecting all measures of discrimination.This bug was introduced as several implemented packages do not clearly document how the predicted risk should be interpreted, in some cases a higher value is associated with higher risk of event, in others the reverse is true. In particular the packages {survival} and {mboost} reverse this within a single learner where in some instances the returned linear predictor will be interpreted as "high value = high risk" and in others "high value = low risk".
A secondary fault was also found internally where the default
crank
was set as the mean survival time, when it should have been the negative mean survival time. In particularsurv.parametric
was returning wrongcrank
anddistr
values, which resulted from a lack of documentation in thesurvival
package indicating that the linear predictor should be interpreted differently from convention.Fix
crank
predictions from any of the above learners should be modified by subtracting the prediction from ‘1’.surv.cindex
will be sub-optimal, this cannot be fixed post-hocAffected pipeops/pipelines
mlr3proba
Affected learners
mlr3learners
mlr3extralearners:
Future Patches
In addition the following non-critical updates will be made in future patches related to these bugs:
crank
will be reversed along with thecindex
calculations. This will have no impact on current code and will not cause any breakages, assumingcrank
has only been used for measures of discrimination.PipeOpDistrCompositor
will be extended to take into account that thelp
should occasionally instead be taken to be-lp
.(cc. @berndbischl @mllg @adibender @fkiraly)