Deep learning is hard. Really small, seemingly silly changes can have a big impact. One such change I've discovered is to clip the log-likelihood. By capping it at log(1-p), the loss stops pushing censored observations to the right once the likelihood of the observation reaches 1-p.
This makes infinity-predictions controllable and fixes 99% of the NaN and numerical-instability problems. I.e., if we set p=1e-4, there is zero gradient contribution once the model has found a threshold t such that Pr(Y>t)=0.9999. I previously refrained from clipping since t then loses its literal meaning; I figured the prediction should/could go to infinity and training should be allowed to fail. With clipping this won't happen. Interpretations of predictions should be modified to account for this, but I concluded the benefits outweigh this minor problem.
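As a minimal sketch of the idea (plain NumPy, not the wtte code): for a right-censored Weibull observation the log-likelihood is the log-survival, and the cap at log(1-p) is the only new piece. `a` and `b` here denote the Weibull scale and shape.

```python
import numpy as np

def censored_loglik_clipped(t, a, b, p=1e-4):
    """Clipped log-likelihood for a right-censored observation.

    For a Weibull(a, b) model the log-survival is
    log Pr(Y > t) = -(t / a)**b, which tends to 0 as the
    prediction is pushed to the right. Capping it at log(1 - p)
    zeroes the gradient once Pr(Y > t) >= 1 - p.
    """
    loglik = -(t / a) ** b                   # log Pr(Y > t), always <= 0
    return np.minimum(loglik, np.log1p(-p))  # cap at log(1 - p)
```

Once the cap is active the loss is constant in `a` and `b`, so the censored observation contributes no gradient and the predicted scale can no longer run off toward infinity.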
[x] Version number
[x] Rerun wtte-rnn-examples
[ ] Add changelog
Changes
Add clipping to log-likelihood dcebad233bd318f8529463bb51a38dfc77434a21
Deprecate penalization of beta for regularization. I've found that clipping and modulating beta through the activation function parameters is much more effective. c9bfdba609c55af24b55d0da8ce75a9bd964e9cd
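A hedged illustration of the idea in plain NumPy (the parameter names `init_beta` and `max_beta` are assumptions for this sketch, not necessarily the wtte API): instead of penalizing large beta in the loss, the output activation itself keeps beta positive and bounded.

```python
import numpy as np

def beta_activation(x, init_beta=1.0, max_beta=5.0):
    """Map a raw network output to a bounded, positive beta.

    Softplus keeps beta positive; scaling by init_beta sets the
    value at x = 0; the cap at max_beta replaces an explicit
    regularization penalty on beta. (Illustrative sketch only.)
    """
    beta = np.logaddexp(0.0, x)  # numerically stable softplus
    return np.minimum(beta * init_beta, max_beta)
```

Because the constraint lives in the activation, no extra penalty term is needed in the loss, and beta can never leave the allowed range during training.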
Backward-compatible updates to the wtte API. It's just a little less ugly, i.e. call loss_fun = wtte.Loss(type='discrete').loss_function instead of wtte.loss.... ba130459c52dfe4ba1bbb67c920c51ede73077ab
Added an output-layer-bias pre-training step to the wtte-rnn-examples. It improves numerical stability and greatly shortens training time, even though having a step like this is ugly.
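The intuition behind the bias step can be sketched as follows (an illustrative assumption about what such a step does, in plain NumPy, not the actual wtte-rnn-examples code): choose the output-layer bias so that, assuming an exp activation on the alpha output, the network's initial alpha prediction matches the mean observed time-to-event instead of starting from an arbitrary value.

```python
import numpy as np

def init_alpha_bias(tte):
    """Bias for an exp-activated alpha output such that the initial
    alpha prediction equals the mean observed time-to-event.
    (Illustrative sketch; not the wtte-rnn-examples code.)
    """
    return np.log(np.mean(tte))

tte = np.array([1.0, 2.0, 3.0, 6.0])
bias = init_alpha_bias(tte)  # exp(bias) ~ mean(tte) = 3.0
```

Starting the alpha output near the empirical mean means early gradient steps don't have to climb out of an extreme initial prediction, which is consistent with the stability and training-time gains noted above.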