ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

Log-likelihood for discrete Weibull distribution #58

Open michael-tsel opened 5 years ago

michael-tsel commented 5 years ago

There seems to be an issue with the log-likelihood for the discrete Weibull distribution with censored data (u=0). According to equation (2.7) in Proposition 2.26 of your great thesis, the likelihood in this case is L_d = Pr(T_d > t) = Pr(T >= t+1) for t in {0,1,2,...}. However, I believe it should be L_d = Pr(T_d >= t) = Pr(T >= t) for t in {0,1,2,...}. [Sorry, I have found no way to use TeX here.]

The argument is as follows. Assume u=0 and tte=0 for some fixed day. This means that the next event might occur on any day after that fixed day, so the probability should be equal to 1. In your case it is strictly less than 1.
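To make the argument concrete, here is a minimal Python sketch comparing the two censored likelihoods at t=0. It assumes the usual continuous Weibull survival function S(t) = exp(-(t/a)^b) with scale a and shape b. The function names and parameter values are illustrative only and are not taken from the wtte-rnn code.

```python
import math


def weibull_survival(t, a, b):
    """Pr(T >= t) for a continuous Weibull with scale a and shape b."""
    return math.exp(-((t / a) ** b))


def censored_lik_thesis(t, a, b):
    # Convention quoted from eq. (2.7): Pr(T_d > t) = Pr(T >= t + 1)
    return weibull_survival(t + 1, a, b)


def censored_lik_alternative(t, a, b):
    # Alternative proposed in this issue: Pr(T_d >= t) = Pr(T >= t)
    return weibull_survival(t, a, b)


a, b = 10.0, 1.5  # hypothetical parameter values, chosen only for illustration

print(censored_lik_thesis(0, a, b))       # strictly between 0 and 1
print(censored_lik_alternative(0, a, b))  # 1.0
```

Under the first convention the censored likelihood at t=0 is below 1 for any finite a, while under the second it is exactly 1.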

ragulpr commented 4 years ago

Hi there, and sorry for the slow response. I'm very happy for the contribution, and I'm surprised and impressed that someone took the time to think about this issue too, because I sure did.

I was worried that my definition might cause off-by-one errors and confusion, but when I tried the alternative, it caused me more confusion in the long run when alternating between discrete and continuous time.

> Assume u=0 and tte=0 for some fixed day. It means that the next event might occur at any day after that fixed day, so the probability should be equal to 1. In your case it's strictly lower than 1.

I read your interpretation as us disagreeing on how to index intervals: zero-based or one-based. I thought long and hard about this and decided to treat discretized time as indexed intervals, with the first interval having index 0. I don't think there's a right or wrong here, just a matter of taste. But here are some arguments:

To me this seemed to make more sense but it's really a matter of taste. Again, thanks for the kind words.

ragulpr commented 4 years ago

Also, see my comments on #59

michael-tsel commented 4 years ago

So, if I understand you correctly, for non-censored data (y=1) and t_d=0 you calculate the likelihood as the probability that continuous t falls in [0,1), while for censored data (y=0) and t_d=0 you calculate the likelihood as the probability that continuous t falls in [1,+\infty). This makes things clear. However, one should read the reference carefully before feeding data into the WTTE-RNN framework. I feel that the issue can be closed.
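The reading in this last comment can be checked numerically. Below is a minimal Python sketch of a discrete Weibull log-likelihood under the zero-based interval convention described in this thread, assuming the cumulative hazard (t/a)^b with scale a and shape b. The function names, the `observed` flag, and the parameter values are illustrative assumptions, not the library's API.

```python
import math


def cumulative_hazard(t, a, b):
    """Weibull cumulative hazard (t / a) ** b, so that Pr(T >= t) = exp(-hazard)."""
    return (t / a) ** b


def discrete_weibull_loglik(t_d, observed, a, b):
    """Log-likelihood of discrete time t_d under zero-based interval indexing.

    observed=True : event fell in [t_d, t_d + 1)
    observed=False: censored, event falls in [t_d + 1, +inf)
    """
    h0 = cumulative_hazard(t_d, a, b)
    h1 = cumulative_hazard(t_d + 1, a, b)
    if observed:
        # log Pr(t_d <= T < t_d + 1) = log(exp(-h0) - exp(-h1))
        return math.log(math.exp(-h0) - math.exp(-h1))
    # log Pr(T >= t_d + 1) = -h1
    return -h1


a, b = 10.0, 1.5  # hypothetical parameter values, chosen only for illustration

p_event = math.exp(discrete_weibull_loglik(0, True, a, b))      # Pr(T in [0, 1))
p_censored = math.exp(discrete_weibull_loglik(0, False, a, b))  # Pr(T in [1, +inf))
print(p_event, p_censored, p_event + p_censored)
```

At t_d=0 the two probabilities partition the positive half-line, so they sum to 1 up to floating-point error. This is why the censored likelihood at t_d=0 is below 1 under this convention.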