Log-likelihood for discrete Weibull distribution

michael-tsel commented 5 years ago

There seems to be an issue with the log-likelihood for the discrete Weibull distribution with censored data (u=0). According to equation (2.7) in Proposition 2.26 of your great thesis, the likelihood in this case is L_d = Pr(T_d > t) = Pr(T >= t+1) for t in {0,1,2,...}. However, I believe it should be L_d = Pr(T_d >= t) = Pr(T >= t) for t in {0,1,2,...}. [Sorry, I have found no way to use TeX here]

The argument is as follows. Assume u=0 and tte=0 for some fixed day. It means that the next event might occur at any day after that fixed day, so the probability should be equal to 1. In your case it's strictly lower than 1.
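The difference between the two candidate censored likelihoods at t=0 can be checked numerically. This is only a sketch: it assumes the usual continuous Weibull survival function Pr(T >= t) = exp(-(t/a)^b), and the parameter names `a`, `b` and their values are chosen purely for illustration.

```python
import math

def weibull_survival(t, a, b):
    """Pr(T >= t) for a continuous Weibull with scale a and shape b (assumed form)."""
    return math.exp(-((t / a) ** b))

a, b = 2.0, 1.5   # hypothetical parameters, not taken from the thesis
t = 0             # censored observation with tte=0

lik_thesis = weibull_survival(t + 1, a, b)   # Pr(T >= t+1), as in equation (2.7)
lik_proposed = weibull_survival(t, a, b)     # Pr(T >= t), the alternative

print("Pr(T >= t+1):", lik_thesis)     # strictly below 1
print("Pr(T >= t):  ", lik_proposed)   # exactly 1 at t=0
```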

ragulpr commented 4 years ago

Hi there, and sorry for the slow response. Very happy for the contribution, and I'm surprised and impressed that someone took the time to think about this issue too, because I sure did.

I was worried that my definition might cause off-by-one errors and confusion, but when I tried the alternative it caused me more confusion in the long run when alternating between discrete and continuous time.

"Assume u=0 and tte=0 for some fixed day. It means that the next event might occur at any day after that fixed day, so the probability should be equal to 1. In your case it's strictly lower than 1."

I read your interpretation as us disagreeing on how to index intervals: zero-based or one-based. I thought long and hard about this and decided to treat discretized time as indexed intervals, with the first interval having index 0. I don't think there's a right or wrong here, just a matter of taste. But here are some arguments:

- Numpy is zero-indexed (sorry, R-using friends).
- "Today" is t_d=0, rather than "day 1 is today". No true answer here; one could argue that the "1st day" being t=0 is pretty confusing. But in a zero-based indexing framework, having both definitions would be more confusing still.
- Under the alternative definition, u=0 with discrete t_d=0 means "the event may occur on the 1st day or any time after". That is indeed a sure event (p=1), so such an observation gives zero information, and there's really no reason to have it in the dataset.
- In my framework, I interpret u=0 with discrete t_d=0 as "discrete time is greater than 0", which is the same thing as saying "continuous time is at least 1".
- This whole saga begins with me enjoying thinking about time intervals as right-open/cadlag, i.e. t∈[t_d,t_d+1), maybe because I come from a place where today's batch jobs taking in yesterday's data were scheduled to run at 00.00.

To me this seemed to make more sense but it's really a matter of taste. Again, thanks for the kind words.
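The zero-based, right-open interval convention described above can be sketched as follows. The helper names `to_discrete` and `interval` are hypothetical and only illustrate the indexing; they are not part of any library.

```python
import math

def to_discrete(t):
    """Map continuous time t >= 0 to the index t_d of the interval [t_d, t_d+1) containing it."""
    return math.floor(t)

def interval(t_d):
    """Return the (closed left, open right) bounds of the interval with index t_d."""
    return (t_d, t_d + 1)

# Any continuous time in [0, 1) falls in the first interval, index 0 ("today").
print(to_discrete(0.0), to_discrete(0.999))
# Continuous time 1.0 is the start of the next interval, index 1.
print(to_discrete(1.0), interval(1))
```

Under this indexing, a censored observation at t_d=0 says the event lies outside the interval with index 0, i.e. at continuous time 1 or later.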

ragulpr commented 4 years ago

Also, see my comments on #59

michael-tsel commented 4 years ago

So, if I understand you correctly, for non-censored data (y=1) and t_d=0 you calculate the likelihood as the probability that continuous t falls in [0,1), while for censored data (y=0) and t_d=0 you calculate the likelihood as the probability that continuous t falls in [1,+\infty). This makes things clear. However, one should read the reference carefully before feeding their data into the WTTE-RNN framework. I feel that the issue can be closed.
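A minimal sketch of the log-likelihood under this convention, assuming the Weibull cumulative hazard (t/a)^b with scale `a` and shape `b`. The function name `discrete_weibull_loglik` and the parameter values are made up for illustration; this is not the library's code.

```python
import math

def discrete_weibull_loglik(t_d, uncensored, a, b):
    """Log-likelihood of one observation with intervals indexed as [t_d, t_d+1).

    uncensored=True : event in [t_d, t_d+1), i.e. log(Pr(T >= t_d) - Pr(T >= t_d+1))
    uncensored=False: event at t_d+1 or later, i.e. log Pr(T >= t_d+1)
    """
    h0 = (t_d / a) ** b          # cumulative hazard at the left edge of the interval
    h1 = ((t_d + 1) / a) ** b    # cumulative hazard at the right edge
    if uncensored:
        return math.log(math.exp(-h0) - math.exp(-h1))
    return -h1

a, b = 2.0, 1.5  # hypothetical parameters
ll_event = discrete_weibull_loglik(0, True, a, b)      # continuous t in [0, 1)
ll_censored = discrete_weibull_loglik(0, False, a, b)  # continuous t in [1, +inf)

# The two cases at t_d=0 partition [0, +inf), so their probabilities sum to 1.
print(math.exp(ll_event) + math.exp(ll_censored))
```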

ragulpr / wtte-rnn

Log-likelihood for discrete Weibull distribution #58