ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

Validating data_pipeline #9

Closed NataliaVConnolly closed 6 years ago

NataliaVConnolly commented 7 years ago

Hi Egil,

Thanks so much for releasing the end-to-end wtte-rnn code in data_pipeline! Very cool stuff.

I had a question about validating the performance though. To check how well the model predicts TTE, I did the following:

predicted_t = model.predict(x_test)
predicted_t[:, :, 1] = predicted_t[:, :, 1] + predicted_t[:, :, 0] * 0  # lazy re-add of the NaN mask
print(predicted_t.shape)

pred_df = tr.padded_to_df(predicted_t, column_names=["alpha", "beta"], dtypes=[float, float], ids=pd.unique(df.id))
pred_df['pred_tte'] = pred_df.apply(lambda g: g.alpha * math.pow(math.log1p(0.5), 1 / g.beta), axis=1)
pred_df['actual_tte'] = y_test[:, :, 0].flatten()

where

 x_test      = left_pad_to_right_pad(right_pad_to_left_pad(x)[:, (n_timesteps - n_timesteps_to_hide):, :])
 y_test      = left_pad_to_right_pad(right_pad_to_left_pad(y)[:, (n_timesteps - n_timesteps_to_hide):, :])
 events_test = left_pad_to_right_pad(right_pad_to_left_pad(events)[:, (n_timesteps - n_timesteps_to_hide):])

 y_test[:, :, 0] = tr.padded_events_to_tte(events_test, discrete_time=discrete_time, t_elapsed=padded_t)
 y_test[:, :, 1] = tr.padded_events_to_not_censored(events_test, discrete_time)
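For reference, here is a toy sketch of what the discrete-time TTE target looks like, as I understand the semantics (my own reimplementation for illustration, not the library's tr.padded_events_to_tte): TTE counts down to 0 at each event, and after the last event it is the censored number of steps remaining to the end of the sequence.

```python
import numpy as np

def toy_events_to_tte(events):
    # events: 1-d binary array; returns, per step, the number of steps
    # until the next event (0 at the event itself). After the last
    # event, the value is the censored distance to the sequence end.
    n = len(events)
    tte = np.empty(n)
    countdown = 0
    for t in range(n - 1, -1, -1):
        if events[t] == 1:
            countdown = 0
        tte[t] = countdown
        countdown += 1
    return tte

print(toy_events_to_tte(np.array([0, 0, 1, 0, 1])))  # [2. 1. 0. 1. 0.]
```

The second target channel (not-censored) is then simply a flag for whether an event actually occurs at or after each step.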

What I got looked like this:

>  pred_df

Out[16]:
          id   t      alpha      beta  pred_tte  actual_tte
0          1   0   0.148557  0.743166  0.044092          10
1          1   1  18.626453  0.687964  5.014936           9
2          1   2  21.242054  0.726595  6.132385           8
3          1   3  29.170420  0.734831  8.539321           7
4          1   4  30.190809  0.744385  8.978482           6
...      ...  ..        ...       ...       ...         ...
5233     802  49   4.187856  0.667144  1.082288           1
5234     802  50   5.580938  0.632970  1.340699           0
5235     802  51   2.631150  0.609310  0.598024           3
5236     802  52   4.732635  0.670265  1.230809           2
5237     802  53   5.632733  0.642269  1.381371           1

So if you plot predicted TTE vs. actual they don't agree much, not even directionally. Clearly I am missing something. Is this not a valid way to compare predicted vs. actual TTE?

Thank you! Natalia

ragulpr commented 7 years ago

Hi there, Thanks for the kind words.

  1. Your actual TTE might be the censored TTE (so pred_tte would seem to overestimate it for the censored observations). Try removing the censored values to get a sense of magnitude.

  2. It looks like you're trying to predict the median TTE (a * np.power(-np.log(1.0 - p), 1.0 / b) in numpy, with p = 0.5). Rather than how close each prediction is, I would be more interested in whether about 50% of the actual TTEs fall below the predicted median.

  3. If you want a general sense of performance, take the (scale-dependent) correlation between actual and predicted. Also try the (scale-independent) correlation after a dense rank (i.e. rank correlation).
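To make points 2 and 3 concrete, here is a small sketch using the standard Weibull quantile formula and toy stand-in arrays for the pred_df columns from the snippet above (nothing here is repo code):

```python
import numpy as np
from scipy.stats import spearmanr

def weibull_quantile(a, b, p=0.5):
    # t such that P(T <= t) = p; with p = 0.5 this is the median,
    # a * ln(2) ** (1 / b). Note that -np.log1p(-p) == -log(1 - p).
    return a * np.power(-np.log1p(-p), 1.0 / b)

# toy stand-ins for pred_df.alpha, pred_df.beta, pred_df.actual_tte
alpha  = np.array([18.6, 21.2, 29.2, 30.2])
beta   = np.array([0.69, 0.73, 0.73, 0.74])
actual = np.array([9.0, 8.0, 7.0, 6.0])

pred_median = weibull_quantile(alpha, beta)

# calibration: roughly half of the (uncensored) actual TTEs
# should fall below the predicted median
coverage = (actual <= pred_median).mean()

# scale-independent agreement: rank correlation
rho, _ = spearmanr(pred_median, actual)
print(coverage, rho)
```

Note that the median uses -log(1 - p) = ln 2 for p = 0.5, whereas math.log1p(0.5) in the snippet above is ln(1.5), which shrinks every prediction by a constant factor.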

aprotopopov commented 7 years ago

I think the problem still exists, because in the full-pipeline example (https://github.com/ragulpr/wtte-rnn/blob/master/examples/data_pipeline/data_pipeline.ipynb) there is a plot for one individual whose predicted TTE does not trend the right way (see attached image), and there is not much censored data (see attached image). Do you know what the reason could be, and what could be changed in the example to fit the data appropriately?

ragulpr commented 7 years ago

So @aprotopopov, the direction it goes actually makes sense given the information available. If someone hasn't committed in a while, there's a risk that it's a long break or that they've stopped altogether. I bet that's what the algo thinks, anyway.

The scale is a bit off, though. That's worth digging into, but keep in mind that this is a serial committer, while many in the dataset commit very sparsely (high TTEs).
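That intuition is also built into the fitted distributions: the betas in the table above are all below 1, and a Weibull with beta < 1 has a decreasing hazard, so the longer the gap since the last commit, the lower the instantaneous chance of one occurring now. A quick check with the standard hazard formula (my own sketch, not repo code; the alpha/beta values are roughly those from the table):

```python
import numpy as np

def weibull_hazard(t, a, b):
    # standard Weibull hazard: (b / a) * (t / a) ** (b - 1),
    # which is decreasing in t whenever b < 1
    return (b / a) * (t / a) ** (b - 1)

a, b = 20.0, 0.7
print(weibull_hazard(1.0, a, b) > weibull_hazard(10.0, a, b))  # True for b < 1
```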

Another way, which I did not mention, if you're worried about performance, is to see how well the model does in terms of fixed windows:

from sklearn import metrics

aucs = []
# compare the score to a sliding box of increasing width, up to the last decile
max_box_width = int(np.sort(seq_lengths)[-len(seq_lengths) // 10])

for box_width in range(max_box_width):
    if (box_width % 10) == 0:
        # select only unmasked and comparable datapoints
        m = ~np.isnan(y[:, :, 1])
        # uncensored, or within box_width of the boundary
        m[m] = (y[:, :, 1][m] == 1) | (box_width < y[:, :, 1][m])

        actual = y[:, :, 0][m].flatten() <= box_width
        pred = weibull.cmf(a=predicted[:, :, 0], b=predicted[:, :, 1], t=box_width)[m].flatten()

        fpr, tpr, thresholds = metrics.roc_curve(actual, pred)
        print('auc:', metrics.auc(fpr, tpr), ' sliding box', box_width)
        aucs.append(metrics.auc(fpr, tpr))
plt.plot(aucs)

For git repos I get AUCs in the 0.70 to 0.90 range for most reasonable window sizes.

Btw @aprotopopov, I did not forget about your PR. It's great; I'm just trying to merge it with some local changes. Will comment ASAP!