ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

Question about problem setup; predicting on real-world censored data #32

Open lseffer opened 6 years ago

lseffer commented 6 years ago

Hey, I tried commenting on the github.io post, but I'll probably receive an answer faster here. I've been interested in trying this out on some real data, but I have a question about the setup of the problem.

If I have understood this correctly, the censoring indicator is part of the label dataset. Doesn't this pose a problem in the real-world case of predicting, for example, a customer's next visit time? In the real-world case, all present-day data is censored. So wouldn't the model vastly underestimate the time to event when the censoring indicator isn't part of the data fed through the model? Or is the point to do some post-prediction adjustment to account for censoring, like what happens during the training phase?

ragulpr commented 6 years ago

Sorry for the slow answer, and thanks for the great question. I replied in the blog too http://disq.us/p/1mghgb2.

The gist is: the censoring indicator should be considered part of the training data only. We use every timestep for every customer to train the algorithm.

The present-day data will naturally be mostly censored, but these censored points say "the TTE was greater than 0", which might lead to overestimation; it's hard to see how it could lead to underestimation.
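To make "a censored point says the TTE was greater than what we observed" concrete, here is a minimal numpy sketch of the right-censored Weibull log-likelihood the WTTE loss is built on (the function name is mine, not the library's API). An uncensored point contributes log f(t) and is rewarded for concentrating density at t; a censored point contributes only log S(t), so its likelihood can only improve by pushing probability mass above t, never below it.

```python
import numpy as np

def weibull_loglik(t, u, a, b):
    """Right-censored Weibull log-likelihood (continuous case).

    t: observed time, u: 1 if the event happened at t, 0 if censored,
    a: scale (alpha), b: shape (beta).
    Uncensored: log f(t) = log hazard + log S(t). Censored: log S(t) only.
    """
    t, u, a, b = (np.asarray(v, dtype=float) for v in (t, u, a, b))
    log_hazard = np.log(b / a) + (b - 1.0) * np.log(t / a)
    log_survival = -((t / a) ** b)          # log S(t) = -(t/a)^b
    return u * log_hazard + log_survival    # censored points keep only log S(t)
```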

lseffer commented 6 years ago

Thanks for getting back to me! However, I'm still not sure I understand how the model wouldn't underestimate the time to event. (By predictions I mean the mean or median of the distribution.)

As an example, say we have a customer with 30 days since their last event, so the observation is censored with a time to event of at least 30. Maybe the model has learned something about the pattern underlying long periods of inactivity, but still, without knowledge of the minimum bound, the model might predict 15 days or some other number less than 30 in this example.

Even for relatively short periods of inactivity we could end up with predictions lower than the observed censoring time. Hopefully you can shed some light on this if I have misinterpreted something.

ragulpr commented 6 years ago

You have a very valid question. Censoring is a hard concept, and the question of why this should work has puzzled me too; it's why I've put in so much effort to convince myself.

Why we can figure out the correct distribution even when there's (one type of) censoring:

  1. Empirical argument: it actually seems to work (check the tests, the notebook, and the visualization of the same experiment).
  2. Mathematical argument: check the proof on page 18
  3. Intuitive argument: "Push beyond point of censoring if censored, concentrate at tte if uncensored"

But there's actually a shorter answer to your question:

> As an example, say we have a customer with 30 days since their last event, so the observation is censored with a time to event of at least 30. Maybe the model has learned something about the pattern underlying long periods of inactivity, but still, without knowledge of the minimum bound, the model might predict 15 days or some other number less than 30 in this example.

The minimum bound is "at least 30", and the model will try to push density above that, hence the median/mean should be above 30 if it's inferable!
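To put numbers on that push (my own toy illustration, with beta held at 2): the only loss a censored point contributes is -log S(t) = (t/alpha)^beta, which is large when the predicted median sits below the censoring time and small when it sits above it.

```python
import numpy as np

BETA = 2.0      # shape held fixed for the illustration
T_CENS = 30.0   # customer has gone 30 days without an event

def censored_nll(t, a, b):
    # -log S(t) = (t/a)**b: the only loss a censored point contributes.
    return (t / a) ** b

for median in (15.0, 30.0, 45.0):
    a = median / np.log(2.0) ** (1.0 / BETA)   # alpha that puts the median there
    print(f"median {median:5.1f} -> censored loss {censored_nll(T_CENS, a, BETA):.3f}")

# median  15.0 -> censored loss 2.773
# median  30.0 -> censored loss 0.693
# median  45.0 -> censored loss 0.308
```

So gradient descent keeps pulling the predicted median up past the bound until the uncensored points in the batch push back.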

lseffer commented 6 years ago

Hey, I tried a maybe-bad example of WTTE-RNN with the jet engine data before reading through your examples, where we have readings at every timestep. This didn't yield very good results on the data I trained on (scale and predictions were OK, but overall accuracy was quite bad), but then I realized that I had missed something crucial.

I see that I should also train on the data between events, and therefore the censoring time would be part of an observation's history. So regarding my previous question, you are correct that the minimum bound will be respected.
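For anyone else setting this up, here is how I read that target construction, as a plain-Python sketch (conventions vary, e.g. whether an event step gets tte 0, so treat the details as assumptions rather than the library's exact pipeline): every timestep gets a (tte, u) pair, counting down to the next event when one follows, and counting down to the end of the observed window when none does.

```python
import numpy as np

def tte_targets(event_steps, n_timesteps):
    """Per-timestep (tte, u) targets for one customer.

    event_steps: sorted timesteps at which an event occurred.
    tte[i]: steps from i to the next event at or after i, or, if no event
    follows, steps remaining until the end of the observed window.
    u[i]: 1 if that tte ends in an observed event, 0 if censored."""
    tte = np.zeros(n_timesteps)
    u = np.zeros(n_timesteps)
    for i in range(n_timesteps):
        later = [e for e in event_steps if e >= i]
        if later:
            tte[i], u[i] = later[0] - i, 1.0     # counts down to the event
        else:
            tte[i], u[i] = n_timesteps - i, 0.0  # censored: counts down to "now"
    return tte, u

# Events on days 2 and 5, observed through day 9:
tte, u = tte_targets([2, 5], 10)
print(tte)  # [2. 1. 0. 2. 1. 0. 4. 3. 2. 1.]
print(u)    # [1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
```

The censored tail is exactly the "count down to 1 at the present" pattern asked about in question 2 below.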

If I may, I'd like to ask a couple more questions about the problem setup that have puzzled me a bit since reading your examples.

  1. Masking

First is the masking that is needed for shorter sequences. I get that I can mask them with impossible/improbable values in X, but what about y? Why have you used the expected TTE and 0.95 in your data pipeline examples? Aren't these discarded anyway when you pass the 0.0 sample_weights for these points, so why not use some random crap value here also? (See the padding sketch after this list.)

  2. Censoring backwards?

Why do we count the censoring backwards? I.e., if we have a censored observation with time since event 30, we start counting at 30 and go down to 1 when we hit the "present".

  3. Training time, size, loss, etc.

What are some typical metrics for this problem? I am seeing that the loss plateaus quite fast. Assuming I can get it to train without NaN losses etc., can I expect to hit a loss close to 0, or should I be happy once I have reached the plateau?

Have you tried estimating with deeper or wider networks?

How much data, approximately, would you say this needs for convergence? From what I've seen it looks to converge quite rapidly even with small amounts of data. I know it's an impossible question, but if you have any ballpark numbers they would be beneficial to know.
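On the masking question above, here is my understanding of the numerics, as a hedged sketch (the shapes, placeholder values, and the `wtte_loss` name are mine): the masked y values never matter in theory, because their per-timestep loss is multiplied by a 0.0 sample weight, but the loss is computed before the weighting, and 0 * inf or 0 * nan is nan in floating point. A "random crap value" that makes the log-likelihood blow up (say a zero or negative tte) would therefore still poison the gradients; any benign finite placeholder is safe, which is presumably why sane-looking values are used.

```python
import numpy as np

n_seq, max_len, n_feat = 4, 10, 5
x = np.zeros((n_seq, max_len, n_feat))   # padded features
y = np.zeros((n_seq, max_len, 2))
y[..., 0] = 1.0       # placeholder tte: any finite value the loss can digest
y[..., 1] = 0.5       # placeholder censoring indicator, likewise harmless
weights = np.zeros((n_seq, max_len))     # 0.0 = ignore this timestep entirely
# ...overwrite real (tte, u) targets and set weight 1.0 on observed steps...

# model.compile(loss=wtte_loss, sample_weight_mode='temporal', optimizer='adam')
# model.fit(x, y, sample_weight=weights)
```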

Big thanks for being so active here and answering questions. I'm hoping this could replace / complement our current churn models.

ragulpr commented 6 years ago
> 1. why not use some random crap value here also?
> 2. Why do we count the censoring backwards?
> 3. What are some typical metrics for this problem?

If the model is initialized properly, outputting alpha around the mean tte and beta around 1, you can see every step downward as an improvement over the null exponential model. I'm always happy when I reach a plateau, and early stopping there is often numerically sound.
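A sketch of what "initialized properly" can look like in Keras (my own construction, not necessarily how the library's output activation does it; `INIT_ALPHA = 25.0` and the feature count are placeholders): with small random weights the raw outputs start near 0, so the head below begins at alpha ≈ mean tte and beta ≈ 1, i.e. the null exponential baseline that training should then improve on step by step.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

INIT_ALPHA = 25.0   # placeholder: roughly the mean tte of your training set

def weibull_head(x):
    # Raw outputs start near 0 under default weight init, so at step 0:
    #   alpha = INIT_ALPHA * exp(0) = INIT_ALPHA
    #   beta  = softplus(0) / ln(2) = 1
    a = INIT_ALPHA * tf.exp(x[..., 0:1])
    b = tf.nn.softplus(x[..., 1:2]) / np.log(2.0)
    return tf.concat([a, b], axis=-1)

model = keras.Sequential([
    keras.layers.GRU(32, return_sequences=True, input_shape=(None, 5)),
    keras.layers.Dense(2),
    keras.layers.Lambda(weibull_head),   # emits (alpha, beta) per timestep
])
```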

> Have you tried estimating with deeper or wider networks?

I have no hints regarding data size, sorry. I've had good results with very small datasets, but then again my expectations about how tight the predictions would be weren't that high.

lseffer commented 6 years ago

Thank you, all good answers that came to good use when preprocessing my data and training this bad boy. I got mostly NaN losses when I tried a wider network, but a deep and narrow network seems to work quite well and also reduces overfitting, as you said.

I am in the process of evaluating this model, will return if I have something interesting to share.

ragulpr commented 6 years ago

Great to hear! Looking forward to hearing your results. Also, check out #33; almost always there's truth being revealed (i.e., a data problem) when the loss turns NaN!
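For anyone landing here with NaN losses, a few pre-flight data checks that catch the usual culprits (my own checklist, not the contents of #33): non-finite features, non-positive tte hitting a log, or indicator values outside {0, 1} on steps that actually reach the loss.

```python
import numpy as np

def preflight(x, y, weights):
    """Cheap sanity checks before training. y[..., 0] = tte,
    y[..., 1] = censoring indicator, weights = temporal sample weights."""
    m = weights > 0                    # only steps that actually hit the loss
    assert np.isfinite(x).all(), "non-finite feature values"
    assert np.isfinite(y[m]).all(), "non-finite targets on unmasked steps"
    assert (y[m][:, 0] >= 0).all(), "negative tte targets"
    if (y[m][:, 0] == 0).any():
        # fine for the discrete loss, blows up log(t) in the continuous one
        print("warning: zero tte present; prefer the discrete loss")
    assert np.isin(y[m][:, 1], (0.0, 1.0)).all(), "indicator not in {0, 1}"
```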