ragulpr / wtte-rnn

WTTE-RNN: a framework for churn and time-to-event prediction
MIT License

Input Data format #43

Closed: davidgasquez closed this issue 5 years ago

davidgasquez commented 6 years ago

Hey there! We've been dealing with the churn prediction problem using the typical approach. Now we'd love to give this less hacky approach a try!

First, I'd love to share a bit more about what we have. I think our event data is a bit different from what this approach requires, since it isn't made up of evenly spaced events. Each user has their own time series, with events at different (and continuous) times. Looking at the data by time, we see something like the following table.

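| Time | Event Type | User |
|------|------------|------|
| 1 | 1 | A |
| 2 | 3 | A |
| 3 | 1 | B |
| 4 | 4 | B |
| 5 | 3 | A |
| 6 | 1 | C |
| 7 | 1 | D |
| 8 | 5 | B |
| 9 | 5 | D |
| 10 | 4 | A |
| 11 | 3 | C |
| 12 | 5 | C |
| 13 | 0 | B |
| 14 | 2 | D |
| 15 | 4 | C |
| 16 | 4 | C |
| 17 | 0 | D |
| 18 | 3 | A |
| 19 | 2 | A |
| 20 | 2 | A |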

For example, looking at the table and assuming that event type 0 means churn, we know that users B and D churned.

My question then is: What would be the input of the Keras model if we want to use the WTTE-RNN approach?

I'm also a bit confused about the training. Since we're predicting sequences, should we feed the network the entire sequence of churned users at prediction time? Since the sequence will be right-censored at prediction time, I'm not sure what the correct approach is here!

Sorry if these are very basic questions! Also, thanks for writing such a great and comprehensible article!

ragulpr commented 6 years ago

Do you really have a churn event? As I pointed out in my blog, it often makes more sense to predict the time to a good event and hope it isn't predicted to be far away. In any case, WTTE is about predicting time to event, so the event can be anything, even churn if there is one~

If you want to use an RNN, that's a very different thing from your prior approach. You would want to put your features in a [n_users, n_timesteps_per_user_max, n_features] form, as described in the ipynb. If you have a CSV where you can specify the id, datetime, and event-specifier columns plus numeric feature columns, it should just be a matter of changing 2-3 lines of code in the first cell. For example, with the data above, I would binarize the event types and specify one of them as the event column.

You should get something like a [4,19,5]-dimensional feature input and calculate a [4,19,2] target tensor.
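As a rough illustration of that layout, here is a minimal NumPy sketch built from the table in the question. It is not the notebook's code. It assumes event type 0 is the target event, that observation ends at time 20, that each user's sequence starts at their first event, and that padding is marked with NaN in the target; the names `CHURN`, `FEATURE_TYPES`, and `T_END` are made up for this example. With these choices the shapes come out as (4, 20, 5) and (4, 20, 2), so the exact numbers depend on how timesteps are discretized.

```python
import numpy as np

# (time, event_type, user) rows copied from the table in the question.
events = [
    (1, 1, "A"), (2, 3, "A"), (3, 1, "B"), (4, 4, "B"), (5, 3, "A"),
    (6, 1, "C"), (7, 1, "D"), (8, 5, "B"), (9, 5, "D"), (10, 4, "A"),
    (11, 3, "C"), (12, 5, "C"), (13, 0, "B"), (14, 2, "D"), (15, 4, "C"),
    (16, 4, "C"), (17, 0, "D"), (18, 3, "A"), (19, 2, "A"), (20, 2, "A"),
]

CHURN = 0                        # assumption: event type 0 is the target event
FEATURE_TYPES = [1, 2, 3, 4, 5]  # the other types become binary features
T_END = 20                       # assumption: observation stops at time 20

users = sorted({user for _, _, user in events})
first_seen = {u: min(t for t, _, uu in events if uu == u) for u in users}
n_steps = max(T_END - first_seen[u] + 1 for u in users)

# x: [n_users, n_timesteps, n_features]; y: [n_users, n_timesteps, 2]
x = np.zeros((len(users), n_steps, len(FEATURE_TYPES)), dtype=np.float32)
y = np.full((len(users), n_steps, 2), np.nan, dtype=np.float32)  # NaN = padding

for i, user in enumerate(users):
    start = first_seen[user]
    length = T_END - start + 1
    is_event = np.zeros(length, dtype=bool)
    for t, event_type, uu in events:
        if uu != user:
            continue
        step = t - start
        if event_type == CHURN:
            is_event[step] = True
        else:
            x[i, step, FEATURE_TYPES.index(event_type)] = 1.0

    # Walk backwards to get time-to-event and the non-censoring indicator.
    # Steps with no later target event count down to the end of observation
    # and are marked censored (0); one could also truncate at the event.
    next_event = None
    for step in range(length - 1, -1, -1):
        if is_event[step]:
            next_event = step
        if next_event is None:
            y[i, step] = (length - 1 - step, 0.0)
        else:
            y[i, step] = (next_event - step, 1.0)

print(x.shape, y.shape)
```

Here `y[..., 0]` holds the time to event and `y[..., 1]` the non-censoring indicator, so user A (who never churns) is censored at every step.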

You do not need to use an RNN. If you can figure out how to get a time-to-event and a non-censoring indicator as the target (Y), you can use the feature format as in the linked notebooks, i.e. with no temporal dimension.

> I'm also a bit confused about the training. Since we're predicting sequences, should we feed the network the entire sequence of churned users at prediction time? Since the sequence will be right-censored at prediction time, I'm not sure what the correct approach is here!

An RNN needs history, so you would just use the prediction for each sequence at its latest timestep. At prediction time we don't know the target TTE at all, so of course the target is right-censored ;)
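Picking out the latest timestep can be sketched as below. This assumes the model returns one output vector per timestep in a [n_users, n_timesteps, n_outputs] array, that sequences are right-padded, and that the true length of each sequence is known; `last_step_predictions` is a hypothetical helper, not part of the wtte-rnn package.

```python
import numpy as np

def last_step_predictions(pred, seq_lengths):
    """Return the prediction at each sequence's last observed timestep.

    pred:        array of shape [n_users, n_timesteps, n_outputs]
    seq_lengths: number of real (unpadded) timesteps per user
    """
    pred = np.asarray(pred)
    last_idx = np.asarray(seq_lengths) - 1
    # Pair every user index with that user's last valid timestep.
    return pred[np.arange(pred.shape[0]), last_idx]

# Toy stand-in for model output: 2 users, 3 timesteps, 2 outputs per step.
toy_pred = np.arange(12).reshape(2, 3, 2)
latest = last_step_predictions(toy_pred, seq_lengths=[3, 1])
print(latest)
```

The whole observed history is still fed through the network; only the output at the final observed step is read off as the current prediction for that user.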

Good luck and keep questions coming~

davidgasquez commented 5 years ago

Thanks for the reply @ragulpr! Sorry for the delay in getting back. :smile: We'll probably apply this later on in this project.

> Do you really have a churn event? As I pointed out in my blog, it often makes more sense to predict the time to a good event and hope it isn't predicted to be far away.

That makes sense. We do have a lot of events, but we can trim them down!

> You should get something like a [4,19,5]-dimensional feature input and calculate a [4,19,2] target tensor.

I'm still a bit confused about the dimensions of the input. That said, I haven't spent a lot of time thinking about it yet.

Thanks again for the help! I'll keep you posted once we start with this.

ragulpr commented 5 years ago

Just think of [#users, #timesteps, #features] as the dimensions! Keep posting questions here and I'll answer them!