High-level Questions - Githubissues

ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction

MIT License

767 stars 187 forks source link

High-level Questions #42

Closed jmwoloso closed 6 years ago

jmwoloso commented 6 years ago

Hello @ragulpr !

Great work on this, very compelling. I have 2 higher-level questions (to potentially be followed up by other questions once I'm sure I understand the concept).

1) We only use groups with that have the event we're interested in modeling, correct? For instance, if I'm building a churn model, I could use the event I'm modeling to be something like clicking the 'Cancel' button on the web page, so my training set should only include groups that have done that event at least once right, or not?

2) We pick a time frame according to our resolution and within each sequence, we censor the data beyond that time period? So if I'm interested in people that will churn within the next 14 days, I set max_time=14 (assuming my data is in the day resolution) and at each sequence/step, everything > 14 days ahead gets censored (features and target turned to 0) right?

Thanks in advance for the insights. Looking forward to trying this out!

ragulpr commented 6 years ago

Hi there thanks for the questions and great asking them here! I don't think I understand what you mean by groups. To understand the data model I suggest checking out simple_example and data pipeline template in that order. I think it's much easier than you think!

No. You can also train on sequences that has yet to have any event. You want to for example predict time to click on cancel button. If someone has been signed up for 30 days without clicking the cancel button then you have a sequence of censored TTE [30,29,...,1] as target value.
I'm not sure I understand! The point of wtte is that features, time resolution, censoring, predicting window are made to be pretty orthogonal and you can change most of it independently. Censoring is only used for training. Features is only used for prediction (forward propagation). If you want to predict if someone 'churned' within 14 days you could still discretize time resolution to minute level and predict Pr(churn)~Pr(no event within 14 days)=Pr(TTE>14*24*60) or if you actually have something defined as a churn-event Pr(churn)~Pr(event within 14 days)=Pr(TTE<14*24*60). This is after training, and you may have only had 10 days of data to train on.

I'm not sure what you mean by max_time.

jmwoloso commented 6 years ago

Thanks for the reply @ragulpr!

I spent some time working through your tensorflow git log example to get a better feel for the transformations.

You can also train on sequences that has yet to have any event. You want to for example predict time to click on cancel button. If someone has been signed up for 30 days without clicking the cancel button then you have a sequence of censored TTE [30,29,...,1] as target value.

You actually did a great job of interpreting what I meant by "groups" :). In the tensorflow git log example, the event we are interested in is commits, but the nature of that particular problem is that the dataset will ever only contain rows where the event we are modeling occurs. That was part of my initial confusion, but your response to 1. cleared that up for me.

As a follow-up to your response to my second question, censored data is data for which we have not seen an event occur (like the example in your response to 1). In addition, with this WTTE technique, are we also ever intentionally censoring known data during training? That is the essence of my second question.

Regarding max_time, there was a value you assigned to max_time in one of the scripts/notebooks I was working through and I assumed that value was used to artificially censor data so that we don't leak information about the future into the model while we are training it.

EDIT: another thing I meant to ask was what to do when you have an event occur more than once in a single time period resolution (e.g. they click the cancel button two separate times in a single day). Should we: 1) aggregate values 2) keep the records separate or 3) just count it as a single event?

ragulpr commented 6 years ago

Do we intentionally censor data?

I do this personally especially for very long sequences (artificially censor s.t TTE<365). My hunch is that this prevents overfit/artifactual learning of censoring with little loss of information. I haven't published it as it felt like it added complexity to the notebook.

What to do if many events in one interval?

If many events occured in one day there was an event that day. So it can be considered a single event. This is how discretization works. But feature-wise you can do whatever, but having "number of events yesterday" as input feature has made sense to me.

Leaking

Regarding leaking, I don't know about that parameter but it's actually very easy to make a perfectly convincing train/test split. It might be hard to understand the template-notebook, but essentially we make the training set from "all that was known up until a certain date" (and artificially censor at that date) and test set from everything known after that date. This results in a split with observations that can answer a couple of different questions.

Timesteps that are present in train and test but censored in train and uncensored in test. Evaluating on these give us indication on whether censoring point is artificially learned.
Timesteps that are not present in train but present in test. We can evaluate on these freely.
Timesteps that are both uncensored in train and test. We still need the features of these timesteps to do forward prop for the RNN but evaluation here is meaningless.
I made a validation set by random selection of sequences from the testing set. This was done with no particular thought, mainly because evaluation on every epoch is slow. Weights ensure that only timesteps after the end of the training set is measured. (I.e in the "future")

I think the most rigorous form of testing is to actually only evaluate on the day after training set ends, using the time to event from the test set at that date (resulting in one evaluation point per sequence). Compare to the data-pipeline I evaluated AUC on all timesteps. I'll push these changes soon.

_{^{I preemptively answered the split-question for further reference to others :)}}

jmwoloso commented 6 years ago

@ragulpr if I already have data aggregated by day, i can safely set ~discrete_time=False and~ pad_between_steps=False, right?

EDIT: actually, looking through the source code, it seems that discrete_time is used to prevent incomplete interval measurements (i.e. sequences are shortened as necessary to only include the last full interval of data like yesterday's data, for instance). is that an accurate description?

EDIT 2: alright, i think I have a good intuition now for how data_pipeline works. in my case, prior to any transformations, my data was 2 unique ids, each with (the same) 417 days worth of data across 47 features so the following settings were appropriate for my use case since the data was:

aggregated at the day-resolution for each unique id (so I set time_sec_interval=1 since no date_int conversion needs to take place and discrete_time=False) and
days where nothing happened for each unique id had already been interpolated with zeros for the appropriate features (so pad_between_steps=False since we've done this already)
additionally, since I had previously aggregated data at the day-resolution, there was no need to drop the last observation because I already knew it to be complete (so drop_last_timestep=False)

this resulted in a tensor shape of (2, 417, 47) which intuitively seems correct because i want to feed 417 days of data with 47 features for each of my two unique ids into the RNN for training. finally, infer_seq_endtime should be False as well since all of my unique ids have observations for the exact same sequence interval of 417 days.

jmwoloso commented 6 years ago

closing this for now. great overall concept, but there are functions that reference variables from the outer scope instead of accepting them as args and the flow of data processing is hard to follow, reason about and apply to large datasets. i will revisit the concept when i have more time on my hands as i've already spent a great deal of time so far on this.

thanks for your help though!

ragulpr commented 6 years ago

@jmwoloso Sorry for the slow response (vacation+conference)! Thanks for your comments.

I'm sorry to hear it's hard to read, but it clearly sounds like you've understood the details of pipeline just right. I would be very happy to hear your thoughts or PRs. The data-pipeline-template ipynb is written for explicitness and should be correctly wrapped for production.

Regarding your specific questions, it all sounds right after edits. To clarify, discrete_time=True in this case leads to aggregating over lower-resolution timesteps so with time_sec_interval=1 it should not have any effect.

By default aggregation is done by summation, and with drop_last_timestep=False it won't remove the timestep with largest date/unix timestamp.
If discrete_time=False and you want pad_between_steps=True it'll raise an error (gotcha) since this makes no sense.

If you did right in the preprocessing step, I think

discrete_time=1, drop_last_timestep=False, time_sec_interval=1, infer_seq_endtime=False Would have the (same) effect you want, which is basically none other than making your [2*417, 47] dataframe into a [2,417, 47]-tensor