Hi there, thanks for the questions, and great that you're asking them here!
I don't think I understand what you mean by groups. To understand the data model I suggest checking out simple_example and the data pipeline template, in that order. I think it's much easier than you think!

You can also train on sequences that have yet to have any event. You want to, for example, predict time to click on a cancel button. If someone has been signed up for 30 days without clicking the cancel button, then you have a sequence of censored TTE `[30,29,...,1]` as the target value.

`Pr(churn) ~ Pr(no event within 14 days) = Pr(TTE > 14*24*60)`, or, if you actually have something defined as a churn event, `Pr(churn) ~ Pr(event within 14 days) = Pr(TTE < 14*24*60)`. This is after training, and you may have only had 10 days of data to train on.

I'm not sure what you mean by `max_time`.
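To make that last conversion concrete, here is a minimal sketch. It assumes the model outputs Weibull parameters per timestep (as WTTE-RNN does) and uses made-up values; it is not part of the library:

```python
import numpy as np

def weibull_survival(t, alpha, beta):
    """Weibull survival function: Pr(TTE > t) = exp(-(t / alpha) ** beta)."""
    return np.exp(-np.power(t / alpha, beta))

# Hypothetical predicted parameters for one user, in days.
alpha, beta = 20.0, 1.5

# Pr(churn) ~ Pr(no event within 14 days) = Pr(TTE > 14)
print(weibull_survival(14.0, alpha, beta))
```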
Thanks for the reply @ragulpr!
I spent some time working through your tensorflow git log example to get a better feel for the transformations.
> You can also train on sequences that have yet to have any event. You want to, for example, predict time to click on a cancel button. If someone has been signed up for 30 days without clicking the cancel button, then you have a sequence of censored TTE `[30,29,...,1]` as the target value.
You actually did a great job of interpreting what I meant by "groups" :). In the tensorflow git log example, the event we are interested in is commits, but the nature of that particular problem is that the dataset will only ever contain rows where the event we are modeling occurs. That was part of my initial confusion, but your response to 1. cleared that up for me.
As a follow-up to your response to my second question, censored data is data for which we have not seen an event occur (like the example in your response to 1). In addition, with this WTTE technique, are we also ever intentionally censoring known data during training? That is the essence of my second question.
Regarding `max_time`, there was a value you assigned to `max_time` in one of the scripts/notebooks I was working through, and I assumed that value was used to artificially censor data so that we don't leak information about the future into the model while we are training it.
EDIT: another thing I meant to ask was what to do when an event occurs more than once within a single time step at your resolution (e.g. they click the cancel button two separate times in a single day). Should we: 1) aggregate the values, 2) keep the records separate, or 3) just count it as a single event?
> Do we intentionally censor data?
>
> What to do if many events in one interval?
Regarding leaking, I don't know about that parameter, but it's actually very easy to make a perfectly convincing train/test split. It might be hard to understand the template notebook, but essentially we make the training set from "all that was known up until a certain date" (and artificially censor at that date) and the test set from everything known after that date. This results in a split with observations that can answer a couple of different questions.
I think the most rigorous form of testing is to actually only evaluate on the day after the training set ends, using the time to event from the test set at that date (resulting in one evaluation point per sequence). Compare this to the data pipeline, where I evaluated AUC on all timesteps. I'll push these changes soon.
I preemptively answered the split question here for future reference by others :)
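A minimal sketch of that kind of split (the dataframe, column names, and cutoff date below are illustrative, not part of wtte-rnn):

```python
import pandas as pd

# Hypothetical event log: one row per (id, date) observation.
events_df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2016-11-03", "2017-02-10",
                            "2016-12-20", "2017-03-05"]),
})

def temporal_split(df, cutoff, time_col="date"):
    """Train on all that was known up to `cutoff`, test on everything after."""
    train = df[df[time_col] < cutoff].copy()
    test = df[df[time_col] >= cutoff].copy()
    return train, test

# TTE targets in `train` must then be recomputed as right-censored at the
# cutoff, otherwise information about the future leaks into training.
train, test = temporal_split(events_df, pd.Timestamp("2017-01-01"))
print(len(train), len(test))  # 2 2
```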
@ragulpr if I already have data aggregated by day, I can safely set ~~`discrete_time=False` and~~ `pad_between_steps=False`, right?
EDIT: actually, looking through the source code, it seems that `discrete_time` is used to prevent incomplete interval measurements (i.e. sequences are shortened as necessary so they only include the last full interval of data, e.g. through yesterday's data). Is that an accurate description?
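If that reading is right, the idea would look roughly like this (an illustrative sketch, not the actual wtte-rnn source):

```python
import pandas as pd

# The most recent day may still be accumulating events, so keep only
# timestamps strictly before the start of the latest observed day.
events = pd.DataFrame({"ts": pd.to_datetime(
    ["2017-06-01 09:00", "2017-06-02 13:30", "2017-06-03 08:15"])})

start_of_last_day = events["ts"].max().normalize()   # midnight of latest day
complete = events[events["ts"] < start_of_last_day]  # drops the partial day
print(complete)
```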
EDIT 2: alright, I think I have a good intuition now for how `data_pipeline` works. In my case, prior to any transformations, my data was 2 unique ids, each with (the same) 417 days' worth of data across 47 features, so the following settings were appropriate for my use case since the data was:
- aggregated at the day resolution for each unique id (so I set `time_sec_interval=1`, since no date_int conversion needs to take place, and `discrete_time=False`), and
- days where nothing happened for each unique id had already been interpolated with zeros for the appropriate features (so `pad_between_steps=False`, since we've done this already);
- additionally, since I had previously aggregated the data at the day resolution, there was no need to drop the last observation because I already knew it to be complete (so `drop_last_timestep=False`).
This resulted in a tensor shape of `(2, 417, 47)`, which intuitively seems correct because I want to feed 417 days of data with 47 features for each of my two unique ids into the RNN for training. Finally, `infer_seq_endtime` should be `False` as well, since all of my unique ids have observations for the exact same sequence interval of 417 days.
Closing this for now. Great overall concept, but there are functions that reference variables from the outer scope instead of accepting them as args, and the flow of data processing is hard to follow, reason about, and apply to large datasets. I will revisit the concept when I have more time on my hands, as I've already spent a great deal of time on this so far.
Thanks for your help though!
@jmwoloso Sorry for the slow response (vacation+conference)! Thanks for your comments.
I'm sorry to hear it's hard to read, but it clearly sounds like you've understood the details of the pipeline just right. I would be very happy to hear your thoughts or receive PRs. The data-pipeline-template ipynb is written for explicitness and should be wrapped appropriately for production.
Regarding your specific questions, it all sounds right after your edits. To clarify:

- `discrete_time=True` in this case leads to aggregating over lower-resolution timesteps, so with `time_sec_interval=1` it should not have any effect.
- With `drop_last_timestep=False` it won't remove the timestep with the largest date/unix timestamp.
- If `discrete_time=False` and you want `pad_between_steps=True`, it'll raise an error (gotcha), since this makes no sense.

If you did the preprocessing step right, I think

`discrete_time=1, drop_last_timestep=False, time_sec_interval=1, infer_seq_endtime=False`

would have the (same) effect you want, which is basically nothing other than making your `[2*417, 47]` dataframe into a `[2, 417, 47]` tensor.
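A minimal sketch of that reshape, with toy data standing in for the real features (not the pipeline itself):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the [2*417, 47] dataframe: 2 ids, each with the same
# 417 contiguous days of 47 features.
n_ids, n_days, n_features = 2, 417, 47
df = pd.DataFrame(
    np.random.rand(n_ids * n_days, n_features),
    index=pd.MultiIndex.from_product([range(n_ids), range(n_days)],
                                     names=["id", "day"]),
)

# Sorting by (id, day) makes rows contiguous per id, so a plain reshape
# yields the tensor. This only works because every id covers exactly the
# same number of timesteps.
tensor = df.sort_index().to_numpy().reshape(n_ids, n_days, n_features)
print(tensor.shape)  # (2, 417, 47)
```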
Hello @ragulpr!
Great work on this, very compelling. I have 2 higher-level questions (to potentially be followed up by other questions once I'm sure I understand the concept).
1) We only use groups that have the event we're interested in modeling, correct? For instance, if I'm building a churn model, I could define the event I'm modeling as something like clicking the 'Cancel' button on the web page, so my training set should only include groups that have performed that event at least once, right? Or not?
2) We pick a time frame according to our resolution, and within each sequence we censor the data beyond that time period? So if I'm interested in people that will churn within the next 14 days, I set `max_time=14` (assuming my data is at the day resolution), and at each sequence/step everything > 14 days ahead gets censored (features and target turned to 0), right?

Thanks in advance for the insights. Looking forward to trying this out!