ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

Question about preprocessing functions #37

Open adam-haber opened 6 years ago

adam-haber commented 6 years ago

Hi,

I've two questions regarding the preprocessing functions:

ragulpr commented 6 years ago

Hi, great questions. You understood it right: throw away the first timestep. There are alternatives, but I think this was the safest option in general.

From the data pipeline template:

    # 1. Disalign features and targets, otherwise truth is leaked.
    # 2. Drop the first timestep (that we now don't have features for).
    # 3. NaN-mask the last timestep of features (that we now don't have targets for).
    events = events[:,1:,]
    y  = y[:,1:]
    x  = np.roll(x, shift=1, axis=1)[:,1:,]
    x  = x + 0*np.expand_dims(events,-1)
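
To make the effect of these steps concrete, here is a minimal sketch on a toy array. The data, shapes, and the use of NaN in `events` to mark the padded step are assumptions for illustration only; they are not taken from the template.

```python
import numpy as np

# Hypothetical toy data: 1 sequence, 4 timesteps, 1 feature.
x = np.array([[[10.0], [11.0], [12.0], [13.0]]])  # features, shape (1, 4, 1)
y = np.array([[0.0, 1.0, 2.0, 3.0]])              # targets, shape (1, 4)
# Assumption: NaN in `events` marks a step that should be masked out.
events = np.array([[0.0, 1.0, 0.0, np.nan]])      # shape (1, 4)

# Drop the first timestep of events and targets.
events = events[:, 1:]
y = y[:, 1:]

# Shift features one step forward in time, then drop the wrapped-around first step,
# so the features from step t are paired with the target from step t+1.
x = np.roll(x, shift=1, axis=1)[:, 1:]

# Propagate the NaN mask from events to features (0 * nan is nan, 0 * number is 0).
x = x + 0 * np.expand_dims(events, -1)

print(x[0, :, 0])  # features per remaining step
print(y[0])        # targets per remaining step
```

In this toy case the remaining features are `10, 11, nan` and the remaining targets are `1, 2, 3`, so each target is only ever paired with features from the step before it.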

The most thorough explanation can be found here

So TL;DR: in your case (non-recurrent events) it might be safe, but does it make sense for inference? I.e., when does your data arrive?

I guess you want to predict "will there be an event today?" But if we get language, region, signup method etc. at signup at 13.30, this query is going to be tainted by the time of arrival of the data (things like a lower likelihood of an event the later in the day the data arrives). I'm not saying it doesn't make sense, I'm saying it adds things to think about 😄

About question 2: Yes, this sounds correct!