adam-haber opened this issue 6 years ago
Hi, great questions. You understood it right: throw away the first timestep. There are alternatives, but I think this was the most generally safe choice.
From the data pipeline template:
import numpy as np

# 1. Disalign features and targets, otherwise truth is leaked.
# 2. Drop the first timestep (that we now don't have features for).
# 3. NaN-mask the last timestep of features (that we now don't have targets for).
events = events[:, 1:]
y = y[:, 1:]
x = np.roll(x, shift=1, axis=1)[:, 1:]
x = x + 0 * np.expand_dims(events, -1)  # 0*NaN = NaN, so masked steps of events become NaN in x
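To see what this does, here's a minimal toy run (the shapes and values are just illustrative assumptions, not the template's actual data):

import numpy as np

# Toy batch: 1 sequence, 4 timesteps, 1 feature.
x = np.arange(4, dtype=float).reshape(1, 4, 1)  # features observed at t = 0..3
y = np.arange(10., 14.).reshape(1, 4)           # targets for t = 0..3
events = np.array([[0., 1., 0., np.nan]])       # NaN marks a masked step

events = events[:, 1:]
y = y[:, 1:]
x = np.roll(x, shift=1, axis=1)[:, 1:]
x = x + 0 * np.expand_dims(events, -1)

print(x[0, :, 0])  # [ 0.  1. nan]: feature from t-1 now sits at t; NaN propagated from events
print(y[0])        # [11. 12. 13.]: targets for t = 1..3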
The most thorough explanation can be found here. The gist:
When event <-> datapoint, i.e. the birth of a sequence is itself caused by an event, the first timestep always has TTE = 0, so the model will overfit on it. When event -> datapoint but datapoint -/-> event (datapoints can exist without a triggering event), there's real uncertainty about the TTE at the first timestep, so you could probably use it. So TL;DR: in your case (non-recurrent events) it might be safe, but does it make sense for inference? I.e., when does your data arrive?
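A tiny sketch of that argument (the countdown helper below is my own illustration, not part of wtte-rnn): counting down to the next event, a sequence born at an event always starts at TTE = 0, while a sequence that merely starts being observed has an uncertain first step:

import numpy as np

def tte_to_next_event(is_event):
    # Steps until the next event, counting from the current step (event now -> 0).
    # NaN means no future event is observed (censored) in this toy version.
    tte = np.full(len(is_event), np.nan)
    next_event = np.nan
    for t in reversed(range(len(is_event))):
        if is_event[t]:
            next_event = t
        tte[t] = next_event - t
    return tte

# event <-> datapoint: the sequence is born *by* an event, so step 0 is always TTE = 0:
print(tte_to_next_event([1, 0, 0, 1, 0]))  # [ 0.  2.  1.  0. nan]
# event -> datapoint only: the sequence starts without an event, step 0 is uncertain:
print(tte_to_next_event([0, 0, 1, 0, 0]))  # [ 2.  1.  0. nan nan]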
I guess you want to predict "will there be an event today?"
But if at signup at 13:30 we get language, region, signup method etc., this query is going to be tainted by the time of arrival of the data (things like a lower likelihood of an event the later in the day the data arrives). I'm not saying it doesn't make sense, I'm saying it adds things to think about 😄
About question 2: yes, this sounds correct!
Hi,
I have two questions regarding the preprocessing functions:
Regarding prep_tensors - the lines quoted above: they simply throw away the first event, right? Is this a necessity? In my data, a significant portion of the churners churn at the beginning, and I'd be happy to try to predict these as well.
Regarding the nanmask_to_keras_mask function: as far as I understand, the y variable returned by this function is of dimension (n_subjects, t_timesteps, 2), such that y[i] is the matrix whose rows are the different timesteps and whose columns are the time-to-event and the censoring indicator, respectively, for subject i.
. In my data, each subject is either churned or not churned (no recurrent events). This means that for each subject, the second column is either all ones (if it's a churned subject) or all zeros (if it's a censored subject); this, of course, without taking into account the 0.95 mask. Is this the correct input format for training the model?
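If it helps, a hedged toy construction of that format for the non-recurrent case (the countdown convention and the NaN padding after the event are my assumptions, not code from the package):

import numpy as np

n_timesteps = 5

# Subject 0 churns at t = 3: TTE counts down to the event, indicator all ones;
# steps after the (non-recurrent) event are left as NaN to be masked later.
tte_churned = np.array([3., 2., 1., 0., np.nan])
u_churned = np.array([1., 1., 1., 1., np.nan])

# Subject 1 never churns during observation: TTE is the time to the end of the
# observation window, indicator all zeros (censored).
tte_censored = np.array([4., 3., 2., 1., 0.])
u_censored = np.array([0., 0., 0., 0., 0.])

y = np.stack([np.stack([tte_churned, u_churned], axis=-1),
              np.stack([tte_censored, u_censored], axis=-1)])
print(y.shape)  # (2, 5, 2) == (n_subjects, t_timesteps, 2)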