ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

Pre-filtering by number of events #53

Closed JasonTam closed 5 years ago

JasonTam commented 5 years ago

Hi, I'm fairly new to this area, and I just wanted a sanity check: does it make sense to pre-filter a dataset based on the number of events? For example, removing all users with fewer than k events in the observation period.

I can see this making sense with k=1, since we tend to drop the first event for all sequences anyway (https://github.com/ragulpr/wtte-rnn/issues/37#issuecomment-354388046). Of course this might depend on the dataset, and I plan to play around with it. I just wanted to know whether it's common practice to drop records like this, or whether we favor keeping all users so that we can learn user features correlated with single-event churn.
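For concreteness, this is roughly the filter I have in mind (a pandas sketch with toy data; df and k are just my own placeholder names):

import pandas as pd

# Toy event log: one row per event (placeholder data).
df = pd.DataFrame({'id': [1, 1, 2, 3, 3, 3],
                   'date': pd.to_datetime(['2019-01-01'] * 6)})

k = 2  # minimum number of events required to keep a user
counts = df.groupby('id').size()
df_filtered = df[df['id'].isin(counts[counts >= k].index)]
# -> drops user 2, who has only a single event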

Thanks

ragulpr commented 5 years ago

Very good question. Whatever you do, it introduces its own particular bias. I think pre-filtering like this is very common practice, and I don't think people are aware of how much bias it introduces.

To have to think about it as little as possible, start from who you want to predict for, and when. Keeping the training set identical to the prediction dataset (i.e., keeping all users) will save you a lot of headache. That's my general advice.

For example, say you'd like to predict the time to the next event, today, for every user who has ever been active. Then every day (even empty days) should probably be represented in the training dataset.
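In pandas terms, that means expanding the event log into a dense per-user daily grid before training, something like this sketch (toy data; all names are placeholders):

import pandas as pd

# Toy event log (placeholder data).
events = pd.DataFrame({'id': [1, 1, 2],
                       'date': pd.to_datetime(['2019-01-01', '2019-01-03', '2019-01-02'])})

# Count events per user and day, then reindex so that every calendar day
# appears for every user, zero-filled -- empty days included.
counts = events.groupby(['id', 'date']).size().unstack(fill_value=0)
all_days = pd.date_range(events['date'].min(), events['date'].max(), freq='D')
dense = counts.reindex(columns=all_days, fill_value=0)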

If you train on users who have had at least 2 events in the past 60 days, then at t=0 the initial prediction the model should learn to make is that a user has probability 1 of having an event within 60 days, i.e., it'll learn Pr(Y_0 < 60) = 1. As you get closer to the end of the dataset, this holds for smaller lookaheads too, e.g., Pr(Y_30 < 30) = 1 if there were fewer than 2 events in the first 30 days. In other words, you have learned a different query than you intended:

Pr(Y_t < y) = probability of having an event within y days, given that they will have had at least 2 events in 60 - t days.

So through your data munging you're actually conditioning on the future instead of predicting the future :D
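A quick way to convince yourself (a toy simulation, not wtte-rnn code; all numbers made up): simulate event streams, apply the "at least 2 events in 60 days" filter, and check the empirical probabilities.

import numpy as np

rng = np.random.default_rng(0)
n_users, horizon = 10_000, 60

# Toy model: each user emits an event on each day independently w.p. 0.01.
events = rng.random((n_users, horizon)) < 0.01

# Unfiltered: Pr(at least one event within 60 days) is well below 1.
print(events.any(axis=1).mean())

# Filtered as in the example: keep users with >= 2 events in the window.
kept = events[events.sum(axis=1) >= 2]
print(kept.any(axis=1).mean())  # exactly 1.0: Pr(Y_0 < 60) = 1 by construction

# Future-conditioning: among kept users with < 2 events in the first 30 days,
# an event in days 30..60 is guaranteed, so Pr(Y_30 < 30) = 1 too.
late = kept[kept[:, :30].sum(axis=1) < 2]
print(late[:, 30:].any(axis=1).mean())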

If it's impossible to let the dataset represent all sequences since recording started (which is the best option), I think it makes more sense to use a look-back query, something like this:

SELECT
    id,
    DATE(timestamp) AS date,
    COUNT(*) AS n_events,
    ...
FROM
    PAYMENTS
WHERE
    DATE(timestamp) > today - 60  -- pseudocode: events in the last 60 days
GROUP BY
    id, date

The query you would be training for is then

Pr(Y_t < y) = probability of having an event within y days, given that we've seen them in the past 60 - t days

This is fine in the sense that you'll only make predictions after a user's first event, so the filtering of the data doesn't reveal the future. It may raise its own peculiar questions [0], but the problems are much less apparent.
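For reference, the same look-back filter in pandas (a sketch; events is a hypothetical event-level DataFrame with id and timestamp columns):

import pandas as pd

# Hypothetical raw event log: one row per event.
events = pd.DataFrame({'id': [1, 1, 2],
                       'timestamp': pd.to_datetime(['2019-01-05 10:00',
                                                    '2019-01-05 12:00',
                                                    '2019-02-01 09:00'])})

today = events['timestamp'].max().normalize()
window = events[events['timestamp'] > today - pd.Timedelta(days=60)]

# One row per (id, date) with the daily event count, like the SQL above.
daily = (window.assign(date=window['timestamp'].dt.date)
               .groupby(['id', 'date']).size()
               .rename('n_events').reset_index())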

I haven't codified a catch-all solution for the problem of, say, 99% of users only ever showing up once (which causes very high sparsity and a lot of data).

To calm your worries: this isn't a WTTE-specific problem. I think these types of biases are present in many machine learning systems, and they work anyway through blissful ignorance.

[0] E.g., those who had events in the first days of this query are probably the sequences with many events in general. What kinds of biases does this induce? Also, there may be entrants into this dataset who haven't been active for more than 60 days. Does that cause problems?

TL;DR: try not to, but it's complex.

JasonTam commented 5 years ago

Thanks for the detailed response! It was really helpful :)