ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

wtte.pipelines.data_pipeline returns wrong seq_ids #61

Open michigann opened 5 years ago

michigann commented 5 years ago

Hi, I found a problem with the data preprocessing functions.

The problem shows up when we preprocess our data with the library's data_pipeline function and then want to match the model's result for each sequence to its id. So, to the point: the data_pipeline function in the wtte.pipelines module seems to return seq_ids in the wrong order, which breaks the seq_index-to-seq_id mapping. The bug is in the second line of the df_to_array function: unique_ids = list(grouped.groups.keys()). The grouped sequences aren't necessarily in the same order as the ids, so the padded feature array built from them can have a different order than the seq_ids returned from data_pipeline. That's because data_pipeline returns seq_ids in the order they appear in id_col of the passed pandas dataframe, while df_to_array builds the feature sequences in pandas groupby order, which may be different, as in my case. My suggestion for the simplest fix is to change unique_ids = list(grouped.groups.keys()) to unique_ids = df[id_col].unique() in df_to_array.
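Here is a minimal sketch of the mismatch on a toy dataframe. It does not call wtte itself: the column names and the pad_by_id_order helper are made up for illustration and only mimic the idea of building a padded array row by row from a list of ids. The point is that the groupby keys and df[id_col].unique() can list the same ids in a different order, so whichever list is used to build the array must also be the one returned as seq_ids.

```python
import numpy as np
import pandas as pd

# Toy data: ids appear in the order 3, 1, 2 (not sorted).
df = pd.DataFrame({
    "id": [3, 3, 1, 1, 1, 2],
    "t":  [0, 1, 0, 1, 2, 0],
    "x":  [30.0, 31.0, 10.0, 11.0, 12.0, 20.0],
})

grouped = df.groupby("id")

# Order used by the line quoted in the issue.
groupby_ids = list(grouped.groups.keys())

# Order of first appearance in the dataframe (the suggested fix).
appearance_ids = df["id"].unique().tolist()


def pad_by_id_order(df, ids, id_col="id", value_col="x"):
    """Hypothetical helper: one padded row per id, in the order given by `ids`."""
    grouped = df.groupby(id_col)
    max_len = int(grouped.size().max())
    out = np.full((len(ids), max_len), np.nan)
    for row, seq_id in enumerate(ids):
        values = grouped.get_group(seq_id)[value_col].to_numpy()
        out[row, :len(values)] = values
    return out


# Row i of `padded` belongs to appearance_ids[i], so the two stay aligned.
padded = pad_by_id_order(df, appearance_ids)

print("groupby order:   ", groupby_ids)
print("appearance order:", appearance_ids)
print(padded)
```

If the array were built from groupby_ids while appearance_ids was returned as seq_ids (or the other way round), row 0 would be attributed to the wrong sequence whenever the two lists differ.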