ragulpr / wtte-rnn

WTTE-RNN a framework for churn and time to event prediction
MIT License

Transforming data from long format to wtte-rnn format #34

Open adam-haber opened 6 years ago

adam-haber commented 6 years ago

I'm struggling with formatting my data into wtte-rnn compatible format. Since the format I work with is rather common in survival analysis, I thought the question might be relevant for other practitioners, as well.

My data is in the so-called long format (aka per-timepoint format) - that is, I have one line per subject per timepoint. This defines an interval-based survival object: (tstart, tstop, event), and allows me to incorporate time-varying coefficients (e.g., blood pressure), or time dependent events (e.g., application of some drug).

Also, since the event is death, a subject only "churns" once (unlike the purchasing examples), which is also rather common in survival analysis, as far as I understand.

If anyone has a code snippet/general tips on how to do this, it would be much appreciated.

ragulpr commented 6 years ago

Hi Adam, thank you for your question. If I understand correctly, in the format (tstart, tstop, feature) the feature value holds from tstart to tstop (for example, a certain blood pressure during a period of time).

Did you check out https://github.com/ragulpr/wtte-rnn-examples/blob/master/examples/?

Here I made transformations from another format to the correct tensor-format. One way to get your dataframe in the (id,timestamp,feature)-format is to transform

(tstart1,tstop1,bloodpressure1)
(tstart2,tstop2,bloodpressure2)
->
(tstart1,bloodpressure1)
(tstop1,bloodpressure1)
(tstart2,bloodpressure2)
(tstop2,bloodpressure2)

In the current implementation we want to map the feature values to intervals and fill in zeros where we don't have data. As we can assume bloodpressure to be the same between measurements you could carry forward the value.
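A minimal pandas sketch of the carry-forward idea above, for illustration only. It assumes integer month indices and half-open intervals [tstart, tstop); the column names and the `intervals_to_timesteps` helper are hypothetical and not part of wtte-rnn.

```python
import pandas as pd

# Hypothetical long-format data: one row per patient per interval.
intervals = pd.DataFrame({
    "id":            [1, 1, 2],
    "tstart":        [0, 2, 0],
    "tstop":         [2, 5, 3],
    "bloodpressure": [120.0, 135.0, 110.0],
})

def intervals_to_timesteps(df, value_cols):
    """Expand each [tstart, tstop) interval into one row per integer timestep.

    The interval's feature values are carried forward to every timestep
    inside it (assumption: values are constant between measurements).
    """
    rows = []
    for rec in df.to_dict("records"):
        for t in range(int(rec["tstart"]), int(rec["tstop"])):
            row = {"id": rec["id"], "t_elapsed": t}
            row.update({c: rec[c] for c in value_cols})
            rows.append(row)
    return pd.DataFrame(rows)

steps = intervals_to_timesteps(intervals, ["bloodpressure"])
print(steps)
```

The result has one row per (id, t_elapsed) pair, which is the shape of dataframe the padding helpers discussed later in this thread expect.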

I think it should make sense once you start datamunging :) Good luck!

Anyone else got any canonical solution?

adam-haber commented 6 years ago

Still not sure I get it.

Say I have:

  1. N patients.
  2. K_s static features for each of the patients (e.g. sex)
  3. K_d time-varying features for each of the patients (e.g. got/didn't get med. in month X)
  4. T months of data (interval data), during which some of the patients had an event and some did not.

According to this bit of code:

def prep_tensors(x,events):
    # 0. calculate time to event and censoring indicators.
    y  = np.copy(np.concatenate([events,events],-1))
    y[:,:,0] = tr.padded_events_to_tte(np.squeeze(events),discrete_time=True)
    y[:,:,1] = tr.padded_events_to_not_censored(np.squeeze(events),discrete_time=True)

Should my training data tensor be of size N x (K_d+K_s) x T? If that is the case (this seems like the most "complete" representation of the data, taking padding into account), how/where do I put the event/censoring indicators?

ragulpr commented 6 years ago

Your data should consist of two tensors, one for the features (x) and one for the tte and (non)censoring indicators (y).

If you only have data on month-time resolution then your final tensor x should have shape N x T x (K_d+K_s).
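As a rough illustration of that shape, the sketch below builds an N x T x (K_d+K_s) array by repeating the static features along the time axis and concatenating them with the time-varying ones. The sizes and the random values are made up for the example.

```python
import numpy as np

N, T, K_d, K_s = 3, 4, 2, 1  # hypothetical sizes

rng = np.random.default_rng(0)
x_dynamic = rng.normal(size=(N, T, K_d))  # time-varying features, one value per month
x_static = rng.normal(size=(N, K_s))      # static features, one value per patient

# Repeat each static feature at every timestep: N x K_s -> N x T x K_s
x_static_tiled = np.repeat(x_static[:, np.newaxis, :], T, axis=1)

# Concatenate along the last (feature) axis: N x T x (K_d + K_s)
x = np.concatenate([x_dynamic, x_static_tiled], axis=-1)
print(x.shape)
```

With this layout, x[i, t, k] is feature k of patient i at month t, and the static columns are constant along t.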

To get y I recommend creating a temporary representation of the events and import functions (as in the snippet you sent) that works on it to get tte and non-censoring indicators.

Check below; This function will take a dataframe of a specified format and extract two columns from it and place it on a timestep specified by the t_elapsed-column assumed to be in the dataframe.

import wtte.transforms as tr
x = tr.df_to_padded(df=df,column_names=['feature1', 'feature2'],t_col='t_elapsed')

What you probably want to do is to add an 'event' column which has 1s on timesteps where the patient dies. If they haven't died yet, just leave it zero. Then apply events = tr.df_to_padded(df=df,column_names=['event'],t_col='t_elapsed') to get it into tensor format.
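To make the step from an event column to the targets concrete, here is a plain-NumPy sketch for a single sequence in discrete time. It is not the library's implementation: the `tte_and_not_censored` helper is hypothetical, and the convention used (tte is 0 on the event step; with no later event, tte counts the steps to the end of the observed window and is flagged as censored) is an assumption that may differ from what wtte.transforms does.

```python
import numpy as np

def tte_and_not_censored(events):
    """Time to event and non-censoring indicator for one 0/1 event sequence.

    events: 1-D array with one entry per discrete timestep.
    Returns (tte, not_censored), both with the same length as events.
    """
    n = len(events)
    tte = np.zeros(n, dtype=int)
    not_censored = np.zeros(n, dtype=int)
    t_next = n            # no event seen yet: count to just past the window
    seen_event = False
    for t in reversed(range(n)):   # walk backwards from the last timestep
        if events[t]:
            t_next = t
            seen_event = True
        tte[t] = t_next - t
        not_censored[t] = int(seen_event)
    return tte, not_censored

died = np.array([0, 0, 0, 1])    # patient dies in the last observed month
alive = np.array([0, 0, 0, 0])   # patient still alive at end of follow-up

tte_d, nc_d = tte_and_not_censored(died)
tte_a, nc_a = tte_and_not_censored(alive)

# Stack into a T x 2 target: column 0 is tte, column 1 is the indicator.
y_died = np.stack([tte_d, nc_d], axis=-1)
print(y_died)
```

Since a patient dies at most once, every timestep of a patient who died is uncensored and every timestep of a survivor is censored.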

I recommend reading both of the data-pipeline examples in detail, or the tests, to get a feel for the steps needed. There are honestly a lot of pitfalls when working with this type of data.

There's also the documentation. Did this help at all?

adam-haber commented 6 years ago

It does! I'm slowly making progress... I think. :-) I've used x = tr.df_to_padded(df=df,column_names=['feature1', 'feature2'],t_col='t_elapsed') and indeed got an N x T x (K_d+K_s) tensor; two things that I still don't understand:

  1. I expected that, for example, x[0,1,2] would correspond to the value of the 3rd feature of the 1st patient in time T=2 (up to 0/1 conventions) - however, this wasn't the case.
  2. The padding was with nan - is that OK?

Same goes for events - I did events = tr.df_to_padded(df=df,column_names=['event'],t_col='t_elapsed') and got an N x T x 1 tensor; I expected events[0,1,0] to be the event indicator of the 1st patient in the 2nd time interval, but it wasn't. Moreover, some events[i,:,:] had an indicator even though patient i didn't die.

ragulpr commented 6 years ago

  1. That sounds like what it should do, yes. What's the output instead?

  2. Yes, I add nan-padding initially, which is great for preprocessing/plotting purposes, but you need to change it (i.e. replace it with some mask value) when feeding into the network. It's handled in the data-pipeline examples.

  3. It'll probably sort by the sequence_index (i.e. patient i) you specify to be the column pointing out the 'id' (by default "id"), and that's the confusion. The call to df_to_padded is actually:

df_to_padded(df, column_names, id_col='id', t_col='t')

See https://github.com/ragulpr/wtte-rnn/blob/master/python/wtte/transforms.py#L100

So there's probably something wrong with what you assume to be the ordering over the batch-index :)
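The two points above (batch ordering and nan-padding) can be sketched without the library. The `pad_by_id` helper below is a hypothetical stand-in, not df_to_padded itself; it assumes sequences are ordered by sorted id, as suggested above, and the mask value is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical data: patient 7 appears first in the dataframe, patient 2 second.
df = pd.DataFrame({
    "id":        [7, 7, 7, 2, 2],
    "t_elapsed": [0, 1, 2, 0, 1],
    "event":     [0, 0, 1, 0, 0],
})

def pad_by_id(df, column_names, id_col="id", t_col="t"):
    """Build a nan-padded N x T x K array, one row per sorted unique id."""
    ids = np.sort(df[id_col].unique())   # assumption: batch index follows sorted ids
    n_steps = int(df[t_col].max()) + 1
    out = np.full((len(ids), n_steps, len(column_names)), np.nan)
    for i, seq_id in enumerate(ids):
        seq = df[df[id_col] == seq_id]
        out[i, seq[t_col].to_numpy(), :] = seq[column_names].to_numpy()
    return ids, out

ids, events = pad_by_id(df, ["event"], id_col="id", t_col="t_elapsed")

# Look rows up by id rather than by order of appearance in the dataframe.
row_of = {seq_id: i for i, seq_id in enumerate(ids.tolist())}
print(events[row_of[7], :, 0])   # patient 7 sits at batch index 1, not 0

# Replace the nan-padding with a mask value before feeding a network.
MASK_VALUE = -1.0                # hypothetical placeholder value
events_masked = np.where(np.isnan(events), MASK_VALUE, events)
```

Keeping the returned ids next to the tensor avoids assuming that events[0] is the first patient in the dataframe.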