jdb78 / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Multi Participant Forecasting #503

Open ik362 opened 3 years ago

ik362 commented 3 years ago

Hi there,

I had a similar question to #490 regarding how to code the group_ids for my analysis.

I am analysing time series in the context of medical data i.e. I have many time series rather than one long time series.

My time series come from two cohorts (patients and controls) and in my dataframe I have:

I wanted to know whether the 'subj_id' column should be part of the 'group_ids' parameter or another parameter.

Thanks!

jdb78 commented 3 years ago

Are your patients switching in and out of the control group? If not, using the patient as the group_id should suffice. The group_ids identify a time series, and the time_idx identifies each data point within a given time series. I'm not sure what subj_1 is and how it differs from the patient.
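To make the roles of group_ids and time_idx concrete, here is a minimal pure-Python sketch on made-up rows. The column names subj, group, time_idx, and source2 follow the thread; the values and the helper split_into_series are hypothetical and are not part of pytorch-forecasting.

```python
from collections import defaultdict

# Made-up long-format rows: one row per (subject, time step).
rows = [
    {"subj": "s1", "group": "patient", "time_idx": 0, "source2": 0.1},
    {"subj": "s1", "group": "patient", "time_idx": 1, "source2": 0.3},
    {"subj": "s1", "group": "patient", "time_idx": 2, "source2": 0.2},
    {"subj": "s2", "group": "control", "time_idx": 0, "source2": 0.5},
    {"subj": "s2", "group": "control", "time_idx": 1, "source2": 0.4},
]


def split_into_series(rows, group_ids):
    """Hypothetical helper: collect the time_idx values of each series,
    where a series is every row sharing the same values in group_ids."""
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_ids)
        series[key].append(row["time_idx"])
    return {key: sorted(idx) for key, idx in series.items()}


# Keying on "subj" alone yields one series per subject.
series = split_into_series(rows, group_ids=["subj"])

for key, idx in series.items():
    # Within one series, time_idx normally increases by one per step.
    is_consecutive = idx == list(range(idx[0], idx[0] + len(idx)))
    print(key, idx, "consecutive" if is_consecutive else "has gaps")
```

Each subject restarts its own time_idx, so the same time_idx value can appear in several series without ambiguity as long as the group_ids differ.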

ik362 commented 3 years ago

@jdb78 thanks for your help!

My ultimate question is along the lines of: can the time series in the control group be better predicted than the time series of the patients?

> Are your patients switching in and out of the control group?

In this sense the patient/control labels are static. In a statistical sense, they are independent samples, not repeated measures.

> Not sure what subj_1 are and how it is different from the patient

At the moment I have coded the data similar to this:

[Screenshot of the dataframe; image not available]

So at the moment the subj column is only used as a dummy variable to identify which time_idx corresponds to which subj. I guess my question is: do I need to "double group" time series into subject-level and group-level? Or is the subject-level implied by the time_idx?

jdb78 commented 3 years ago

You can use both levels, group and subj, as group_ids, but subj alone will do the job. You might also want to consider using group as a static categorical variable.
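A small sketch of why subj alone is enough when the patient/control label never changes for a subject: keying on subj or on (subj, group) then yields the same set of series. The rows and the helper count_series are made up for illustration.

```python
def count_series(rows, group_ids):
    """Hypothetical helper: number of distinct series when rows are keyed by group_ids."""
    return len({tuple(row[col] for col in group_ids) for row in rows})


# Made-up rows; every subject keeps a single group label.
rows = [
    {"subj": "s1", "group": "patient"},
    {"subj": "s1", "group": "patient"},
    {"subj": "s2", "group": "control"},
    {"subj": "s3", "group": "control"},
]

n_by_subj = count_series(rows, ["subj"])
n_by_both = count_series(rows, ["subj", "group"])
print(n_by_subj, n_by_both)

# Counterexample: if s1 also appeared as a control, the two keys would disagree.
switching_rows = rows + [{"subj": "s1", "group": "control"}]
n_switch_subj = count_series(switching_rows, ["subj"])
n_switch_both = count_series(switching_rows, ["subj", "group"])
print(n_switch_subj, n_switch_both)
```

If the two counts match on the real dataframe, adding group to group_ids does not change which rows form a series.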

Does this help?

ik362 commented 3 years ago

Hi Jan,

Thanks for getting back to me! I have set up my TimeSeriesDataSet like this:

    from pytorch_forecasting import TimeSeriesDataSet

    training = TimeSeriesDataSet(
        df,
        group_ids=['subj'],  # one time series per subject
        target='source2',
        time_idx='time_idx',
        max_encoder_length=20,
        max_prediction_length=20,
        time_varying_known_reals=['time_idx'],
        time_varying_unknown_reals=['source1', 'source2', 'source3', 'source4', 'source5'],
        static_categoricals=['group'],  # patient/control label, constant per subject
    )

Is this set up correct?

Also, I wanted to ask how best to compare the groups. Would it make sense to use:

    predictions, x = best_tft.predict(val_dataloader, return_x=True)
    predictions_vs_actuals = best_tft.calculate_prediction_actual_by_variable(x, predictions)

and then calculate which group had smaller differences between actuals and predictions?
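One way to frame "smaller differences" is a per-subject mean absolute error that is then summarised per group. The sketch below is pure Python on made-up numbers and works from raw per-subject actuals and predictions rather than from any particular library output; the results dictionary and the helper mean_absolute_error are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-subject results: group label, actuals, and predictions
# over the prediction window.
results = {
    "s1": ("patient", [1.0, 2.0, 3.0], [1.5, 2.5, 2.0]),
    "s2": ("patient", [0.0, 1.0], [1.0, 1.0]),
    "s3": ("control", [2.0, 2.0], [2.0, 2.5]),
}


def mean_absolute_error(actuals, predictions):
    """Average absolute difference between paired actuals and predictions."""
    return mean(abs(a - p) for a, p in zip(actuals, predictions))


# One error value per subject, collected under that subject's group.
mae_by_group = defaultdict(list)
for subj, (group, actuals, predictions) in results.items():
    mae_by_group[group].append(mean_absolute_error(actuals, predictions))

for group, maes in mae_by_group.items():
    print(group, "mean per-subject MAE:", mean(maes))
```

Keeping one error value per subject, instead of pooling all time steps, preserves the subject as the unit of analysis for any later group-level test.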

Thanks for your help!

ik362 commented 3 years ago

Hi Jan,

Just to add a little more to my previous post: after running tft and generating predictions I get this figure.

[Screenshot of the prediction figure; image not available]

Does it make sense to calculate the difference between actual and prediction for each subject and then do something like a Mann-Whitney U test to find group-level differences?
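Since the two cohorts are independent samples, an unpaired rank test on per-subject errors is one reasonable option. The sketch below implements the Mann-Whitney U statistic in plain Python with a normal-approximation p-value and no tie or continuity correction, so it is only a rough illustration; for a real analysis, scipy.stats.mannwhitneyu is the safer choice. The error values are made up.

```python
import math


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p-value).

    U counts the pairs (a, b) with a from x and b from y where a > b,
    counting ties as one half. The p-value uses the normal approximation
    without tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p


# Made-up per-subject mean absolute errors for each cohort.
patient_mae = [0.9, 1.1, 1.4, 0.8, 1.2]
control_mae = [0.5, 0.7, 0.6, 0.4, 0.75]

u, p = mann_whitney_u(patient_mae, control_mae)
print(f"U = {u}, approximate two-sided p = {p:.4f}")
```

With samples this small the normal approximation is crude; an exact test would be preferable in practice.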

Also, as a sub-question: is there a reason why some subjects don't have a prediction?

Thanks, Isaac

jdb78 commented 3 years ago

Maybe not all subjects are in the validation set? I wonder if you want to include a variable to distinguish the two groups.
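One way to check this guess is to count the rows available per subject. The sketch below assumes, as a possible cause and not a confirmed diagnosis, that a series needs at least max_encoder_length + max_prediction_length rows to produce a sample, which would be 40 with the settings posted earlier in the thread if the minimum lengths are left at their defaults. The rows are made up.

```python
from collections import Counter

# Settings taken from the TimeSeriesDataSet call earlier in the thread.
MAX_ENCODER_LENGTH = 20
MAX_PREDICTION_LENGTH = 20

# Assumption: a series shorter than this yields no sample.
required_length = MAX_ENCODER_LENGTH + MAX_PREDICTION_LENGTH

# Made-up (subj, time_idx) rows with series of different lengths.
rows = (
    [("s1", t) for t in range(60)]
    + [("s2", t) for t in range(25)]
    + [("s3", t) for t in range(40)]
)

# Number of rows per subject.
lengths = Counter(subj for subj, _ in rows)

# Subjects whose series may be too short to appear in the predictions.
too_short = sorted(subj for subj, n in lengths.items() if n < required_length)
print("required length:", required_length)
print("possibly dropped subjects:", too_short)
```

If the subjects missing from the figure match the short series, reducing the encoder or prediction length, or setting a smaller minimum encoder length, would be worth trying.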

ik362 commented 3 years ago

Hi Jan,

Thanks for getting back to me:

> Maybe not all subjects are in the validation set?

I used the standard procedure (from the tutorials) to define the datasets with the code:

    validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)
    batch_size = 16
    train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=28)
    val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size, num_workers=28)

> I wonder if you want to include a variable to distinguish the two groups.

Do you mean to set group_ids = ['subj', 'group']?

Thanks, Isaac