vanderschaarlab / synthcity

A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
https://www.vanderschaar-lab.com/
Apache License 2.0
451 stars, 61 forks

Input format for time series data #127

Closed: VHolstein closed this issue 1 year ago

VHolstein commented 1 year ago

Question

Which input format is required for time series data?

Further Information

Dear SynthCity developers, I really like your work and wanted to test the package on my own time series dataset. I have a dataset of phone data from passive sensing, sampled daily, with some days missing for some individuals. The number of days of collected data varies between individuals. To familiarize myself with the required input format, I went through the PBC dataset.

loader = TimeSeriesDataLoader(temporal_data=temporal,
                              observation_times=temporal_horizons,
                              outcome=outcome_surv,
                              static_data=static_surv)

As far as I understand, temporal_data is a list of dataframes of variable length containing variables of interest and time as an index column. The observation_times is a list of lists with the timestamps for each observation in a list. outcome is a tuple with two series of outcomes, and static_data is just a dataframe.

If I understand correctly, I'd have to split the temporal features into one dataframe per individual, make the timestamps the index, and put these dataframes in a list. Then I'd generate a list of timestamps for each dataframe, add it to the list of observation times, and select the outcome and static features in the same order as the two lists. Before I mess up the analysis, is there anything I'm missing here?
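A minimal sketch of the steps described above, using pandas only. The long-format table, the column names (subject_id, day, steps, screen_time, age, label), and the values are all hypothetical and only illustrate the reshaping; the loader call itself is left as a comment because it is taken from the snippet above, not verified here.

```python
import pandas as pd

# Hypothetical long-format sensing table: one row per subject per observed day.
# Missing days are simply absent, so subjects have different numbers of rows.
long_df = pd.DataFrame({
    "subject_id": ["a", "a", "a", "b", "b"],
    "day": [0, 1, 3, 0, 2],
    "steps": [4200, 3900, 5100, 800, 1200],
    "screen_time": [3.5, 4.0, 2.5, 6.0, 5.5],
})
# Hypothetical per-subject static features and outcomes.
static_df = pd.DataFrame({"subject_id": ["b", "a"], "age": [51, 34]}).set_index("subject_id")
outcome_df = pd.DataFrame({"subject_id": ["b", "a"], "label": [1, 0]}).set_index("subject_id")

temporal_data, observation_times, subject_order = [], [], []
for subject_id, group in long_df.groupby("subject_id", sort=True):
    group = group.sort_values("day")
    # One list of timestamps per subject, in the same row order as the dataframe.
    observation_times.append(group["day"].tolist())
    # One dataframe per subject, with the timestamp as index.
    temporal_data.append(group.drop(columns=["subject_id"]).set_index("day"))
    subject_order.append(subject_id)

# Reorder static features and outcomes so row i belongs to temporal_data[i].
static_data = static_df.loc[subject_order].reset_index(drop=True)
outcome = outcome_df.loc[subject_order].reset_index(drop=True)

# These four objects would then be passed to the loader as in the snippet above:
# TimeSeriesDataLoader(temporal_data=..., observation_times=...,
#                      outcome=..., static_data=...)
```

Tracking subject_order explicitly keeps the static features and outcomes aligned with the two lists even when the source tables are sorted differently.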

If this works out I'd be willing to write a short tutorial on this - could help other labs import their own data.

bcebere commented 1 year ago

Hello @VHolstein

Thank you for your feedback! Indeed, the docs for the time series could be improved to clarify your points.

There are two types of dataloaders for time series:

  1. TimeSeriesDataLoader

    • temporal_data is a list of dataframes, one per subject. Each dataframe contains a set of observations/measurements. The index of the dataframes can be anything.
    • observation_times is a list of arrays that maps directly to the index of each dataframe in temporal_data. It records when each measurement was taken.
    • static_data is a DataFrame of static features for each subject, like gender, city, etc.
    • outcome is a dataframe that can hold anything: labels, regression outcomes, forecasting targets, etc.
    • temporal_data, observation_times, static_data, and outcome must have the same length.
  2. TimeSeriesSurvivalDataLoader

    • temporal_data is the same as for TimeSeriesDataLoader.
    • observation_times is the same as for TimeSeriesDataLoader.
    • static_data is the same as for TimeSeriesDataLoader.
    • outcome is a tuple of the form (T, E), where T is the time-to-event and E is the censored/event indicator. This is mainly because survival problems require special models for performance evaluation.
    • temporal_data, observation_times, static_data, and outcome must have the same length.
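The same-length requirement above can be checked before building a loader. The helper below is a hypothetical sketch, not part of synthcity. It assumes the generic case where outcome is a DataFrame with one row per subject, and it additionally assumes one timestamp per row of each subject's dataframe.

```python
import pandas as pd

def check_time_series_inputs(temporal_data, observation_times, static_data, outcome):
    """Hypothetical pre-flight check; returns the number of subjects or raises ValueError."""
    n_subjects = len(temporal_data)
    lengths = {
        "observation_times": len(observation_times),
        "static_data": len(static_data),
        "outcome": len(outcome),
    }
    for name, length in lengths.items():
        if length != n_subjects:
            raise ValueError(
                f"{name} has {length} entries, but temporal_data has {n_subjects}"
            )
    # Assumption: each subject has exactly one timestamp per observation row.
    for i, (frame, times) in enumerate(zip(temporal_data, observation_times)):
        if len(frame) != len(times):
            raise ValueError(
                f"subject {i}: {len(frame)} rows but {len(times)} observation times"
            )
    return n_subjects
```

Running a check like this first tends to surface ordering or filtering mistakes earlier than an error raised deep inside model training.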

In generic use cases, TimeSeriesDataLoader should work fine, and the downstream evaluation can be for classification/regression tasks. With TimeSeriesSurvivalDataLoader, the performance evaluation will be done using survival models.
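For the survival case, a rough sketch of assembling the (T, E) tuple with pandas. The table, the column names, and the encoding of E (1 = event observed, 0 = censored) are assumptions made for illustration; the thread only says that E is the censored/event indicator.

```python
import pandas as pd

# Hypothetical follow-up table, one row per subject, in arbitrary order.
survival_df = pd.DataFrame({
    "subject_id": ["c", "a", "b"],
    "follow_up_days": [120, 45, 300],
    "event_observed": [False, True, True],
})

# Same subject order as the temporal_data / observation_times lists.
subject_order = ["a", "b", "c"]
aligned = survival_df.set_index("subject_id").loc[subject_order]

# T: time-to-event, E: event indicator (assumed 1 = event, 0 = censored).
T = aligned["follow_up_days"].reset_index(drop=True)
E = aligned["event_observed"].astype(int).reset_index(drop=True)
outcome_surv = (T, E)
```

The resulting tuple plays the role of outcome_surv in the snippet from the original question.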

Hopefully, this clarifies a bit. We will try to improve the docs here.

Any contribution or tutorial would be greatly appreciated. Thank you!

VHolstein commented 1 year ago

Thanks for the swift and clear reply! I just added a small tutorial and some documentation; I hope this helps the project a little. Feel free to modify the notebook however you want. If you think it's useful, you could add it to the featured tutorials in the README.

bcebere commented 1 year ago

Thank you for your contribution, @VHolstein !