Closed (VHolstein closed this 1 year ago)
Hello @VHolstein
Thank you for your feedback! Indeed, the docs for the time series could be improved to clarify your points.
There are two types of dataloaders for time series:

- `TimeSeriesDataLoader`
  - `temporal_data` is a list of dataframes, one per subject. Each dataframe contains a set of observations/measurements. The index of the dataframes can be anything.
  - `observation_times` is a list of arrays that maps directly to the index of each dataframe in `temporal_data`. It records when each measurement was taken.
  - `static_data` is a DataFrame of static features for each subject, like gender, city, etc.
  - `outcome` is a dataframe that can hold anything: labels, regression outcomes, forecasting targets, etc.
  - `temporal_data`, `observation_times`, `static_data`, and `outcome` must have the same length.
- `TimeSeriesSurvivalDataLoader`
  - `temporal_data` is the same as in `TimeSeriesDataLoader`.
  - `observation_times` is the same as in `TimeSeriesDataLoader`.
  - `static_data` is the same as in `TimeSeriesDataLoader`.
  - `outcome` is a tuple of the form (T, E), where T is the time-to-event and E is the censored/event indicator. This is mainly because survival problems require special models for performance evaluation.
  - `temporal_data`, `observation_times`, `static_data`, and `outcome` must have the same length.

In the generic use cases, `TimeSeriesDataLoader` should work fine, and the downstream evaluation can be for classification/regression tasks. With `TimeSeriesSurvivalDataLoader`, the performance evaluation will be done using survival models.
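To make the layout above concrete, here is a minimal sketch that builds the four inputs for three made-up subjects with a different number of observations each. The column names (`steps`, `screen_hours`, `age`, `sex`, `label`) and all values are hypothetical, and the commented-out constructor call at the end assumes the import path and keyword names of a recent synthcity release, so please check it against your installed version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 subjects with 4, 2 and 3 observations.
n_obs = [4, 2, 3]
rng = np.random.default_rng(0)

temporal_data = []      # one DataFrame per subject
observation_times = []  # one array of time points per subject

for n in n_obs:
    times = np.arange(n, dtype=float)  # e.g. days since enrolment
    frame = pd.DataFrame(
        {
            "steps": rng.integers(0, 10_000, size=n),
            "screen_hours": rng.uniform(0, 8, size=n),
        },
        index=times,  # the index mirrors the observation times
    )
    temporal_data.append(frame)
    observation_times.append(times)

# One row per subject, in the same order as the two lists above.
static_data = pd.DataFrame({"age": [34, 51, 27], "sex": [0, 1, 1]})

# Generic outcome (labels here); also one row per subject.
outcome = pd.DataFrame({"label": [0, 1, 0]})

# Survival variant: the outcome is a (T, E) tuple instead.
T = pd.Series([120.0, 45.0, 300.0], name="time_to_event")
E = pd.Series([1, 0, 1], name="event")  # 1 = event observed, 0 = censored
survival_outcome = (T, E)

# Sanity checks: everything is aligned per subject.
n_subjects = len(temporal_data)
assert len(observation_times) == n_subjects
assert len(static_data) == n_subjects
assert len(outcome) == n_subjects
assert len(T) == n_subjects and len(E) == n_subjects
for frame, times in zip(temporal_data, observation_times):
    assert len(frame) == len(times)

# Assumed usage (not executed here; verify against your synthcity version):
# from synthcity.plugins.core.dataloader import TimeSeriesDataLoader
# loader = TimeSeriesDataLoader(
#     temporal_data=temporal_data,
#     observation_times=observation_times,
#     static_data=static_data,
#     outcome=outcome,
# )
```

The per-subject checks at the end are cheap to run on real data before handing it to a loader, since a mismatch in ordering or length is the easiest mistake to make here.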
Hopefully, this clarifies a bit. We will try to improve the docs here.
Any contribution or tutorial would be greatly appreciated. Thank you!
Thanks for the swift and clear reply! I just added a small tutorial and some documentation. I hope this helps the project a little. Feel free to modify the notebook however you want. If you think it's useful, you could add it to the featured tutorials in the readme.
Thank you for your contribution, @VHolstein !
Question
Which input format is required for time series data?
Further Information
Dear SynthCity developers, I really like your work and wanted to test out the package on my own time series dataset. I have a dataset of passively sensed phone data, sampled daily, with some days missing for some individuals. The number of days of collected data varies between individuals. To familiarize myself with the required input format, I went through the PBC dataset.
As far as I understand, `temporal_data` is a list of dataframes of variable length containing the variables of interest, with time as the index column. `observation_times` is a list of lists, each holding the timestamps of one subject's observations. `outcome` is a tuple with two series of outcomes, and `static_data` is just a dataframe.

If I understand correctly, I'd have to split the temporal features into multiple dataframes, make the timestamps the index, and put these in a list. Then I'd generate lists of the timestamps for each dataframe, add them to the list of observation times, and select a list of outcome and static features with the same ordering as the two lists. Before I mess up the analysis, is there anything I'm missing here?
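The splitting steps described above could be sketched roughly as follows with pandas. This assumes the raw data is a single long-format table with one row per subject per recorded day. The table layout, the column names (`subject_id`, `day`, `steps`, `screen_hours`, `age`, `label`) and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical long-format sensing data; missing days are simply absent rows.
long_df = pd.DataFrame(
    {
        "subject_id": ["b", "a", "a", "c", "a", "b"],
        "day": [2, 0, 3, 5, 1, 0],
        "steps": [4100, 5200, 800, 9900, 6100, 3000],
        "screen_hours": [2.5, 3.0, 6.5, 1.0, 4.2, 5.1],
    }
)

# Hypothetical static features and labels, deliberately in a different order.
static_df = pd.DataFrame(
    {"subject_id": ["c", "a", "b"], "age": [27, 34, 51]}
).set_index("subject_id")
outcome_df = pd.DataFrame(
    {"subject_id": ["c", "a", "b"], "label": [0, 0, 1]}
).set_index("subject_id")

# Fix one subject ordering and use it for every input.
subject_ids = sorted(long_df["subject_id"].unique())

temporal_data = []
observation_times = []
for sid in subject_ids:
    frame = (
        long_df[long_df["subject_id"] == sid]
        .sort_values("day")            # chronological order within a subject
        .set_index("day")              # timestamps become the index
        .drop(columns="subject_id")    # keep only the measured variables
    )
    temporal_data.append(frame)
    observation_times.append(frame.index.tolist())

# Reorder static features and outcomes to match the same subject ordering.
static_data = static_df.loc[subject_ids].reset_index(drop=True)
outcome = outcome_df.loc[subject_ids].reset_index(drop=True)
```

Deriving `subject_ids` once and reusing it for the temporal, static, and outcome tables keeps all four inputs aligned, which is the part that is easiest to get wrong when the source tables are sorted differently.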
If this works out I'd be willing to write a short tutorial on this - could help other labs import their own data.