sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Force timeseries:PAR to use specified times #684

Open nhenscheid opened 2 years ago

nhenscheid commented 2 years ago

Problem Description

When the PAR model generates new synthetic data, it generates timestamps/observation times randomly according to some (learned) distribution. It would be ideal to be able to specify an exact set of timepoints, or at least a starting and ending timepoint (for instance, that all times start at T=0, 1/1/2022, etc.)

Expected behavior

Given a PAR model, PAR.sample() should allow for a "time" input such as time = [0,1], time = range(10) (specifying a time interval or specific time points)
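Until such an option exists, one workaround is to sample as usual and then overwrite the time values after the fact. A minimal sketch, assuming a sampled sequence is available as a list of row dicts with a "time" key (the `force_times` helper and the row layout are hypothetical, not part of the SDV API):

```python
def force_times(rows, times):
    """Replace each row's sampled time with the corresponding entry in `times`."""
    if len(rows) != len(times):
        raise ValueError("schedule length must match sequence length")
    return [{**row, "time": t} for row, t in zip(rows, times)]

# One sampled sequence with irregular, learned timestamps.
sequence = [{"time": 7.3, "value": 0.9}, {"time": 1.1, "value": 1.4}]

# Force the requested schedule, e.g. time = [0, 1].
print(force_times(sequence, [0, 1]))
```

This only controls the timestamps, not how the other columns depend on them, which is why native support in `sample()` would still be preferable.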


npatki commented 2 years ago

Thanks for filing @nhenscheid. We'll keep this issue open and update it whenever we work on it.

It would help us prioritize if you could share a bit more about your use case. Are you working on a specific project that requires this feature? How do you plan to use the synthetic data?

nhenscheid commented 2 years ago

@npatki The use case is creating synthetic patient records for clinical trial simulations. Clinical trials typically have very specific follow-up schedules, perhaps plus-or-minus a few days. So each patient should have records at baseline/screening, zero weeks, N weeks (plus or minus a couple days), 2N weeks (plus or minus), etc.

The training data can be heterogeneous, coming from multiple trials with different follow-up schedules, so the learned timepoint distribution isn't very useful.
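The follow-up schedule described above (baseline, then visits every N weeks, each shifted by a few days) can be sketched in plain Python. The `follow_up_schedule` helper is hypothetical, just to make the structure concrete:

```python
import random
from datetime import date, timedelta

def follow_up_schedule(baseline, interval_weeks, n_visits, jitter_days=3, seed=0):
    """Visits at baseline, then every `interval_weeks`, each +/- a few days."""
    rng = random.Random(seed)
    visits = [baseline]
    for k in range(1, n_visits):
        offset = timedelta(weeks=k * interval_weeks,
                           days=rng.randint(-jitter_days, jitter_days))
        visits.append(baseline + offset)
    return visits

# Baseline 1/1/2022, visits every 6 weeks, plus or minus up to 3 days.
print(follow_up_schedule(date(2022, 1, 1), interval_weeks=6, n_visits=4))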

npatki commented 2 years ago

Thanks for the description @nhenscheid. Just a few more questions --

So each patient should have records at baseline/screening, zero weeks, N weeks (plus or minus a couple days), 2N weeks (plus or minus), etc.

If the intervals aren't strict (they have plus/minus a few days), this seems like something the PARModel should already be able to learn. Is this not the case in the synthetic data?

specify a starting and ending timepoint (for instance, all times start at e.g. T=0, 1/1/2022, etc.)

So right now, the PARModel models the starting and ending points as absolute values. For example, if sequence start times generally fall between 1/1/2022 and 6/1/2022 in the real data, the same should generally be true of the synthetic data (plus or minus). Is your use case to create sequences that are outside the observed ranges?

nhenscheid commented 2 years ago

If the intervals aren't strict (they have plus/minus a few days), this seems like something the PARModel should already be able to learn. Is this not the case in the synthetic data?

The issue is that the time intervals are heterogeneous across different cohorts within the training set, so the model only learns the overall distribution. For example, one study might have a 6-month sampling interval (plus or minus a few days) while another has a 3-month interval, or something different. The model then learns a single distribution for the whole training set, and the synthetic samples end up irregular, i.e. with inconsistent follow-up times (say 0, 3, 9, 12, and 24 months).

Is your use case to create sequences that are outside the observed ranges?

No, typically the time points should interpolate within the training bounds (to avoid extrapolation error). For instance, if all the studies start 1/1/20 and end 1/1/22 (but with different sampling intervals), we would want to generate synthetic data that also runs from 1/1/20 to 1/1/22, but with a regular sampling interval (e.g. 3 months).
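The desired behavior amounts to evenly spaced timepoints between the observed bounds. A minimal sketch of that interpolation (a hypothetical helper, not an SDV feature):

```python
from datetime import datetime

def regular_timepoints(start, end, n_points):
    """Evenly spaced timestamps from start to end, inclusive."""
    step = (end - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]

# Nine points from 1/1/20 to 1/1/22 gives a roughly 3-month cadence.
for t in regular_timepoints(datetime(2020, 1, 1), datetime(2022, 1, 1), 9):
    print(t)
```

These timestamps could then be assigned to each synthetic sequence in place of the sampled ones.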

npatki commented 2 years ago

Got it, thanks for the details.

So it learns a distribution for the whole training set, but the synthetic samples end up being irregular i.e. inconsistent follow-up times (say 0 months, 3mo, 9mo, 12mo, 24mo).

Just to make sure, are you using the entity_columns parameter when setting up your PAR model? For such a use case it would be essential to break up the dataset into multiple sequences -- one for each study. (For more info, see User Guide.)
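A toy illustration of why this matters, using only stdlib: grouping rows by study keeps each study's cadence separate, which is what `entity_columns` would let PAR learn per sequence instead of one pooled distribution (the `study_id`/`month` column names here are hypothetical):

```python
from itertools import groupby

# Toy training rows: two studies with different follow-up cadences (in months).
rows = [
    {"study_id": "A", "month": 0}, {"study_id": "A", "month": 6},
    {"study_id": "A", "month": 12},
    {"study_id": "B", "month": 0}, {"study_id": "B", "month": 3},
    {"study_id": "B", "month": 6},
]

# With entity_columns=["study_id"], PAR would treat each group below as one
# sequence, so per-study intervals stay consistent rather than being mixed.
for study, seq in groupby(rows, key=lambda r: r["study_id"]):
    months = [r["month"] for r in seq]
    intervals = [b - a for a, b in zip(months, months[1:])]
    print(study, intervals)
```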