sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Support for specifying a range during conditional sampling #1843

Open srinify opened 4 months ago

srinify commented 4 months ago

Inspired by this issue originally: https://github.com/sdv-dev/SDV/issues/1833

After a quick discussion with Neha, we're opening this feature request. Currently, you can specify an exact value during conditional sampling (e.g. weight is 50), but you can't specify a range of values (e.g. weight from 50 to 200).

npatki commented 4 months ago

Workaround

For anyone blocked by this, you can use the code snippet below. This code samples a large number of rows without any condition and then filters them down to a specific range.

# TODO: input the conditions you need
COL_NAME = 'my_column_name'
LOW_RANGE = 18.0 # minimum possible value in range
HIGH_RANGE = 100.0 # maximum possible value in range

# Request more rows than you need. Maybe 1,000 if you need 100 true rows.
synthetic_data = synthesizer.sample(1000)

# Filter out rows to within the range
filtered_synthetic_data = synthetic_data[(synthetic_data[COL_NAME] >= LOW_RANGE) & (synthetic_data[COL_NAME] <= HIGH_RANGE)]
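
If you need an exact number of in-range rows, the same sample-then-filter idea can be wrapped in a loop that keeps sampling batches until enough rows survive the filter. The following is only a minimal sketch, not an SDV feature: `sample_in_range`, `batch_size`, `max_batches`, and the `FakeSynthesizer` stand-in are hypothetical names introduced here for illustration. The only thing assumed about the synthesizer is that `sample(n)` returns a pandas DataFrame, as in the snippet above.

```python
import random

import pandas as pd


def sample_in_range(synthesizer, col_name, low, high, num_rows,
                    batch_size=1000, max_batches=100):
    """Sample in batches and keep rows where low <= col_name <= high.

    Hypothetical helper (not part of SDV). Stops as soon as num_rows
    in-range rows have been collected, or raises after max_batches.
    """
    kept = []
    total = 0
    for _ in range(max_batches):
        batch = synthesizer.sample(batch_size)
        in_range = batch[(batch[col_name] >= low) & (batch[col_name] <= high)]
        kept.append(in_range)
        total += len(in_range)
        if total >= num_rows:
            break
    else:
        # The loop ended without collecting enough rows.
        raise ValueError(
            f'Only found {total} of {num_rows} rows in range '
            f'after {max_batches} batches.'
        )
    # Trim the surplus from the last batch and renumber the index.
    return pd.concat(kept, ignore_index=True).head(num_rows)


class FakeSynthesizer:
    """Hypothetical stand-in so the sketch runs without a fitted synthesizer."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def sample(self, num_rows):
        values = [self._rng.uniform(0, 300) for _ in range(num_rows)]
        return pd.DataFrame({'weight': values})


rows = sample_in_range(FakeSynthesizer(), 'weight', 50.0, 200.0, num_rows=100)
```

With a real fitted synthesizer you would pass it in place of `FakeSynthesizer()`. Note that this still generates and discards out-of-range rows, so a narrow range on a rarely sampled region may need many batches.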

adib0073 commented 3 months ago

Thanks @npatki for sharing the workaround code. Can such conditions be defined even before generating the samples? I think it would be better to have something like generate with conditions (different from generating with constraints) to avoid unnecessary computation time in generating and then filtering based on conditions.

npatki commented 3 months ago

Hi @adib0073, unfortunately I cannot think of a good workaround that would allow you to do so right now.

However, in the future, when the team adds an actual feature to enable range-based conditional sampling, that is exactly how I envision it working.