sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Is it possible to specify a distribution that one or more columns need to follow? #2025

Closed b-a0 closed 4 months ago

b-a0 commented 4 months ago

Environment details

N/A

Problem description

My problem revolves around creating synthetic population data from aggregate census statistics. The synthetic data would contain the characteristics of individual agents, and the distribution of those characteristics across the population should match the distribution found in the aggregate census statistics.

In a highly simplified situation, the input would be, for example:

Desired output would be a table with N rows (where N is the population size) with columns:

What I already tried

N/A
srinify commented 4 months ago

Hi there @b-a0, SDV is designed to generate synthetic data that resembles the real data you provide to the model. If you don't have existing data to train SDV on, then SDV is likely not the right solution here!

If you already have a contingency table, you could write a custom script that turns it into a dataset by iterating over its rows and repeating each category combination according to its count. ChatGPT gave this sample code to do just that, which you can adapt to your liking:

import pandas as pd
import numpy as np

# Sample contingency table as a DataFrame
data = {
    'Age Group': ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39', '40-49', '40-49', '40-49'],
    'Income Level': ['Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High'],
    'Count': [3, 5, 2, 4, 3, 6, 5, 2, 4]
}

df = pd.DataFrame(data)

# De-aggregate the table
de_aggregated_data = []

for _, row in df.iterrows():
    age_group = row['Age Group']
    income_level = row['Income Level']
    count = int(row['Count'])  # plain Python int, so the list repetition below is unambiguous

    de_aggregated_data.extend([[age_group, income_level]] * count)

# Create a new DataFrame from the de-aggregated data
de_aggregated_df = pd.DataFrame(de_aggregated_data, columns=['Age Group', 'Income Level'])

# Add a unique ID column without duplicates
np.random.seed(42)  # For reproducibility
unique_ids = np.arange(1, len(de_aggregated_df) + 1)
np.random.shuffle(unique_ids)
de_aggregated_df['ID'] = unique_ids

# Display the de-aggregated DataFrame with the unique ID column
print(de_aggregated_df)
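
For larger tables, the same de-aggregation can be done without an explicit Python loop. The following is a minimal sketch, not part of the original reply: it reuses the sample table above, repeats each row label `Count` times with `Index.repeat`, and then shuffles the rows before assigning sequential IDs (a small variation on shuffling the IDs themselves). The names `expanded` and `recount` are illustrative only.

```python
import pandas as pd

# Same sample contingency table as in the snippet above
df = pd.DataFrame({
    'Age Group': ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39', '40-49', '40-49', '40-49'],
    'Income Level': ['Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High'],
    'Count': [3, 5, 2, 4, 3, 6, 5, 2, 4],
})

# Repeat each row label `Count` times, then select the rows for those labels
expanded = (
    df.loc[df.index.repeat(df['Count']), ['Age Group', 'Income Level']]
    .reset_index(drop=True)
)

# Shuffle the rows so their order does not reveal the grouping, then number them
expanded = expanded.sample(frac=1, random_state=42).reset_index(drop=True)
expanded['ID'] = range(1, len(expanded) + 1)

# Sanity check: re-aggregating the rows should reproduce the original counts
recount = (
    expanded.groupby(['Age Group', 'Income Level'])
    .size()
    .rename('Count')
    .reset_index()
)
```

Because every row is an exact copy of a cell in the contingency table, `recount` matches the input counts exactly rather than approximately.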

Let me know if I understood your use case or if I misunderstood something :) @b-a0

b-a0 commented 4 months ago

Thanks, you've understood my use case perfectly. I will try alternative methods to SDV.
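
One possible approach outside SDV, when the census source gives proportions rather than counts, is to draw N individuals directly from the joint categorical distribution. The sketch below is only an illustration under stated assumptions: the `shares` table and its values are hypothetical (not real census figures), and `sample_population` is a made-up helper name. Note that random sampling matches the target distribution only in expectation; de-aggregating counts, as suggested above, matches it exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical joint shares, for illustration only (not real census figures)
shares = pd.DataFrame({
    'Age Group': ['20-29', '20-29', '30-39', '30-39'],
    'Income Level': ['Low', 'High', 'Low', 'High'],
    'Share': [0.1, 0.2, 0.3, 0.4],
})


def sample_population(shares, n, seed=0):
    """Draw n rows whose category mix follows the `Share` column in expectation."""
    rng = np.random.default_rng(seed)
    p = shares['Share'].to_numpy(dtype=float)
    p = p / p.sum()  # normalise in case the shares do not sum exactly to 1
    # Pick one row position of `shares` for each synthetic individual
    picks = rng.choice(len(shares), size=n, p=p)
    population = shares.drop(columns='Share').iloc[picks].reset_index(drop=True)
    population.insert(0, 'ID', np.arange(1, n + 1))
    return population


population = sample_population(shares, n=10_000)

# Observed share of each category combination in the synthetic population
observed = population.groupby(['Age Group', 'Income Level']).size() / len(population)
```

Comparing `observed` with the `Share` column gives a quick check of how closely a given sample tracks the target distribution; the gap shrinks as N grows.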