Is it possible to specify a distribution that one or more columns need to follow?

b-a0 commented 4 months ago

Environment details

N/A

Problem description

My problem revolves around creating synthetic population data from aggregate census statistics. The synthetic data would contain the characteristics of individual agents, and the distribution of those characteristics across the population should match the distribution found in the aggregate census statistics.

In a highly simplified case, the input would be, for example:

A frequency table of age groups
A frequency table of income groups
A contingency table of age and income groups

The desired output would be a table with N rows (where N is the population size) and the following columns:

Agent ID
Age group
Income group
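
To make the setup concrete, here is a minimal sketch of how these inputs might be represented in pandas. The counts and the names `contingency`, `age_freq`, `income_freq`, and `n_agents` are made up for illustration. The sketch assumes the two frequency tables are consistent with the contingency table, in which case they are simply its row and column sums, and the contingency table alone determines the joint distribution.

```python
import pandas as pd

# Hypothetical counts, for illustration only (not real census data).
contingency = pd.DataFrame(
    [[3, 5, 2], [4, 3, 6], [5, 2, 4]],
    index=pd.Index(["20-29", "30-39", "40-49"], name="Age group"),
    columns=pd.Index(["Low", "Medium", "High"], name="Income group"),
)

# The two frequency tables are the row and column sums of the contingency table.
age_freq = contingency.sum(axis=1)
income_freq = contingency.sum(axis=0)

# N, the population size, is the grand total of all cells.
n_agents = int(contingency.to_numpy().sum())

print(age_freq)
print(income_freq)
print(n_agents)
```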

What I already tried

I've looked into the documentation for constraint logic and custom logic, but I don't think either allows me to "constrain" the distribution.

N/A

srinify commented 4 months ago

Hi there @b-a0, SDV is designed to generate synthetic data that's similar to the real data you provide to the model. If you don't have existing data to feed to SDV for model training, then SDV is likely not the right solution here!

If you already have a contingency table, you could write a custom script to turn it into a dataset by iterating over all the rows. ChatGPT gave this sample code to do just that, which you can adapt to your liking:

import pandas as pd
import numpy as np

# Sample contingency table as a DataFrame
data = {
    'Age Group': ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39', '40-49', '40-49', '40-49'],
    'Income Level': ['Low', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'High'],
    'Count': [3, 5, 2, 4, 3, 6, 5, 2, 4]
}

df = pd.DataFrame(data)

# De-aggregate the table
de_aggregated_data = []

for _, row in df.iterrows():
    age_group = row['Age Group']
    income_level = row['Income Level']
    count = int(row['Count'])  # plain int, used below to repeat the row

    de_aggregated_data.extend([[age_group, income_level]] * count)

# Create a new DataFrame from the de-aggregated data
de_aggregated_df = pd.DataFrame(de_aggregated_data, columns=['Age Group', 'Income Level'])

# Add a unique ID column without duplicates
np.random.seed(42)  # For reproducibility
unique_ids = np.arange(1, len(de_aggregated_df) + 1)
np.random.shuffle(unique_ids)
de_aggregated_df['ID'] = unique_ids

# Display the de-aggregated DataFrame with the unique ID column
print(de_aggregated_df)
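
As a possible alternative to the loop, the same expansion can be sketched in a vectorized way with `Index.repeat`, followed by a sanity check that cross-tabulating the result reproduces the original counts. The names `table`, `agents`, and `rebuilt` are placeholders chosen for this sketch, and the counts are the same sample values as above.

```python
import pandas as pd

# Same sample contingency table as above, in long format.
table = pd.DataFrame({
    "Age Group": ["20-29"] * 3 + ["30-39"] * 3 + ["40-49"] * 3,
    "Income Level": ["Low", "Medium", "High"] * 3,
    "Count": [3, 5, 2, 4, 3, 6, 5, 2, 4],
})

# Repeat each row label `Count` times, then select those rows.
agents = (
    table.loc[table.index.repeat(table["Count"]), ["Age Group", "Income Level"]]
    .reset_index(drop=True)
)

# Sequential agent IDs; shuffle them afterwards if a random order is preferred.
agents.insert(0, "ID", range(1, len(agents) + 1))

# Sanity check: cross-tabulating the agents should reproduce the counts.
rebuilt = pd.crosstab(agents["Age Group"], agents["Income Level"])
print(rebuilt)
```

For large populations this avoids building a Python list row by row, though the loop version is perfectly adequate for small tables.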

Let me know if I understood your use case or if I misunderstood something :) @b-a0

b-a0 commented 4 months ago

Thanks, you've understood my use case perfectly. I will look into alternatives to SDV.
