sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv
Other
2.33k stars 307 forks

Fixed combinations Constraint #2253

Open Pavan-Kalyan1432 opened 1 week ago

Pavan-Kalyan1432 commented 1 week ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

What I already tried

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer, CTGANSynthesizer, TVAESynthesizer
import pandas as pd
import os

real_data = pd.read_csv('data//BILLING.csv')
# Drop fully empty columns first; after fillna("") there are no NaNs left for dropna to find.
real_data = real_data.dropna(axis=1, how='all').fillna("")
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)
metadata.update_columns_metadata(
    {
        "First Name":{"sdtype":"categorical"},
        "Last Name":{"sdtype":"categorical"},
        "Middle Name":{"sdtype":"categorical"},
        "Full Name":{"sdtype":"categorical"},
        # SDV models dates with the "datetime" sdtype
        "Date of Birth":{"sdtype":"datetime"},
        "National ID":{"sdtype":"categorical"}
    }
)

metadata.update_column("Phone Number", pii=False)

metadata.remove_primary_key()

path = 'output//metadata.json'
if os.path.exists(path):
    os.remove(path)
metadata.save_to_json(path)

my_constraint = {
    'constraint_class' : "FixedCombinations",
    'constraint_parameters' : {
        'column_names' : ['First Name', 'Middle Name', 'Last Name', 'Full Name']
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints(constraints=[my_constraint])
synthesizer.fit(real_data)

for col in real_data.columns:
    null_count = real_data[col].isnull().sum()
    empty_string_count = (real_data[col] == "").sum()
    total_nulls = null_count + empty_string_count
    total_cells = real_data.shape[0]
    null_percentage = (total_nulls / total_cells) * 100 if total_cells > 0 else 0
    # Built-in round() also works when the fallback value is a plain int
    null_percent = round(null_percentage, 2)
    print(f"{col} - {null_percent}%")

s = []

while True:
    column = input("Enter the column name to fix (or 'exit' to stop): ")
    if column == "exit":
        break
    if column not in real_data.columns:
        print("Column not found")
        continue
    s.append(column)

if s:
    fixed_columns = real_data[s]
    synthetic_data = synthesizer.sample_remaining_columns(fixed_columns, max_tries_per_batch=200)
else:
    synthetic_data = synthesizer.sample(num_rows=50)

synthetic_data.to_csv('output//synthetic_data_1.csv', index=False)

Here, FixedCombinations repeats the same combinations, but it does not cover all of them. What can I do to make it use every combination of first name, middle name, last name, and full name found in the real data?

srinify commented 1 week ago

Hi @Pavan-Kalyan1432 can you clarify what you mean by "repeating the combinations but it is not considering all the combinations"?

When generating synthetic data, using this constraint will ensure that the synthesizer will only use the same combinations of values in these 4 columns that exist in your real data. So, for example, if you only have rows containing the combination: "Jack", "John", "Jay", and "Jack John Jay" for your 4 columns, then this will be the only combination that will show up in the synthetic data.
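To check this behavior on your own output, a minimal sketch using plain pandas is shown here. The helper name `combination_coverage` and the toy rows are hypothetical and not part of SDV; with real data you would pass the `real_data` and `synthetic_data` DataFrames from the script above. It assumes missing values have already been filled (as the script does with `fillna("")`), since NaN values do not compare equal to each other.

```python
import pandas as pd

NAME_COLUMNS = ["First Name", "Middle Name", "Last Name", "Full Name"]


def combination_coverage(real, synthetic, columns):
    """Compare the unique value combinations in two DataFrames (hypothetical helper)."""
    # Each row of the selected columns becomes a plain tuple so sets can compare them.
    real_combos = set(real[columns].itertuples(index=False, name=None))
    synth_combos = set(synthetic[columns].itertuples(index=False, name=None))
    return {
        "real_unique": len(real_combos),
        "synthetic_unique": len(synth_combos),
        # Combinations that appear only in the synthetic data (the constraint should keep this at 0).
        "new_combinations": len(synth_combos - real_combos),
        # Real combinations that were never sampled.
        "missing_combinations": len(real_combos - synth_combos),
        "coverage": len(real_combos & synth_combos) / len(real_combos) if real_combos else 0.0,
    }


# Toy stand-ins for real_data and synthetic_data.
real = pd.DataFrame(
    [
        ("Jack", "John", "Jay", "Jack John Jay"),
        ("Ann", "", "Lee", "Ann Lee"),
        ("Raj", "K", "Rao", "Raj K Rao"),
    ],
    columns=NAME_COLUMNS,
)
synthetic = pd.DataFrame(
    [
        ("Jack", "John", "Jay", "Jack John Jay"),
        ("Jack", "John", "Jay", "Jack John Jay"),
        ("Ann", "", "Lee", "Ann Lee"),
        ("Jack", "John", "Jay", "Jack John Jay"),
    ],
    columns=NAME_COLUMNS,
)

report = combination_coverage(real, synthetic, NAME_COLUMNS)
print(report)
```

With the toy rows, `new_combinations` is 0 (nothing outside the real data was produced) while `missing_combinations` is 1, which matches the two symptoms described in this thread: repeated combinations and real combinations that never show up.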

Pavan-Kalyan1432 commented 1 week ago

For example, it repeats the same combination multiple times, and it also does not include all the combinations that are in the real data.

npatki commented 1 week ago

Hi @Pavan-Kalyan1432, if I may jump in here: The purpose of the FixedCombinations constraint is only to fix the combinations that are created. Adding this constraint will prevent new permutations from being synthesized in the columns you specify.

If you sample many more times, then I think that, due to random chance, you will eventually end up creating all the combinations that were in the original data.
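The effect of sample size can be illustrated with a small simulation that uses only the Python standard library. It is a rough sketch under a strong assumption: each row picks one of the allowed combinations uniformly and independently. SDV samples from the distribution it learned, so the real numbers will differ. The function names and the figure of 50 distinct combinations are hypothetical.

```python
import random


def unique_combinations_seen(n_combinations, n_rows, seed=0):
    """Simulate drawing n_rows rows and count how many distinct combinations appear."""
    rng = random.Random(seed)
    return len({rng.randrange(n_combinations) for _ in range(n_rows)})


def expected_unique(n_combinations, n_rows):
    """Expected number of distinct combinations under uniform, independent draws."""
    return n_combinations * (1 - (1 - 1 / n_combinations) ** n_rows)


# Hypothetical setting: 50 distinct name combinations in the real data.
N_COMBINATIONS = 50

for n_rows in (50, 200, 500):
    simulated = unique_combinations_seen(N_COMBINATIONS, n_rows)
    expected = expected_unique(N_COMBINATIONS, n_rows)
    print(f"{n_rows} rows: simulated {simulated}, expected about {expected:.1f} of {N_COMBINATIONS}")
```

Under this assumption, sampling 50 rows from 50 combinations is expected to show only about 32 of them, with the rest of the rows being repeats. Coverage approaches all 50 only as the number of sampled rows grows well beyond the number of combinations.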

However, preventing repetition is not the purpose of this constraint. May I ask why you want to prevent repetition in your data? This indicates to me that in your synthetic data, you just want the exact same names to appear in the exact same rows as in your real data. Is that correct? If you could provide more information on your usage (what are you trying to accomplish with synthetic data?), we can better guide you to a solution. Thanks.