Discrete columns cause memory overflow issues during CTGAN processing.

jalr4ever commented 3 weeks ago

Environment details


SDV version: 1.15.0
Python version: 3.11.9
Operating System: MacOS 14.5

Problem description

Hi, sdv. I've recently encountered some confusion regarding PII columns while using CTGAN to generate data. I have two questions:

Discrete column masking issue - In my business, I've noticed that some addresses and random IDs were not included in the model training. The generated models contain placeholder random strings for PII information, but I need these columns to reflect actual data instead of just simple placeholders.
PII column labeling logic issue - I saw in sdv/single_table/ctgan.py that when the number of columns exceeds 1000, a warning appears about handling discrete columns. My understanding is that if SDV detects 1000 different values in a data row, it automatically labels this column as containing PII information? (I also see "unknown" indicated in the metadata type.)

What I already tried

Anyway, I converted my discrete data into categorical format for training with SDV. However, when my dataset grew to 100,000 rows and 35 columns (including 5 identified by SDV as unknown and PII), the training process ran out of memory.

from datetime import datetime

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

file_name = '20240819094312_37_10000.csv'
current_time = datetime.now().strftime("%Y%m%d%H%M%S")

real_data = pd.read_csv(file_name)
print(real_data.head(10))

# Create metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Turn pii to categorical
for key, value in metadata.columns.items():
    if 'pii' in value and value['pii'] is True:
        del value['pii']
        value['sdtype'] = 'categorical'  # back to categorical

synthesizer = CTGANSynthesizer(
    metadata,
    epochs=2,
    verbose=True
)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1000)
print(synthetic_data.head())

Do you have any suggestions for addressing these two issues? I look forward to your response.

srinify commented 3 weeks ago

Hi @jalr4ever do you mind sharing your metadata and some more context on your use case / problem domain -- that will help me build some context on your data better and give you better suggestions!

In general, my advice would be to update your metadata. The baseline auto-detection is a starting point, but you can update using the update API.

Regarding addresses / IDs being different in the synthetic data

The SDV handles PII and ID columns a bit differently than datetime, numerical, and categorical columns.

PII Columns

Because PII columns usually contain sensitive information, the SDV replaces their values with entirely new ones. You can get SDV to generate more relevant synthetic values for these columns by updating the sdtype to a more granular PII sdtype:

# Postal / Zip Code
metadata.update_column(column_name="postal_code",  sdtype="postcode")

# IP Address (v4)
metadata.update_column(column_name="IP_address", sdtype="ipv4_address")

If you need to maintain the same ratios of values in your synthetic data (e.g. similar distribution of postal codes, states, etc), I recommend reading up on Contextual Anonymization where the SDV analyzes PII columns more deeply and actively tries to maintain the same proportions (e.g. of states) in the synthetic data. FYI a lot of these features are only available in the SDV Enterprise.

ID Columns

In SDV-land, ID columns don't contain useful mathematical properties and primarily exist to uniquely identify rows. You can optionally provide a regular expression format if you want the synthetic ID values to follow a specific pattern that's similar to your real data:

# Example regex
metadata.update_column(
    column_name="product_code",
    sdtype="id",
    regex_format="[0-9]{4}-[0-9]{4}"
)

jalr4ever commented 3 weeks ago

@srinify Thanks for reply.

20240821152459_100000_sample40000.csv

This is a mock dataset containing approximately 40,000 rows in CSV format and over 30 columns. Columns 2, 7, 15, and 16 contain my business data but have many categories; the other columns are either numerical or string types.

When I input this into SDV CTGAN, it identified columns 2, 7, 15, and 16 as PII columns of an unknown type. Forcing them to be recognized as categorical resulted in a memory overflow.

Do you have any suggestions? If I only want to generate business data for columns 2, 7, 15, and 16 (specifically pii_), do I need to write regular expressions myself and provide them to SDV?

srinify commented 3 weeks ago

Thanks for sharing the sample @jalr4ever Is the error occurring when you're fitting the synthesizer?

If I only want to generate business data for columns 2, 7, 15, and 16 (specifically pii_)

The key consideration here is -- do you need the exact same values to be in your synthetic data? Then identifying them as the categorical sdtype is the key. If you instead want to anonymize PII values and it's not important to replicate the same values, PII sdtypes are your best bet.

If they need to be Categorical / replicated in the synthetic data -- I have 2 things for you to consider:

Using GaussianCopulaSynthesizer instead of CTGANSynthesizer. In many cases, model fitting should be faster and require fewer resources. CTGAN is expensive to train, especially with lots of rows and columns, because generative adversarial networks take a while to train.
Using a subsample of your data. We've worked on many projects where people were able to use a 1-10% sample of their data to still get high quality synthetic data. We wrote about that here: https://datacebo.com/blog/sdv-training-subsample/
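
The two suggestions above can be combined. The following is a minimal sketch, not an official SDV recipe: `fit_on_subsample` is a hypothetical helper name, and the 10% fraction and the seed are placeholder values to tune. It assumes `real_data` is a pandas DataFrame and that the synthesizer class takes the metadata as its first argument, as `CTGANSynthesizer` does in the script earlier in this thread.

```python
def fit_on_subsample(real_data, metadata, synthesizer_cls, frac=0.1, seed=0, **kwargs):
    """Fit `synthesizer_cls` on a random fraction of `real_data`.

    Hypothetical helper: the name and the default values are illustrative only.
    """
    # DataFrame.sample draws rows without replacement by default;
    # random_state makes the subsample reproducible.
    training_rows = real_data.sample(frac=frac, random_state=seed)
    synthesizer = synthesizer_cls(metadata, **kwargs)
    synthesizer.fit(training_rows)
    return synthesizer


# Usage, assuming `real_data` and `metadata` from the script earlier in this thread:
#   from sdv.single_table import GaussianCopulaSynthesizer
#   synthesizer = fit_on_subsample(real_data, metadata, GaussianCopulaSynthesizer)
#   synthetic_data = synthesizer.sample(num_rows=1000)
```

Passing the synthesizer class as an argument makes it easy to compare GaussianCopulaSynthesizer and CTGANSynthesizer on the same subsample.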

jalr4ever commented 2 weeks ago

Thanks for your suggestion. In my tests, GaussianCopulaSynthesizer performs very well in this scenario!

jalr4ever commented 2 weeks ago

@srinify I notice that GaussianCopula performs well by using UniformEncoder for frequency encoding, while CTGAN employs OneHot encoding for categorical variables, leading to high dimensionality and memory issues. Why not use UniformEncoder for categorical variables in CTGAN? 🤔
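
To make the dimensionality point concrete, here is a rough back-of-the-envelope sketch. The cardinalities below are hypothetical (the real counts for columns 2, 7, 15, and 16 are not given in this thread), and the estimate assumes the encoded data is held as a dense matrix of 4-byte floats, which may not match what CTGAN actually allocates.

```python
def one_hot_width(cardinalities):
    """Number of columns produced when each categorical column is one-hot encoded."""
    return sum(cardinalities)


def dense_matrix_gib(n_rows, n_cols, bytes_per_value=4):
    """Size in GiB of a dense n_rows x n_cols matrix (4 bytes per value assumed)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Hypothetical example: 4 high-cardinality columns with 20,000 unique values each.
cardinalities = [20_000] * 4
n_rows = 100_000

one_hot_gib = dense_matrix_gib(n_rows, one_hot_width(cardinalities))
# An encoder that maps each category to a single number yields one column per input column.
single_column_gib = dense_matrix_gib(n_rows, len(cardinalities))

print(f"one-hot: {one_hot_gib:.1f} GiB, one column each: {single_column_gib:.4f} GiB")
```

Under these assumptions the one-hot representation is about 20,000 times wider, which is consistent with the out-of-memory behaviour described above.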

srinify commented 2 weeks ago

@jalr4ever that's a good question. I'm actually not 100% sure, and will look into it a bit more to understand why the logic for choosing transformers in CTGANSynthesizer differs here!

You can always update the transformers used manually by the way, which you may already know about: https://docs.sdv.dev/sdv/single-table-data/modeling/customizations/preprocessing#update_transformers
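
As a rough sketch of what that could look like: `swap_encoder` is a hypothetical helper, and the column names in the usage comment are placeholders. It relies on the synthesizer methods `auto_assign_transformers` and `update_transformers` described on the linked page. Whether CTGAN trains well on columns encoded this way is an assumption to verify on your own data.

```python
def swap_encoder(synthesizer, real_data, column_names, make_encoder):
    """Replace the transformer of each listed column before fitting.

    Hypothetical helper; `make_encoder` is any zero-argument callable that
    returns a fresh transformer, so each column gets its own instance.
    """
    # Transformers have to be assigned before they can be updated.
    synthesizer.auto_assign_transformers(real_data)
    synthesizer.update_transformers(
        column_name_to_transformer={name: make_encoder() for name in column_names}
    )
    return synthesizer


# Usage, assuming `synthesizer` and `real_data` from the script earlier in this
# thread and placeholder column names:
#   from rdt.transformers.categorical import UniformEncoder
#   swap_encoder(synthesizer, real_data, ["col_2", "col_7"], UniformEncoder)
#   synthesizer.fit(real_data)
```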
