sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Combination of DataProcessor (in BaseSynthesizer class) with DataTransformer (in e.g. CTGAN) leads to incorrect one-hot encoding for Boolean columns #1528

Closed prabaey closed 1 year ago

prabaey commented 1 year ago

Environment details

Problem description

I'm not sure whether this is a bug or I'm just using the library wrong.

I was trying to train a CTGAN for a simple (artificial) tabular dataset with 5000 records. Here's an example of the first 5 records in my dataframe:

[image: first 5 records of the dataframe]

Columns "therapy", "smoking", "exercise", "obesity" and "death" are boolean features, while "alcohol" and "stage" are categorical. The remaining ("age", "biomarker" and "genetic_factor") are numerical.

When training the GAN, I quickly ran out of memory, which turned out to be because of the OneHotEncoder in the DataTransformer class called in the fit method of CTGAN. Instead of allocating two columns for the one-hot encoding of the Boolean features, it allocated 5000. Upon further inspection this made sense, since the dataframe that enters the fit function (where the DataTransformer is applied) looks as follows:

[image: the dataframe as it enters the fit function, with the Boolean columns holding continuous values]

It's clear that the Boolean columns are no longer Boolean; instead, they contain continuous values (which is of course why the OneHotEncoder extracts 5000 different categories, one per record). The categorical features (alcohol and stage) are still intact.

I found out that this is because of the DataProcessor used in the BaseSynthesizer class. This processor applies a LabelEncoder from the RDT library to the dataframe, where add_noise is set to True for the Boolean columns, but not for the categorical columns. Is this the desired behaviour? As far as I can tell, I can't set my preferences for the add_noise feature when creating an instance of the CTGANSynthesizer class.

I understand that a simple fix for my problem is to encode my Boolean features as Categorical features with 2 classes, but I'm wondering why adding noise to the labels for Boolean columns is the default behaviour of the DataProcessor class. I understand the purpose of adding noise to the labels in some use-cases, though in that case there should be a way to tell the DataTransformer in the CTGAN not to treat these columns as discrete ones, right? I was also surprised that I couldn't find any other issues related to this problem, since I assume I'm not the first one to train a CTGAN on a dataset with Boolean features. Am I missing something? Thanks for your help!
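
The failure mode described above can be reproduced in miniature without SDV. The following is only a sketch: `label_encode` and `one_hot_width` are hypothetical stand-ins written for illustration, not the RDT LabelEncoder or the CTGAN DataTransformer, and the assumption that noise is drawn uniformly from the interval between a label and the next integer is a simplification.

```python
import random


def label_encode(values, add_noise=False, seed=0):
    """Map each distinct value to an integer label (hypothetical helper).

    With add_noise=True, each label i is replaced by a float drawn
    uniformly from [i, i + 1), so repeated values no longer share a code.
    """
    rng = random.Random(seed)
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    codes = [index[value] for value in values]
    if add_noise:
        return [code + rng.random() for code in codes]
    return codes


def one_hot_width(column):
    """Number of one-hot columns needed: one per distinct value."""
    return len(set(column))


# An artificial Boolean column with 5000 records, as in the report.
rng = random.Random(42)
smoking = [rng.random() < 0.3 for _ in range(5000)]

clean_width = one_hot_width(label_encode(smoking))
noisy_width = one_hot_width(label_encode(smoking, add_noise=True))

print("one-hot width without noise:", clean_width)
print("one-hot width with noise:   ", noisy_width)
```

Without noise the column needs two one-hot columns. With noise, nearly every record becomes its own category, so the width grows with the number of rows, which matches the memory blow-up described in the report.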

npatki commented 1 year ago

Hi @prabaey, nice to meet you.

The intended usage is to treat boolean columns the same as categorical. I believe most users don't use the 'boolean' sdtype and instead just mark everything as 'categorical', as there is no practical difference between the two. (This is something we can definitely clarify in the docs!) So your workaround would be to update the metadata and mark the relevant columns ('therapy', 'smoking', 'exercise', 'obesity', 'death') as having a categorical sdtype.
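
As a rough sketch of that workaround, the snippet below rewrites every 'boolean' sdtype to 'categorical' in a plain metadata dictionary. The `{"columns": {name: {"sdtype": ...}}}` layout is assumed to mirror SDV's single-table metadata dictionary, and `booleans_to_categorical` is a hypothetical helper, not part of SDV.

```python
import copy

# Metadata for the dataset in this issue, written as a plain dictionary.
# The layout is an assumption about SDV's single-table metadata format.
metadata_dict = {
    "columns": {
        "age": {"sdtype": "numerical"},
        "biomarker": {"sdtype": "numerical"},
        "genetic_factor": {"sdtype": "numerical"},
        "alcohol": {"sdtype": "categorical"},
        "stage": {"sdtype": "categorical"},
        "therapy": {"sdtype": "boolean"},
        "smoking": {"sdtype": "boolean"},
        "exercise": {"sdtype": "boolean"},
        "obesity": {"sdtype": "boolean"},
        "death": {"sdtype": "boolean"},
    }
}


def booleans_to_categorical(metadata):
    """Return a copy of the metadata with boolean columns marked categorical."""
    updated = copy.deepcopy(metadata)
    for spec in updated["columns"].values():
        if spec.get("sdtype") == "boolean":
            spec["sdtype"] = "categorical"
    return updated


fixed = booleans_to_categorical(metadata_dict)
for name in ("therapy", "smoking", "exercise", "obesity", "death"):
    print(name, "->", fixed["columns"][name]["sdtype"])
```

Assuming an SDV 1.x install, the resulting dictionary could presumably be loaded with `SingleTableMetadata.load_from_dict` before creating the synthesizer; alternatively, calling `update_column` on an existing metadata object for each column should have the same effect.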

The current treatment of booleans is a bug, so I've filed #1530 to track a fix for it.

BTW:

"As far as I can tell, I can't set my preferences for the add_noise feature when creating an instance of the CTGANSynthesizer class."

You can set your transformer preferences by using the update_transformers method, although this will no longer be needed if you follow the workaround above. (For more info, see docs and demo notebook).

prabaey commented 1 year ago

Thank you, I'm glad that's cleared up! I'll use categorical columns from now on like you suggest.

npatki commented 1 year ago

Glad to help!