sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Improve categorical column quality: Use label encoding (+noise) #583

Closed katxiao closed 1 year ago

katxiao commented 3 years ago

Problem Description

When modeling a categorical column with two categories, sometimes the less frequent value is not sampled.

Expected behavior

Modeling the binary categorical column as a boolean column seems to fix this issue, so that both values are sampled.

npatki commented 2 years ago

Right now, our definition of a boolean sdtype is that it must contain True/False values. Per this definition, modeling any binary categorical column as boolean would not be allowed in the future.

We could extend the definition of boolean to allow for multiple representations, but I don't think this is necessary: Boolean columns essentially use label encoding to transform values to 0/1. A binary categorical column will behave exactly the same if you use a label encoding on it.
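To illustrate the equivalence described above, here is a minimal sketch in plain Python. It is not RDT's implementation; the helper names `fit_label_encoding` and `encode` and the sample columns are made up for illustration. It shows that a two-category column and a True/False column reduce to the same 0/1 codes once label encoded.

```python
def fit_label_encoding(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def encode(values, mapping):
    """Replace every value with its integer code."""
    return [mapping[value] for value in values]


# Hypothetical sample data: the same pattern expressed two ways.
binary_categorical = ["no", "yes", "no", "no", "yes"]
boolean_column = [False, True, False, False, True]

cat_codes = encode(binary_categorical, fit_label_encoding(binary_categorical))
bool_codes = encode(boolean_column, fit_label_encoding(boolean_column))

# Both columns produce the same sequence of 0/1 codes.
print(cat_codes)
print(bool_codes)
```

Because the downstream model only sees the numeric codes, it cannot tell the two columns apart.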

Generally speaking, I agree that label encoding leads to more accurate data. Adding noise to label encoding has been proven to increase the quality even more. Let's update this issue into a feature request for making label encoding (+noise) the default. This will be possible when we integrate with RDT 1.0.
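One way to picture label encoding with noise is sketched below in plain Python. This is an assumption about the general technique, not a copy of RDT's code: each category gets an integer code, uniform noise in [0, 1) is added so that category `i` occupies the interval [i, i+1), and the reverse step floors the number back to a code. The function names and the 95/5 sample column are hypothetical.

```python
import math
import random


def fit(values):
    """Assign integer codes to the sorted distinct categories."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def transform(values, mapping, rng):
    """Encode each value as its code plus uniform noise in [0, 1)."""
    return [mapping[value] + rng.random() for value in values]


def reverse_transform(numbers, mapping):
    """Floor each number to a code (clipped to the valid range) and decode it."""
    inverse = {code: category for category, code in mapping.items()}
    top = len(inverse) - 1
    return [inverse[min(max(math.floor(x), 0), top)] for x in numbers]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical imbalanced binary column: 95 "common" rows, 5 "rare" rows.
column = ["common"] * 95 + ["rare"] * 5

mapping = fit(column)
noisy = transform(column, mapping, rng)
restored = reverse_transform(noisy, mapping)

print(mapping)
print(restored == column)
```

The noise turns a column of repeated integers into a continuous-looking one, while flooring still recovers the original category, including the rare one.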

npatki commented 1 year ago

This issue has been resolved: SDV 1.0 now assigns label encoding (with noise) by default to all categorical and boolean columns.