Closed katxiao closed 1 year ago
Right now, our definition of a boolean sdtype is that it must contain True/False values. Per this definition, modeling any binary categorical column as boolean would not be allowed in the future.
We could extend the definition of boolean to allow for multiple representations, but I don't think this is necessary: boolean columns essentially use label encoding to transform values to 0/1. A binary categorical column will behave exactly the same if you use label encoding on it.
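To illustrate the equivalence, here is a minimal sketch in plain Python. It does not use the SDV or RDT API; `label_encode` is a hypothetical helper that assigns integer codes in order of first appearance, which is an assumption rather than the library's actual ordering rule.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# The same binary column, once stored as booleans and once as categories.
boolean_column = [False, True, False, False, True]
binary_column = ["no", "yes", "no", "no", "yes"]

bool_encoded, bool_codes = label_encode(boolean_column)
cat_encoded, cat_codes = label_encode(binary_column)

# Both representations reduce to the same sequence of 0/1 codes.
print(bool_encoded)
print(cat_encoded)
```

Once encoded, the model sees the same numeric column in both cases, so the boolean sdtype adds nothing beyond the choice of encoding.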
Generally speaking, I agree that label encoding leads to more accurate data. Adding noise to label encoding has been shown to improve quality even further. Let's turn this issue into a feature request for making label encoding (+noise) the default. This will be possible when we integrate with RDT 1.0.
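As a rough sketch of what label encoding with noise could look like: each category gets an integer code, the transformed value is drawn uniformly from the interval starting at that code, and the reverse step floors the value back to a category. This is an assumption about the general technique, not the RDT implementation; `fit_codes`, `transform`, and `reverse_transform` are hypothetical names.

```python
import math
import random


def fit_codes(values):
    """Assign an integer code to each distinct value, in order of first appearance."""
    return {value: i for i, value in enumerate(dict.fromkeys(values))}


def transform(values, codes, rng):
    """Replace each value with its code plus uniform noise from [0, 1)."""
    return [codes[value] + rng.random() for value in values]


def reverse_transform(noisy_values, codes):
    """Floor each noisy value back to a code, clamped to the valid range."""
    labels = list(codes)
    recovered = []
    for x in noisy_values:
        index = min(max(math.floor(x), 0), len(labels) - 1)
        recovered.append(labels[index])
    return recovered


rng = random.Random(0)  # fixed seed so the sketch is reproducible
column = ["a", "a", "b", "a", "a", "a"]  # "b" is the less frequent category

codes = fit_codes(column)
noisy = transform(column, codes, rng)
recovered = reverse_transform(noisy, codes)
print(recovered == column)
```

The noise spreads each category over a continuous interval instead of a single point, which is the property the comment above credits with improving quality. The clamp in the reverse step also handles sampled values that fall outside the fitted range.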
This issue has been resolved: SDV 1.0 now assigns label encoding (with noise) by default to all categorical and boolean columns.
Problem Description
When modeling a categorical column with two categories, sometimes the less frequent value is not sampled.
Expected behavior
Modeling the binary categorical column as a boolean column seems to fix this issue, so that both values are sampled.