Closed frances-h closed 4 days ago
If anyone is running into this, here is a suggested workaround: take the categorical columns (in the metadata) that are actually represented as numbers in your data (ints, floats, etc.) and cast them to a non-numeric dtype before fitting.
Here is a code snippet that accomplishes this. Replace the list CAT_COLUMN_NAMES with the list of your column names.
from sdv.sequential import PARSynthesizer

CAT_COLUMN_NAMES = ['ColA', 'ColB', ... ]
data = <your pandas DataFrame>
metadata = <your SingleTableMetadata object>

# cast the categorical columns to the object dtype (non-numeric)
for col_name in CAT_COLUMN_NAMES:
    data[col_name] = data[col_name].astype('object')

# now proceed with modeling and sampling as usual
synthesizer = PARSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=10)

# (optional) cast the categorical columns back to floats
for col_name in CAT_COLUMN_NAMES:
    try:
        synthetic_data[col_name] = synthetic_data[col_name].astype('float')
    except (ValueError, TypeError):
        # the column contains values that cannot be parsed as floats
        print('Column name', col_name, 'could not be converted back to a float')
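To check whether the workaround helped, it can be useful to measure how many sampled values fall inside the original set of categories. The helper below is only a rough sketch of that idea in plain Python; the function name category_adherence is made up for illustration and this is not the SDMetrics implementation of CategoryAdherence. It also does not treat missing values (NaN) specially.

```python
def category_adherence(real_values, synthetic_values):
    """Return the fraction of synthetic values that also occur in the real values.

    Hypothetical helper for illustration only; NaN handling is left out.
    """
    allowed = set(real_values)
    synthetic_values = list(synthetic_values)
    if not synthetic_values:
        # nothing was sampled, so nothing violates the categories
        return 1.0
    in_range = sum(1 for value in synthetic_values if value in allowed)
    return in_range / len(synthetic_values)


# toy example: a float column whose only valid categories are 1.0, 2.0 and 3.0
real = [1.0, 2.0, 3.0, 2.0]
sticks_to_categories = [1.0, 3.0, 3.0, 2.0]
drifts_away = [1.0, 2.4, 2.9, 1.7]

print(category_adherence(real, sticks_to_categories))  # 1.0
print(category_adherence(real, drifts_away))           # 0.25
```

With pandas data you could pass data[col_name].tolist() and synthetic_data[col_name].tolist() for each name in CAT_COLUMN_NAMES. A value well below 1.0 suggests the synthesizer is still treating that column as continuous.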
Environment Details
Please indicate the following details about the environment in which you found the bug:
Error Description
When running PAR with categorical columns that are floats, PAR does not stick to the original categories when sampling. This leads to a very low diagnostic score for 'Data Validity' due to the CategoryAdherence metric failing.
Steps to reproduce