sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

`NewRowSynthesis` fails with category data type #397

Open pvk-developer opened 1 year ago

pvk-developer commented 1 year ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

Running DiagnosticReport, or NewRowSynthesis on its own, raises the following error when a categorical column in real_data has the category dtype, or any dtype other than object.

pandas.errors.UndefinedVariableError: name <value> is not defined

Steps to reproduce

To reproduce the error, cast a categorical column of real_data to the category dtype. Here is a short example:

from sdmetrics.demos import load_single_table_demo
from sdmetrics.single_table import NewRowSynthesis

real_data, synthetic_data, metadata = load_single_table_demo()
real_data['gender'] = real_data['gender'].astype('category')

NewRowSynthesis.compute_breakdown(real_data, synthetic_data, metadata)

.................
File ~/.virtualenvs/SDMetrics/lib/python3.8/site-packages/pandas/core/computation/scope.py:246, in Scope.resolve(self, key, is_local)
    244     return self.temps[key]
    245 except KeyError as err:
--> 246     raise UndefinedVariableError(key, is_local) from err

UndefinedVariableError: name 'F' is not defined
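
The same error can be triggered with pandas alone, which may help isolate the cause. The sketch below is not taken from SDMetrics: the toy DataFrame and the `count_matches` helper are hypothetical, and it assumes the metric ends up evaluating a `DataFrame.query` expression in which the categorical value appears without quotes.

```python
import pandas as pd

# Toy frame standing in for the demo data; column name and values are illustrative.
df = pd.DataFrame({'gender': pd.Series(['F', 'M', 'F'], dtype='category')})


def count_matches(frame, expression):
    """Return the number of rows matching a query expression, or the error's class name."""
    try:
        return len(frame.query(expression, engine='python'))
    except NameError as error:
        # pandas' UndefinedVariableError is a subclass of NameError.
        return type(error).__name__


# Without quotes, F is parsed as a variable name and cannot be resolved.
unquoted = count_matches(df, '`gender` == F')

# With quotes, "F" is parsed as a string literal and compared against the column.
quoted = count_matches(df, '`gender` == "F"')
```

Here `unquoted` holds the name of the exception class, while `quoted` holds the number of matching rows, so the category dtype is not a problem in itself once the value is quoted.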

Additional context

The bug is caused by the following if/else statement:

https://github.com/sdv-dev/SDMetrics/blob/585290fc829db32645c1231d5b0385b9e90a0a4c/sdmetrics/single_table/new_row_synthesis.py#L120-L123

To fix this, we have to detect the data type accurately and use the proper representation of each value.
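
One possible direction, shown only as a sketch and not as the fix adopted by the maintainers: decide how to render each value from the value itself rather than from the column dtype. The `build_equality_filter` helper is hypothetical, and it assumes the comparison is expressed as a `DataFrame.query` clause.

```python
import pandas as pd


def build_equality_filter(column, value):
    """Hypothetical helper: build a query clause that matches ``column`` to ``value``."""
    if pd.isna(value):
        # Missing values never compare equal, so test for them explicitly.
        return f'`{column}`.isnull()'
    if isinstance(value, str):
        # repr() adds the quotes and escapes embedded ones, so the value
        # is parsed as a string literal whatever the column dtype is.
        return f'`{column}` == {value!r}'
    # Numbers and booleans can be written into the expression directly.
    return f'`{column}` == {value}'
```

With a category column, `build_equality_filter('gender', 'F')` yields a clause that quotes the value, so `real_data.query(clause, engine='python')` no longer tries to resolve `F` as a variable. Other value types, such as datetimes, would need their own branch.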