hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0
3.27k stars, 545 forks

Bugfix: Update Fit Methods in Data Processors #211

Closed: MooooCat closed this 3 months ago

MooooCat commented 3 months ago

Description

This pull request introduces updates to the fit methods across several data processors within the SDG framework. Specifically, the changes involve:

  1. EmptyTransformer: Changed the empty_columns attribute from a list to a set, so that membership checks are faster and each column name is stored only once. Updated the fit method to populate this set from the columns that the metadata identifies as empty.

  2. NaNTransformer: Enhanced the fit method to accurately record numeric columns (integer and float) by iterating through the metadata and adding columns to their respective sets only if they match the expected data type.

  3. NumericValueTransformer: Updated the fit method to correctly identify and record integer and float columns by checking each column's data type against the metadata.

  4. OutlierTransformer: Similar to the NaNTransformer, the fit method was updated to accurately record integer and float columns by verifying their data types against the metadata.
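
The pattern shared by the four changes above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this pull request: `ToyMetadata`, `ToyEmptyTransformer`, `ToyNumericTransformer`, the `column_types` mapping, and the `"int"` / `"float"` labels are all hypothetical stand-ins, and the real SDG metadata and processor classes may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMetadata:
    """Hypothetical stand-in for SDG metadata (not the library's real class)."""

    # Assumed shape: column name -> declared type label.
    column_types: dict = field(default_factory=dict)
    # Assumed shape: names of columns the metadata flagged as empty.
    empty_columns: set = field(default_factory=set)


class ToyEmptyTransformer:
    """Sketch of a processor that tracks empty columns in a set."""

    def __init__(self):
        self.empty_columns = set()  # a set rather than a list

    def fit(self, metadata):
        # Copy the names so later changes to the metadata do not leak in.
        self.empty_columns = set(metadata.empty_columns)
        return self


class ToyNumericTransformer:
    """Sketch of a processor that records integer and float columns."""

    def __init__(self):
        self.int_columns = set()
        self.float_columns = set()

    def fit(self, metadata):
        # Record a column only if its declared type matches.
        for name, dtype in metadata.column_types.items():
            if dtype == "int":
                self.int_columns.add(name)
            elif dtype == "float":
                self.float_columns.add(name)
        return self


meta = ToyMetadata(
    column_types={"age": "int", "income": "float", "city": "str"},
    empty_columns={"notes"},
)
empty = ToyEmptyTransformer().fit(meta)
numeric = ToyNumericTransformer().fit(meta)
```

In this sketch the non-numeric column `city` ends up in neither set, which is the behavior the description attributes to the updated fit methods: a column is recorded only when its type matches.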

Motivation and Context

This change is required so that each data processor's fit method records only the columns it is meant to handle, as identified by the metadata.

With these updates, the data processors are more reliable and performant, which leads to better synthetic data generation.

How has this been tested?

The changes have been tested through unit tests that verify the correctness of the updated fit methods.
