hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0

Enhance: Fix Data Quality with Outlier Handling and Improved Missing Value Treatment #207

Closed MooooCat closed 3 months ago

MooooCat commented 3 months ago

Description

This pull request improves data quality in the Synthetic Data Generator (SDG) framework by adding outlier handling and refining the treatment of missing values. The key changes are:

  1. Introduction of OutlierTransformer: A new transformer class that handles outliers by converting them to specified fill values. It covers both integer and float columns; the default fill values are 0 for integers and 0.0 for floats.

  2. Enhancements to NonValueTransformer: The NonValueTransformer class has been updated to better handle missing values in a DataFrame. It now differentiates between numeric and non-numeric columns, filling missing values in numeric columns with specified numeric defaults (0 for integers, 0.0 for floats) and non-numeric columns with a default string ('NAN_VALUE').

  3. Documentation Updates: Comprehensive docstrings have been added to both the OutlierTransformer and NonValueTransformer classes, providing clear descriptions of their functionalities, attributes, and methods.

  4. Manager Registration: The OutlierTransformer has been registered with the DataProcessorManager, ensuring it can be utilized within the SDG framework.

  5. Regex Inspector Parameter Update: The parameter of the Regex Inspector's fit method has been renamed from raw_data to input_raw_data for clarity and consistency.

  6. DiscreteTransformer Registration: The registration of DiscreteTransformer is currently disabled.

  7. Test Cases for OutlierTransformer: Added test cases to validate the functionality of the OutlierTransformer, including handling of outliers in integer and float columns.
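The behaviour described in items 1 and 2 can be illustrated with a minimal sketch. This is not the SDG implementation: the function names `fill_missing` and `replace_outliers` and the `kind` parameter are hypothetical, and plain Python lists stand in for DataFrame columns. The description does not say how an outlier is detected, so the sketch assumes an outlier is a value that cannot be converted to the column's numeric type. Only the fill values (0, 0.0 and 'NAN_VALUE') come from the description above.

```python
import math

# Default fill values named in the PR description.
INT_FILL = 0
FLOAT_FILL = 0.0
STR_FILL = "NAN_VALUE"


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(column, kind):
    """Fill missing entries; `kind` is 'int', 'float', or anything else for non-numeric.

    Hypothetical helper illustrating the NonValueTransformer behaviour.
    """
    fill = {"int": INT_FILL, "float": FLOAT_FILL}.get(kind, STR_FILL)
    return [fill if is_missing(v) else v for v in column]


def replace_outliers(column, kind):
    """Replace values that cannot be cast to the numeric type with the fill value.

    Hypothetical helper; the cast-failure rule is an assumption, not SDG's documented rule.
    """
    caster, fill = (int, INT_FILL) if kind == "int" else (float, FLOAT_FILL)
    cleaned = []
    for v in column:
        try:
            cleaned.append(caster(v))
        except (ValueError, TypeError, OverflowError):
            cleaned.append(fill)
    return cleaned


print(fill_missing(["a", None, "c"], "str"))    # ['a', 'NAN_VALUE', 'c']
print(replace_outliers(["1", "abc", 3], "int"))  # [1, 0, 3]
```

Under this assumption a missing value in a numeric column also fails the cast and receives the numeric fill value, so the two steps overlap for numeric columns. Whether the real transformers behave the same way is not stated in the description.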

Motivation and Context

This change is required to enhance the robustness and reliability of the SDG, particularly in scenarios where data contains outliers or missing values.

By introducing the OutlierTransformer and enhancing the NonValueTransformer, we ensure that the generated synthetic data is of higher quality, suitable for a wider range of applications, and more representative of real-world data anomalies.

How has this been tested?

The changes have been tested with automated test cases, including the new OutlierTransformer tests that cover outliers in integer and float columns.

Types of changes

Checklist: