This pull request introduces several enhancements to the Synthetic Data Generator (SDG) framework, focused on improving data quality and on handling two kinds of data anomalies: outliers and missing values. The key changes include:
Introduction of OutlierTransformer: A new transformer class designed to handle outliers in the data by converting them to specified fill values. This class is equipped to manage outliers in both integer and float columns, replacing them with default fill values (0 for integers and 0.0 for floats).
Enhancements to NonValueTransformer: The NonValueTransformer class has been updated to better handle missing values in a DataFrame. It now differentiates between numeric and non-numeric columns, filling missing values in numeric columns with specified numeric defaults (0 for integers, 0.0 for floats) and non-numeric columns with a default string ('NAN_VALUE').
Documentation Updates: Comprehensive docstrings have been added to both the OutlierTransformer and NonValueTransformer classes, providing clear descriptions of their functionalities, attributes, and methods.
Manager Registration: The OutlierTransformer has been registered with the DataProcessorManager, ensuring it can be utilized within the SDG framework.
Regex Inspector Parameter Update: The parameter of the Regex Inspector's fit method has been renamed from raw_data to input_raw_data for clarity and consistency.
DiscreteTransformer Registration: DiscreteTransformer is currently disabled.
Test Cases for OutlierTransformer: Added test cases to validate the functionality of the OutlierTransformer, including handling of outliers in integer and float columns.
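To make the outlier handling described above concrete, here is a minimal standalone sketch in pandas. It is not the OutlierTransformer code from this PR. The function name fill_outliers and its parameters are hypothetical, and it assumes that an "outlier" is a value that cannot be parsed as the column's numeric type, which the description does not state explicitly.

```python
import pandas as pd


def fill_outliers(df, int_columns, float_columns, int_fill=0, float_fill=0.0):
    """Hypothetical helper: replace unparseable values with fill values.

    Assumption: an outlier is any value that cannot be parsed as a number.
    Missing values in the listed columns are filled as a side effect.
    """
    out = df.copy()
    for col in int_columns:
        # Unparseable entries become NaN, then the integer fill value.
        parsed = pd.to_numeric(out[col], errors="coerce")
        # Note: astype(int) truncates any fractional values.
        out[col] = parsed.fillna(int_fill).astype(int)
    for col in float_columns:
        parsed = pd.to_numeric(out[col], errors="coerce")
        out[col] = parsed.fillna(float_fill).astype(float)
    return out


raw = pd.DataFrame({"age": [25, "abc", 40], "score": [1.5, "n/a", 2.0]})
cleaned = fill_outliers(raw, int_columns=["age"], float_columns=["score"])
print(cleaned)
```

In this sketch the string "abc" in the integer column becomes 0 and "n/a" in the float column becomes 0.0, matching the default fill values listed above.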
Motivation and Context
This change is required to enhance the robustness and reliability of the SDG, particularly in scenarios where data contains outliers or missing values.
By introducing the OutlierTransformer and enhancing the NonValueTransformer, we ensure that the generated synthetic data is of higher quality, suitable for a wider range of applications, and more representative of real-world data anomalies.
How has this been tested?
The changes have been thoroughly tested using automated test cases. Specifically:
OutlierTransformer: Test cases were designed to validate the handling of outliers in integer and float columns, ensuring they are replaced with the correct fill values.
NonValueTransformer: Tests were conducted to verify the differentiation and appropriate filling of missing values in numeric and non-numeric columns.
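As an illustration of the kind of check described for the NonValueTransformer, the sketch below fills missing values by column type and asserts on the result. It is a standalone approximation, not the project's test suite or the transformer itself. The function name fill_missing and its parameters are hypothetical; only the default fill values (0, 0.0, and 'NAN_VALUE') come from the description above.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def fill_missing(df, int_fill=0, float_fill=0.0, str_fill="NAN_VALUE"):
    """Hypothetical helper: fill missing values according to column type."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # Integer columns (including nullable Int64) get the integer fill.
            out[col] = out[col].fillna(int_fill)
        elif is_float_dtype(out[col]):
            out[col] = out[col].fillna(float_fill)
        else:
            # Everything else is treated as non-numeric.
            out[col] = out[col].fillna(str_fill)
    return out


sample = pd.DataFrame(
    {
        "count": pd.array([1, None, 3], dtype="Int64"),
        "ratio": [1.5, float("nan"), 2.5],
        "label": ["a", None, "c"],
    }
)
filled = fill_missing(sample)

# Each column type receives its own default fill value.
assert filled["count"].tolist() == [1, 0, 3]
assert filled["ratio"].tolist() == [1.5, 0.0, 2.5]
assert filled["label"].tolist() == ["a", "NAN_VALUE", "c"]
```

The nullable Int64 dtype is used here only because a plain NumPy integer column cannot hold a missing value. How the actual transformer identifies integer columns is not specified in this description.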
Types of changes
[ ] Maintenance (no change in code, maintain the project's CI, docs, etc.)
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
Checklist:
[x] My code follows the code style of this project.
[ ] My change requires a change to the documentation.