hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0
3.27k stars 545 forks

Enhance Numeric Data Inspection and Introduce Positive/Negative Filtering #217

Closed MooooCat closed 2 months ago

MooooCat commented 2 months ago

Enhance NumericInspector and Implement PositiveNegativeFilter

Description

This PR extends the NumericInspector class of the Synthetic Data Generator (SDG) framework and adds a new PositiveNegativeFilter class. NumericInspector now identifies both positive and negative numeric columns, which improves the quality of the generated synthetic data. PositiveNegativeFilter filters data according to whether the values in the specified columns are positive or negative, so that data integrity is preserved during processing.
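The idea described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from this PR: the function names `inspect_sign_columns` and `filter_by_sign`, the row-of-dicts data layout, and the convention that zero is allowed in both positive and negative columns are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Set, Tuple

Row = Dict[str, object]


def is_number(value: object) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def inspect_sign_columns(rows: Sequence[Row]) -> Tuple[Set[str], Set[str]]:
    """Hypothetical inspector: return (positive_columns, negative_columns).

    Assumption: a column is "positive" if no value is below zero and
    "negative" if no value is above zero. Non-numeric columns are skipped.
    """
    positive: Set[str] = set()
    negative: Set[str] = set()
    if not rows:
        return positive, negative
    for column in rows[0]:
        values = [row[column] for row in rows]
        if not all(is_number(v) for v in values):
            continue
        if all(v >= 0 for v in values):
            positive.add(column)
        if all(v <= 0 for v in values):
            negative.add(column)
    return positive, negative


def filter_by_sign(rows: Sequence[Row], positive: Set[str], negative: Set[str]) -> List[Row]:
    """Hypothetical filter: keep only rows that respect the inspected signs."""
    kept: List[Row] = []
    for row in rows:
        ok_positive = all(row[c] >= 0 for c in positive)
        ok_negative = all(row[c] <= 0 for c in negative)
        if ok_positive and ok_negative:
            kept.append(row)
    return kept


# Inspect the "real" data to learn which columns carry a sign constraint.
real = [
    {"age": 31, "balance_change": -120.5, "name": "a"},
    {"age": 45, "balance_change": -3.0, "name": "b"},
]
positive_cols, negative_cols = inspect_sign_columns(real)

# Drop generated rows that violate those constraints.
synthetic = [
    {"age": 29, "balance_change": -50.0, "name": "x"},
    {"age": -4, "balance_change": -10.0, "name": "y"},  # negative age
    {"age": 52, "balance_change": 7.5, "name": "z"},    # positive balance_change
]
clean = filter_by_sign(synthetic, positive_cols, negative_cols)
```

In this sketch the inspection step runs once on the original data and the filter is applied to generated rows afterwards, so invalid rows are discarded rather than corrected. Whether the actual classes handle zero or missing values the same way is not stated in this description.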

Key changes include:

Motivation and Context

These changes strengthen the data quality assurance mechanisms within the SDG framework. Identifying positive and negative columns lets us ensure that the generated synthetic data meets specific criteria, which is crucial for applications such as model training and data sharing. The change addresses the need for more robust data validation and filtering, and in turn makes the generated synthetic data more reliable.

How has this been tested?

The changes have been thoroughly tested using a dedicated test suite. The following tests were performed:

Types of changes

Checklist:

MooooCat commented 2 months ago

@jalr4ever Please help me review this PR.

MooooCat commented 2 months ago

I modified the code according to the suggestions in the code review, and all unit tests now pass.