hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0

Enhance Data Handling with Empty Column Inspector and Transformer #197

Closed MooooCat closed 3 months ago

MooooCat commented 3 months ago

Description

I have implemented a new EmptyInspector class in sdgx/data_models/inspectors/empty.py that identifies the columns of a DataFrame with a high rate of missing values and tags them as empty. The tagged columns are removed during training and reinserted at their original positions once model sampling is complete.
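To illustrate the detection step, here is a minimal sketch in plain pandas. It is not the EmptyInspector code from this PR: the function name `find_empty_columns`, the parameter `missing_rate_threshold`, and the 0.9 default are assumptions made for the example, since the description does not state the actual threshold.

```python
from typing import List

import pandas as pd


def find_empty_columns(df: pd.DataFrame, missing_rate_threshold: float = 0.9) -> List[str]:
    """Return the columns whose share of missing values meets the threshold.

    Hypothetical helper: the name and the 0.9 default are assumptions,
    not values taken from the PR.
    """
    # isna() marks missing cells; mean() turns each column into a missing rate.
    missing_rate = df.isna().mean()
    # Keep the original column order so positions can be restored later.
    return [col for col in df.columns if missing_rate[col] >= missing_rate_threshold]


df = pd.DataFrame(
    {
        "age": [25, 31, 47, 52],
        "notes": [None, None, None, None],  # entirely missing
        "city": ["A", None, "C", "D"],      # one value in four missing
    }
)

empty_columns = find_empty_columns(df)
print(empty_columns)  # ['notes']
```

Only `notes` is flagged here, because `city` has a missing rate of 0.25, well below the assumed threshold.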

Additionally, I have added an EmptyTransformer class in sdgx/data_processors/transformers/empty.py to handle these empty columns by removing them during data conversion and restoring them during reverse conversion.
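The remove-then-restore round trip can be sketched as follows. This is a simplified stand-in written with plain pandas, not the EmptyTransformer implementation: the class name `EmptyColumnRoundTrip` and its method names are assumptions chosen to mirror the "conversion" and "reverse conversion" wording above.

```python
import pandas as pd


class EmptyColumnRoundTrip:
    """Hypothetical stand-in for the drop/restore behaviour described above."""

    def __init__(self, empty_columns):
        self.empty_columns = list(empty_columns)
        self.original_columns = []

    def convert(self, df: pd.DataFrame) -> pd.DataFrame:
        # Remember the full column order so it can be rebuilt afterwards.
        self.original_columns = list(df.columns)
        present = [col for col in self.empty_columns if col in df.columns]
        return df.drop(columns=present)

    def reverse_convert(self, sampled: pd.DataFrame) -> pd.DataFrame:
        # reindex() re-creates the dropped columns filled with NaN and
        # puts every column back in its original position.
        return sampled.reindex(columns=self.original_columns)


raw = pd.DataFrame(
    {"id": [1, 2, 3], "notes": [None, None, None], "score": [0.1, 0.5, 0.9]}
)

round_trip = EmptyColumnRoundTrip(empty_columns=["notes"])
converted = round_trip.convert(raw)   # what the model would be trained on
sampled = converted.head(2)           # placeholder for sampled model output
restored = round_trip.reverse_convert(sampled)

print(list(converted.columns))  # ['id', 'score']
print(list(restored.columns))   # ['id', 'notes', 'score']
```

Storing the original column order during conversion is what allows the empty column to return to its original position rather than being appended at the end.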

Motivation and Context

This change is required to improve the handling of missing data in our data processing pipeline.

By identifying and handling empty columns explicitly, we can prevent issues related to missing data during model training and ensure that these columns are correctly reinserted after processing.

As a result, the sampled output keeps the same column layout as the input data, and model training no longer has to deal with columns that consist almost entirely of missing values.

How has this been tested?

I have added a new test module tests/data_models/inspector/test_empty.py to verify the functionality of the EmptyInspector class. The test checks that the inspector correctly identifies columns with a high rate of missing values and that the inspection level is set correctly. It uses a sample DataFrame in which specific columns are set to missing values to simulate this scenario.
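A test of this kind might look roughly like the sketch below. It is not the content of test_empty.py: `high_missing_columns` is a hypothetical stand-in for the inspector's detection logic with an assumed 0.9 threshold, and the inspection-level check is left out because its API is not shown in this description.

```python
import pandas as pd


def high_missing_columns(df: pd.DataFrame, threshold: float = 0.9):
    # Hypothetical stand-in for the inspector; the threshold is an assumption.
    rates = df.isna().mean()
    return sorted(rates[rates >= threshold].index)


def make_sample_frame() -> pd.DataFrame:
    # Specific columns are set to missing values to simulate the scenario.
    return pd.DataFrame(
        {
            "complete": list(range(20)),
            "all_missing": [None] * 20,              # 100% missing
            "mostly_missing": [1.0] + [None] * 19,   # 95% missing
            "partly_missing": [1.0] * 15 + [None] * 5,  # 25% missing
        }
    )


def test_detects_high_missing_columns():
    detected = high_missing_columns(make_sample_frame())
    assert detected == ["all_missing", "mostly_missing"]
    # Columns below the threshold must not be tagged as empty.
    assert "partly_missing" not in detected
    assert "complete" not in detected
```

Including a partly missing column in the fixture guards against a detector that flags any column containing a missing value.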

Once all automated tests in this PR have passed, I will add test cases for the EmptyTransformer.

Types of changes

Checklist: