hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0
3.27k stars 545 forks

Add ConstInspector and ConstValueTransformer for Handling Constant Columns #202

Closed MooooCat closed 3 months ago

MooooCat commented 3 months ago

Description

This pull request introduces several enhancements and fixes to the Synthetic Data Generator (SDG) framework, focusing on the handling of constant columns in tabular data. As the title indicates, the main additions are ConstInspector and ConstValueTransformer.

Motivation and Context

This change is required to improve the quality and utility of the synthetic data generated by the SDG framework.

By identifying and handling constant columns, we ensure that the synthetic data maintains the integrity of the original data.

This enhancement also addresses the need for more robust data transformation capabilities, allowing for more accurate and controlled generation of synthetic data.
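The idea of identifying and handling constant columns can be sketched in plain Python. This is a minimal illustration, not SDG's implementation: the function names `find_constant_columns`, `drop_columns`, and `restore_constants`, the dict-of-lists table layout, and the sample values are all assumptions made for the example. The sketch detects columns with a single distinct value, removes them before a model would be trained, and writes the recorded value back into the sampled output.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical table layout for this sketch: column name -> list of values.
Table = Dict[str, List[Any]]


def find_constant_columns(table: Table) -> Dict[str, Any]:
    """Return {column: value} for each non-empty column whose values are all equal."""
    constants: Dict[str, Any] = {}
    for name, values in table.items():
        if values and all(v == values[0] for v in values):
            constants[name] = values[0]
    return constants


def drop_columns(table: Table, names: Iterable[str]) -> Table:
    """Return a copy of the table without the given columns."""
    excluded = set(names)
    return {k: v for k, v in table.items() if k not in excluded}


def restore_constants(
    table: Table, constants: Dict[str, Any], column_order: List[str], n_rows: int
) -> Table:
    """Re-insert each constant column, repeated n_rows times, in the original order."""
    restored = dict(table)
    for name, value in constants.items():
        restored[name] = [value] * n_rows
    return {name: restored[name] for name in column_order}


# Illustrative data only; "country" is the constant column.
original = {"age": [39, 50, 38], "country": ["US", "US", "US"]}
constants = find_constant_columns(original)
trainable = drop_columns(original, constants)

# Stand-in for whatever a generative model would sample from `trainable`.
sampled = {"age": [41, 27]}
synthetic = restore_constants(sampled, constants, list(original), n_rows=2)
```

Keeping the constant value outside the model means the model never has to learn a degenerate distribution, and the restored column matches the original value exactly.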

How has this been tested?

The changes have been thoroughly tested using unit tests that cover the new functionality introduced by ConstInspector and ConstValueTransformer.

Types of changes

Checklist:

MooooCat commented 3 months ago

We have observed that different Metadata instances initialized within the same function or the same batch of unit tests seem to interfere with each other, leading to inaccurate table metadata. This might be a bug, and we should create a separate Issue and PR to address it.

For example, consider the error in the test tests/data_models/test_metadata.py::test_demo_multi_table_data_metadata_parent. This test is intended for a multi-table dataset, but its metadata includes columns from the single-table dataset adult.csv, i.e. {'workclass', 'fnlwgt', 'age'}. The issue could be caused by either the metadata or the inspector.
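One common way for separately created Python objects to end up sharing state is a mutable attribute defined at class level. Whether that is what happens in SDG has not been established here; the sketch below only shows how such a leak would produce the symptom described. `LeakyMetadata`, `IsolatedMetadata`, `add_columns`, and the column names `parent_id` and `child_id` are hypothetical and do not come from the SDG code base.

```python
class LeakyMetadata:
    """Hypothetical class, NOT SDG's Metadata: the set below is shared by every instance."""

    column_list = set()  # class-level mutable attribute, created once

    def add_columns(self, names):
        # Mutates the shared class-level set rather than a per-instance one.
        self.column_list.update(names)


class IsolatedMetadata:
    """Same hypothetical interface, but each instance owns its own set."""

    def __init__(self):
        self.column_list = set()  # fresh set per instance

    def add_columns(self, names):
        self.column_list.update(names)


# Simulate two tests running in the same process, one after the other.
single_table = LeakyMetadata()
single_table.add_columns({"workclass", "fnlwgt", "age"})

multi_table = LeakyMetadata()
multi_table.add_columns({"parent_id", "child_id"})  # hypothetical column names

# The second instance also sees the columns recorded by the first one.
leaked = multi_table.column_list & {"workclass", "fnlwgt", "age"}

# With per-instance state the two objects stay independent.
a = IsolatedMetadata()
a.add_columns({"workclass", "fnlwgt", "age"})
b = IsolatedMetadata()
b.add_columns({"parent_id", "child_id"})
```

If the real cause is of this kind, the failure would depend on test ordering, so running the affected test in isolation and comparing the result against a full run is a cheap first check.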