hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0
3.27k stars 545 forks source link

Handling Fixed Column Relationships using FixedCombinationInspector and FixedCombinationTransformer #219

Open MooooCat opened 1 month ago

MooooCat commented 1 month ago

Description

This pull request introduces two new components to the Synthetic Data Generator (SDG) framework: FixedCombinationInspector and FixedCombinationTransformer. These components are designed to identify and handle columns in a DataFrame that have fixed relationships based on high covariance.

Additionally, this PR includes unit tests for both the FixedCombinationInspector and FixedCombinationTransformer to ensure their functionality and correctness.

Motivation and Context

This change is required to enhance the SDG framework's ability to handle and manage data with fixed column relationships. By identifying and processing these relationships, the framework can generate more accurate and meaningful synthetic data, which is crucial for various applications such as data sharing, model training, and system testing.

How has this been tested?

The changes have been tested using the following methods:

  1. Unit Tests:

    • test_fixed_combination_inspector.py: This test file verifies that the FixedCombinationInspector correctly identifies fixed relationships in a given DataFrame.
    • test_transformers_fixed_combination.py: This test file ensures that the FixedCombinationTransformer correctly removes and restores columns with fixed relationships during the conversion and reverse conversion processes.
  2. Manual Testing:

    • The components were manually tested with various DataFrame configurations to ensure they handle different scenarios correctly.

Types of changes

Checklist:

Wh1isper commented 1 month ago

A minor confusion: why we named them Fixed, which doesn't seem to be very intuitive as to what they are used for.

MooooCat commented 1 month ago

A minor confusion: why we named them Fixed, which doesn't seem to be very intuitive as to what they are used for.

Regarding the naming of fixed, I found that there are some columns in the table that have a fixed relationship between them, for example: column A is always twice the value of column B, or the values ​​of column C and column D have a fixed correspondence. I currently named it fixed. Do you have any suggestions?

Wh1isper commented 1 month ago

@MooooCat Thank you for the example. Considering the nature of the relationships between the columns, I suggest using the term “deterministic” instead of “fixed.” This term might better capture the predictable and consistent nature of these relationships and won’t be confused with fixes(for bug).