Open MooooCat opened 1 month ago
A minor confusion: why are they named Fixed? The name is not very intuitive about what they are used for.
Regarding the name “fixed”: I found that some columns in a table have a fixed relationship between them. For example, column A is always twice the value of column B, or the values of column C and column D have a fixed correspondence. For now I have named this “fixed”. Do you have any suggestions?
@MooooCat Thank you for the example. Given the nature of the relationships between the columns, I suggest using the term “deterministic” instead of “fixed.” It better captures the predictable, consistent nature of these relationships and will not be confused with bug fixes.
Description
This pull request introduces two new components to the Synthetic Data Generator (SDG) framework: FixedCombinationInspector and FixedCombinationTransformer. These components are designed to identify and handle columns in a DataFrame that have fixed relationships based on high covariance.
FixedCombinationInspector: This inspector calculates the covariance matrix of the DataFrame, ignoring NaN values, and identifies columns that have fixed relationships. It stores these relationships in a dictionary attribute called fixed_combinations.
FixedCombinationTransformer: This transformer processes the metadata to identify columns with fixed relationships and removes them during the conversion process. It also restores these columns during the reverse conversion process.
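For readers who want a concrete picture of the detect, remove, and restore flow described above, here is a minimal standalone sketch. It is not the code in this PR: the function names (find_fixed_combinations, drop_fixed, restore_fixed) are hypothetical, it uses pandas' Pearson correlation (covariance normalised to the range -1 to 1) in place of the raw covariance matrix, and it assumes the fixed relationship is linear so that a dropped column can be rebuilt from a fitted slope and intercept.

```python
import numpy as np
import pandas as pd


def find_fixed_combinations(df, tol=1e-9):
    """Map a base column to the columns that are linearly determined by it.

    Hypothetical helper: flags a pair when |Pearson r| is within tol of 1.
    """
    # DataFrame.corr() works pairwise and skips NaN values.
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    combos = {}
    claimed = set()  # columns already assigned to an earlier base column
    for i, base in enumerate(cols):
        if base in claimed:
            continue
        related = {
            c for c in cols[i + 1:]
            if c not in claimed and abs(corr.loc[base, c]) >= 1 - tol
        }
        if related:
            combos[base] = related
            claimed |= related
    return combos


def drop_fixed(df, combos):
    """Forward step: drop dependent columns and keep a linear fit for each."""
    fits = {}
    for base, related in combos.items():
        for col in related:
            pair = df[[base, col]].dropna()
            slope, intercept = np.polyfit(pair[base], pair[col], deg=1)
            fits[col] = (base, slope, intercept)
    return df.drop(columns=list(fits)), fits


def restore_fixed(df, fits):
    """Reverse step: rebuild each dropped column from its base column."""
    out = df.copy()
    for col, (base, slope, intercept) in fits.items():
        out[col] = slope * out[base] + intercept
    return out


df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 4.0],
    "A": [2.0, 4.0, 6.0, 8.0],   # always twice column B
    "C": [5.0, 1.0, 7.0, 2.0],   # unrelated to B
})
combos = find_fixed_combinations(df)
reduced, fits = drop_fixed(df, combos)
restored = restore_fixed(reduced, fits)
```

In this sketch the restored columns are appended at the end, so column order can differ from the input. A non-linear or categorical correspondence (such as the column C and column D example from the discussion) would need a lookup table rather than a linear fit.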
Additionally, this PR includes unit tests for both the FixedCombinationInspector and FixedCombinationTransformer to ensure their functionality and correctness.
Motivation and Context
This change is required to enhance the SDG framework's ability to handle and manage data with fixed column relationships. By identifying and processing these relationships, the framework can generate more accurate and meaningful synthetic data, which is crucial for various applications such as data sharing, model training, and system testing.
How has this been tested?
The changes have been tested using the following methods:
Unit Tests:
test_fixed_combination_inspector.py: This test file verifies that the FixedCombinationInspector correctly identifies fixed relationships in a given DataFrame.
test_transformers_fixed_combination.py: This test file ensures that the FixedCombinationTransformer correctly removes and restores columns with fixed relationships during the conversion and reverse conversion processes.
Manual Testing:
Types of changes
Checklist: