hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0

How to regulate Negative values in the Generated Data #231

Open Bhargav-Ravinuthala opened 1 week ago

Bhargav-Ravinuthala commented 1 week ago

Related Issues: https://github.com/hitsz-ids/synthetic-data-generator/issues/189

This is a follow-up to the closed issue above about handling negative values in the generated dataset.

Problem: The generated data contains negative values in columns that are strictly positive in the user's input data.

How to Recreate

Sample Data Book1.csv

Code (which I recreated from the original class):

import pandas as pd
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_processors.filter.positive_negative import PositiveNegativeFilter
from sdgx.data_models.metadata import Metadata

# Create data connector for csv file
data_connector = CsvConnector(path=r"C:\Users\Bhargav\Downloads\Book1.csv")

# Initialize synthesizer
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=300),  # For quick demo
    data_connector=data_connector
)

# Read the original data to analyze columns
original_data = pd.read_csv(r"C:\Users\Bhargav\Downloads\Book1.csv")

# Create metadata from original data
metadata = Metadata.from_dataframe(original_data)

# Initialize and configure the PositiveNegativeFilter
pos_neg_filter = PositiveNegativeFilter()
pos_neg_filter.fit(metadata)

# Add the filter to the synthesizer's pipeline
synthesizer.add_processor(pos_neg_filter)

# Fit the model
synthesizer.fit()

# Sample synthetic data
sampled_data = synthesizer.sample(1000)

# Save sampled data to CSV
output_path = r"C:\Users\Bhargav\Downloads\synthetic_data.csv"
sampled_data.to_csv(output_path, index=False)
print(f"Synthetic data saved to {output_path}")

# Print information about preserved value ranges
for column in sampled_data.columns:
    if pd.api.types.is_numeric_dtype(original_data[column]):
        original_min = original_data[column].min()
        original_max = original_data[column].max()
        synthetic_min = sampled_data[column].min()
        synthetic_max = sampled_data[column].max()

        print(f"\nColumn: {column}")
        print(f"Original range: [{original_min}, {original_max}]")
        print(f"Synthetic range: [{synthetic_min}, {synthetic_max}]")

I also tested with the original code:

import pandas as pd
from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.utils import download_demo_data

# This will download demo data to ./dataset
dataset_csv = download_demo_data()

# Create data connector for csv file
data_connector = CsvConnector(path=r"C:\Users\Bhargav\Downloads\Book1.csv")

# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    model=CTGANSynthesizerModel(epochs=1),  # For quick demo
    data_connector=data_connector,
)

# Fit the model
synthesizer.fit()

# Sample synthetic data
sampled_data = synthesizer.sample(1000)

# Save sampled data to CSV
output_path = r"C:\Users\Bhargav\Downloads\synthetic_data.csv"
sampled_data.to_csv(output_path, index=False)

print(f"Synthetic data saved to {output_path}")

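Expected Behaviour

The sign (polarity) of the generated values should follow the original data: a column that is only positive in the input should stay positive in the output.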
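The expected behaviour above can be checked mechanically. The following is a minimal sketch, not part of SDG: the helper names `column_polarity` and `polarity_violations` and the toy data are hypothetical, and it assumes numeric columns without missing values. It works on plain dictionaries of lists, so it has no dependencies.

```python
from typing import Dict, Mapping, Sequence


def column_polarity(values: Sequence[float]) -> str:
    """Classify a numeric column as 'non-negative', 'non-positive', or 'mixed'."""
    if all(v >= 0 for v in values):
        return "non-negative"
    if all(v <= 0 for v in values):
        return "non-positive"
    return "mixed"


def polarity_violations(
    original: Mapping[str, Sequence[float]],
    synthetic: Mapping[str, Sequence[float]],
) -> Dict[str, int]:
    """Count synthetic values whose sign contradicts the original column.

    Only columns that are purely non-negative or purely non-positive in the
    original data are checked; mixed-sign columns are skipped.
    """
    violations: Dict[str, int] = {}
    for name, values in original.items():
        polarity = column_polarity(values)
        if polarity == "non-negative":
            bad = sum(1 for v in synthetic[name] if v < 0)
        elif polarity == "non-positive":
            bad = sum(1 for v in synthetic[name] if v > 0)
        else:
            continue  # mixed-sign column: nothing to enforce
        if bad:
            violations[name] = bad
    return violations


# Hypothetical toy data, standing in for Book1.csv and the sampled output.
original = {"age": [23, 41, 35], "balance": [-10.0, 250.5, 0.0]}
synthetic = {"age": [30, -2, 44, -1], "balance": [-5.0, 12.0, 3.3, -40.0]}

# "age" is non-negative in the original, so its two negative samples are flagged;
# "balance" has mixed signs and is skipped.
print(polarity_violations(original, synthetic))
```

With pandas data frames such as `original_data` and `sampled_data` from the scripts above, one option is to pass `original_data.select_dtypes("number").to_dict("list")` and `sampled_data.to_dict("list")`, so that only numeric columns are compared.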

Related PRs: https://github.com/hitsz-ids/synthetic-data-generator/pull/217

Wh1isper commented 1 week ago

Thanks for reporting this. @jalr4ever, would you like to take a look?

jalr4ever commented 1 week ago

@Wh1isper Yeah, I have also noticed this issue recently. I will take this one.

Wh1isper commented 6 days ago

@jalr4ever Can you confirm this issue has been fixed in #232? I think we should make a release for this.

jalr4ever commented 6 days ago

@Wh1isper Yeah! The next release is just around the corner! 🎉

jalr4ever commented 5 days ago

@Bhargav-Ravinuthala Hi, we've just published a new release: check out version 0.2.2! You can use your original code, and SDG will internally ensure that the positive and negative properties of the values are preserved. No need to add filters manually!
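Since the fix depends on running 0.2.2 or later, it can help to confirm which version is actually installed before re-testing. This is a rough sketch under two assumptions: that the package is installed under the distribution name `sdgx` (matching the `sdgx` import in the scripts above), and that a simple numeric comparison is enough. The helpers `parse_release` and `installed_release` are hypothetical and ignore pre-release ordering.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(text: str) -> Tuple[int, ...]:
    """Extract the leading numeric release from a version string.

    '0.2.2' -> (0, 2, 2); suffixes such as 'rc1' or '.dev3' are dropped.
    This is a simplification, not a full version parser.
    """
    parts = []
    for part in text.split("."):
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if digits != part:
            break  # a suffix followed the number; stop here
    return tuple(parts)


def installed_release(package: str) -> Optional[Tuple[int, ...]]:
    """Return the installed release of `package`, or None if it is not installed."""
    try:
        return parse_release(version(package))
    except PackageNotFoundError:
        return None


REQUIRED = (0, 2, 2)  # release mentioned in this thread
found = installed_release("sdgx")  # assumed distribution name

if found is None:
    print("sdgx is not installed in this environment")
elif found < REQUIRED:
    print(f"sdgx {found} is older than {REQUIRED}; upgrade before re-testing")
else:
    print(f"sdgx {found} should include the fix")
```

For stricter comparisons, including pre-releases, a dedicated version parser is a safer choice than this tuple comparison.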

Bhargav-Ravinuthala commented 5 days ago

Is there any way I can collaborate? I have been going through your entire codebase. We are trying to use your code for one of our projects and we need to fix this...

jalr4ever commented 4 days ago

@Bhargav-Ravinuthala Hi, did you try the new release? Did it not resolve the bug? I tested your data locally with 0.2.2 and it seems okay. 🧐