hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0
3.27k stars 545 forks

Maintenance: Update CTGAN Example to Use Latest SDG #213

Closed: MooooCat closed this 3 months ago

MooooCat commented 3 months ago

Description

This pull request updates the sdgx_example_ctgan.ipynb notebook to use the latest version of the Synthetic Data Generator (SDG) from the GitHub repository. The changes include:

  1. Updating the installation command to use the GitHub repository instead of the PyPI package.
  2. Increasing the number of training epochs for the CTGAN model from 2 to 128.
  3. Removing the manual data preprocessing steps (remove_empty_rows and clear_na), since they will be integrated into sdgx.processor in the future.
  4. Updating the logging messages and outputs to reflect the changes in the SDG framework.
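
As a rough illustration of item 1, installing from the repository rather than from PyPI could look like the sketch below. The repository URL is an assumption inferred from the project name, and the helper names `build_install_command` and `install_from_github` are hypothetical, not part of SDG. In a notebook the same thing is usually a single `%pip install git+<url>` cell.

```python
import subprocess
import sys
from typing import List

# Assumed repository URL, inferred from the project name
# "hitsz-ids/synthetic-data-generator"; adjust if the project lives elsewhere.
SDG_REPO_URL = "https://github.com/hitsz-ids/synthetic-data-generator.git"


def build_install_command(repo_url: str = SDG_REPO_URL) -> List[str]:
    """Return a pip command that installs a package straight from a Git repository."""
    # "git+<url>" is pip's VCS requirement syntax. Invoking pip through the
    # current interpreter keeps the install inside the notebook's environment.
    return [sys.executable, "-m", "pip", "install", f"git+{repo_url}"]


def install_from_github(repo_url: str = SDG_REPO_URL) -> None:
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.run(build_install_command(repo_url), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so executing this sketch
    # has no side effects on the environment.
    print(" ".join(build_install_command()))
```

Installing from the default branch means the notebook tracks unreleased code, so its behavior can change between runs; appending `@<tag-or-commit>` to the URL is one way to pin a specific revision.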

Motivation and Context

The example notebook should use the latest features and improvements of the SDG framework, which are not yet available in the PyPI package.

Installing from the GitHub repository picks up the most recent updates and bug fixes.

How has this been tested?

The updated notebook was run step by step in a local development environment to confirm that the synthetic data generation process works as expected with the new settings and dependencies.

Types of changes

Checklist: