hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0

Segmentation Fault in CTGAN Execution, Resolved by Upgrading scikit-learn #208

Open MooooCat opened 1 month ago

MooooCat commented 1 month ago

Description

A segmentation fault occurs when running the CTGAN model with the example code provided in the example/ directory of the SDG framework.

This error was resolved by upgrading the scikit-learn version from 1.4.3 to 1.5.1.
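To check whether an environment is still on a scikit-learn release older than the one that resolved the crash here, something like the following could be used. This is only a sketch: `MIN_SKLEARN`, `parse_version`, and `sklearn_is_recent_enough` are hypothetical names made up for illustration, and the 1.5.1 threshold is simply the version reported above, not a confirmed minimum.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: 1.5.1 is the version that worked in this report,
# not a verified lower bound for the fix.
MIN_SKLEARN = (1, 5, 1)


def parse_version(text):
    """Turn '1.5.1' or '1.4.1.post1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts)


def sklearn_is_recent_enough(installed):
    """Return True if the given version string is at least MIN_SKLEARN."""
    return parse_version(installed) >= MIN_SKLEARN


try:
    installed = version("scikit-learn")
except PackageNotFoundError:
    print("scikit-learn is not installed")
else:
    status = "ok" if sklearn_is_recent_enough(installed) else "older than 1.5.1"
    print(f"scikit-learn {installed}: {status}")
```

If the check reports an older version, upgrading scikit-learn as described above is the step that resolved the crash in this report.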

Reproduce

  1. Clone the SDG repository and navigate to the example/ directory.
  2. Ensure the xxx_training_data.csv file is present in the directory.
  3. Run the sdg_script.py script using Python 3.12 with the command: python -X faulthandler sdg_script.py.
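The `-X faulthandler` flag in step 3 can also be switched on from inside the script with the standard-library `faulthandler` module, which helps when the command line cannot be changed. A minimal sketch, where `enable_fault_dumps` is a hypothetical helper name:

```python
import faulthandler
import sys


def enable_fault_dumps(stream=None):
    """Enable faulthandler on the given stream (default: sys.stderr).

    The stream must have a real file descriptor and stay open
    for as long as the handler is enabled.
    """
    target = stream if stream is not None else sys.stderr
    # all_threads=True dumps the traceback of every running thread on a crash.
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_fault_dumps()` at the top of sdg_script.py, before `synthesizer.fit()`, should produce the same kind of traceback on a segmentation fault as the command-line flag.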

Expected behavior

The script should run without errors and produce the following log output:

2024-07-30 08:52:24.002 | INFO     | sdgx.models.ml.single_table.ctgan:fit:221 - CTGAN training finished.
2024-07-30 08:52:24.002 | INFO     | sdgx.synthesizer:fit:324 - Model fit... Finished

Context

Error message
(base) mooocat@mooocatsiMac example % python -X faulthandler sdg_script.py
2024-07-29 22:42:12.776 | INFO     | sdgx.data_models.metadata:from_dataloader:289 - Inspecting metadata...
2024-07-29 22:42:12.852 | INFO     | sdgx.data_models.metadata:update_primary_key:491 - Primary 

... 

2024-07-29 22:42:13.097 | INFO     | sdgx.data_processors.transformers.column_order:convert:52 - Converting data using ColumnOrderTransformer... Finished (No action).
2024-07-29 22:42:13.098 | INFO     | sdgx.models.components.optimize.sdv_ctgan.data_transformer:fit:114 - Fitting continuous column ELEC_CUST_NO...
2024-07-29 22:42:13.102 | INFO     | sdgx.models.components.optimize.sdv_ctgan.data_transformer:_fit_continuous:57 - Fitting continues column ELEC_CUST_NO in <_fit_continuous>.
Fatal Python error: Segmentation fault

Thread 0x00007ff84764dfc0 (most recent call first):
  File "/opt/anaconda3/lib/python3.12/site-packages/sklearn/cluster/_kmeans.py", line 754 in _kmeans_single_lloyd
  File "/opt/anaconda3/lib/python3.12/site-packages/sklearn/cluster/_kmeans.py", line 1536
zsh: segmentation fault  python -X faulthandler error_script.py

sdg_script.py

from pathlib import Path

from sdgx.data_connectors.csv_connector import CsvConnector
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer

# Path to the training data (placeholder file name).
file_path = './xxx_training_data.csv'
path_obj = Path(file_path)

# Load the CSV through SDG's connector and data loader.
data_connector = CsvConnector(path=path_obj)
data_loader = DataLoader(data_connector)

# Infer column metadata from the loaded data.
loan_metadata = Metadata.from_dataloader(data_loader)

synthesizer = Synthesizer(
    metadata=loan_metadata,
    model=CTGANSynthesizerModel(epochs=2),
    data_connector=data_connector,
)

# The segmentation fault is raised during this call (see the log above).
synthesizer.fit()

Wh1isper commented 1 month ago

The problem seems to occur only on macOS? I tried to pin the version to greater than 1.4.3, but Python 3.8 doesn't support it.

Maybe we could pin this when Python 3.8 reaches end of life (2024-10).
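One possible way to apply such a pin only where the interpreter supports it is to choose the requirement by Python version. The sketch below is purely illustrative and is not the project's actual packaging configuration: `sklearn_requirement` is a hypothetical helper, and the 3.9 cutoff assumes the newer scikit-learn releases require Python 3.9 or later.

```python
import sys


def sklearn_requirement(version_info=sys.version_info):
    """Return a scikit-learn requirement string for the given interpreter.

    Assumption: scikit-learn >= 1.5.1 is only installable on Python 3.9+,
    so older interpreters are left unpinned.
    """
    if tuple(version_info[:2]) >= (3, 9):
        return "scikit-learn>=1.5.1"
    return "scikit-learn"


print(sklearn_requirement())
```

The same idea can be written declaratively as a PEP 508 environment marker, for example `scikit-learn>=1.5.1; python_version >= "3.9"`, in a dependency list.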