sdv-dev / SDGym

Benchmarking synthetic data generation methods.

Error when running a custom model using benchmark_single_table #327

Open T0217 opened 3 months ago

T0217 commented 3 months ago

Environment Details

Error Description

When running the same code as in #321, I encountered the following error.

[image: screenshot of the error]

Steps to reproduce

import os
import shutil
import sdgym
from sdgym import create_single_table_synthesizer
from sdgym.synthesizers import (UniformSynthesizer,
                                GaussianCopulaSynthesizer,
                                TVAESynthesizer)
import warnings
warnings.filterwarnings('ignore')

synthesizers = [
    UniformSynthesizer,
    GaussianCopulaSynthesizer,
    TVAESynthesizer
]

# Custom synthesizer: CTGAN from the ydata-synthetic package
def ctgan_get_trained_synthesizer(data, metadata):
    from ydata_synthetic.synthesizers.regular import RegularSynthesizer
    from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

    ctgan_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
    train_args = TrainParameters(epochs=2)

    synthesizer = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)

    num_cols = [col for col, sdtype in metadata['columns'].items() if sdtype['sdtype'] in ['numerical', 'datetime']]
    cat_cols = [col for col, sdtype in metadata['columns'].items() if sdtype['sdtype'] == 'categorical']

    synthesizer.fit(data=data,
                    train_arguments=train_args,
                    num_cols=num_cols,
                    cat_cols=cat_cols)

    return synthesizer

def sample_from_synthesizer(synthesizer, n_rows):
    synthetic_data = synthesizer.sample(n_rows)
    return synthetic_data

YData_CTGANSynthesizer = create_single_table_synthesizer(
    get_trained_synthesizer_fn=ctgan_get_trained_synthesizer,
    sample_from_synthesizer_fn=sample_from_synthesizer,
    display_name='YData-CTGAN'
)

custom_synthesizers = [YData_CTGANSynthesizer]

# Remove the detailed results folder if it is left over from a previous run
detailed_results_folder = r"C:\Users\18840\Desktop\result"

# os.path.isdir() already returns False for a missing path
if os.path.isdir(detailed_results_folder):
    print('The detailed results folder already exists and will be deleted.')
    shutil.rmtree(detailed_results_folder, ignore_errors=True)
    print('-' * 50)

results = sdgym.benchmark_single_table(
    synthesizers=synthesizers,
    custom_synthesizers=custom_synthesizers,
    show_progress=True,
    multi_processing_config={
        'package_name': 'multiprocessing',
        'num_workers': 8
    },
    sdv_datasets=['adult'],
    detailed_results_folder=detailed_results_folder
)

srinify commented 2 months ago

Hi there @T0217 👋 Do you mind updating SDGym and related libraries in our ecosystem to see if you're still running into this issue? We released some changes, so I'm always curious to validate if it's still relevant!

Second -- this is a bit challenging for us to debug because we aren't the authors of Custom:YData-CTGAN. I'm curious if you were able to figure out the source of your error since posting this issue?

T0217 commented 2 months ago

Thanks for the feedback. I've updated SDGym and tested again. The TypeError with the YData CTGAN model persists. It is likely caused by attributes or components within the model that hold weak references, which pickle cannot serialize. Switching from pickle to dill for serialization, as suggested in #328, or using the model from the SDV library resolves this problem. However, the issue mentioned in #321 remains unresolved, regardless of whether the SDV or YData model is used.
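
For readers unfamiliar with this failure mode, here is a minimal sketch of how a weak reference stored on an object makes pickle raise a TypeError. It uses only the standard library. `ToyModel` and `Callback` are hypothetical stand-ins, not classes from ydata-synthetic or SDGym, and the exact attribute that triggers the error in the YData model is not identified in this thread.

```python
import pickle
import weakref


class Callback:
    """Hypothetical stand-in for an internal component of a model."""


class ToyModel:
    """Hypothetical model that stores a weak reference as an attribute."""

    def __init__(self):
        self.callback = Callback()
        # Assumption for illustration: some attribute holds a weakref.
        self.callback_ref = weakref.ref(self.callback)


def try_pickle(obj):
    """Return None if pickling succeeds, else the TypeError message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


# An object without weak references pickles without error.
plain_error = try_pickle(Callback())

# The model holding a weakref cannot be pickled by the standard library.
error_message = try_pickle(ToyModel())
print(error_message)
```

The exact wording of the message varies between Python versions, so it is not reproduced here. dill, mentioned in #328, is a third-party package and is left out of the sketch on purpose.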