sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

[HELP] Is CTGAN reproducible? #380

Closed limhasic closed 3 months ago

limhasic commented 4 months ago

Problem description

import numpy as np
import torch
from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=1, verbose = True)
ctgan.set_random_state(123)

ctgan.fit(real_data, discrete_columns)

# set seed
seed = 42

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed) 

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

SEED_VALUE = 42

np.random.seed(SEED_VALUE)
torch.manual_seed(SEED_VALUE)

# Create synthetic data
#ctgan.set_random_state(123)
synthetic_data1 = ctgan.sample(1000)
#ctgan.set_random_state(123)
synthetic_data2 = ctgan.sample(1000)
# ctgan.set_random_state(123) 

# Compare synthetic_data1 and synthetic_data2
if np.array_equal(synthetic_data1, synthetic_data2):
    print("synthetic_data1 and synthetic_data2 are equal.")
else:
    print("synthetic_data1 and synthetic_data2 are not equal.")

I have tried this a thousand times, but synthetic_data1 and synthetic_data2 are still not equal.

srinify commented 3 months ago

Hi there @limhasic, I'm not able to reproduce this. With both 1 and 10 epochs, I was able to generate exactly the same data from 2 different CTGAN models.

from ctgan import CTGAN
from ctgan import load_demo

real_data = load_demo()

# Names of the columns that are discrete
discrete_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

ctgan = CTGAN(epochs=1, verbose = True)
ctgan.set_random_state(123)
ctgan.fit(real_data, discrete_columns)

ctgan2 = CTGAN(epochs=1, verbose = True)
ctgan2.set_random_state(123)
ctgan2.fit(real_data, discrete_columns)

a = ctgan.sample(100)
b = ctgan2.sample(100)

a.equals(b)

^ The last line returns True and you can also visually inspect and see that the datasets are the same.

limhasic commented 3 months ago

Could you share your environment? I got False again.

I ran it on:

python 3.8.10
ctgan 0.9.1
numpy 1.24.4
torch 1.10.1+cu111
ubuntu 20.04...

srinify commented 3 months ago

I ran my code in Google Colab: https://colab.research.google.com/

Python 3.10.12
ctgan 0.10.0
numpy 1.25.2
torch 2.2.1
Ubuntu 18.04.3 LTS (I believe, based on what Google said for Colab)
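
When comparing environments like the two lists above, it can help to generate the versions programmatically instead of typing them by hand. The following is a minimal sketch using only the Python standard library; `environment_report` is a hypothetical helper name, and the default package list simply mirrors the packages mentioned in this thread.

```python
import platform
from importlib import metadata


def environment_report(packages=("ctgan", "numpy", "torch")):
    """Return a dict mapping component names to version strings."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Reads the version from the installed distribution's metadata.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for component, version in environment_report().items():
    print(f"{component}: {version}")
```

Pasting the printed lines into an issue gives both sides the same fields in the same order, which makes version mismatches easier to spot.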

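A few things to consider:

  1. Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly?

  2. When you inspect both dataframes, where are the differences? Specific rows? Specific columns? Number of rows? Etc.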

srinify commented 3 months ago

@limhasic after some more investigation, it turns out we actually don't support reproducibility when fitting a synthesizer. The reproducibility we do support right now is only during sampling (generating 2 samples from the same synthesizer with the same random state).
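
To illustrate what "same synthesizer, same random state" means for sampling, here is a toy sketch using only Python's standard library. `ToySampler` is a hypothetical stand-in, not CTGAN's implementation. It assumes the synthesizer keeps a single random state that every `sample` call advances, which would be consistent with the original report, where two back-to-back `sample` calls returned different data.

```python
import random


class ToySampler:
    """Hypothetical stand-in for a fitted synthesizer; not CTGAN's actual code."""

    def __init__(self):
        self._rng = random.Random()

    def set_random_state(self, seed):
        # Reset the private generator to a known starting point.
        self._rng = random.Random(seed)

    def sample(self, n):
        # Each draw advances the generator, so a later call
        # continues from where this one stopped.
        return [round(self._rng.gauss(0.0, 1.0), 6) for _ in range(n)]


sampler = ToySampler()

sampler.set_random_state(123)
first = sampler.sample(5)
second = sampler.sample(5)   # continues from the advanced state

sampler.set_random_state(123)
repeat = sampler.sample(5)   # state was reset, so this replays `first`

print("first == second:", first == second)
print("first == repeat:", first == repeat)
```

If the real library behaves the same way, the analogous step would be to call `ctgan.set_random_state(123)` immediately before each `ctgan.sample(...)` call and then compare the two results. That is an assumption to verify, not a documented guarantee.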

Out of curiosity, what's the motivation to have reproducibility during model fitting itself?

limhasic commented 3 months ago

@srinify I am working on synthetic data.

So I am very interested in evaluation metrics and generation methods for comparing original and synthetic data.

However, when I generated data with CTGAN for evaluation, I got different results each time.

Since sampling was not reproducible, I started thinking about seed control for fitting.

Since it is still morning, I will test it in the Colab environment you sent.

Also:

  1. Have you tried this with SDV's CTGANSynthesizer instead of using CTGAN directly? -> I tried both while changing environments.

  2. When you inspect both dataframes, where are the differences? Specific rows? Specific columns? Number of rows? Etc. -> To begin with, specific rows are different.
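
The second question (which rows, which columns, how many rows) can be answered mechanically. Below is a minimal sketch in plain Python; `describe_differences` is a hypothetical helper, and the two small tables are made-up illustrative values, not output from CTGAN. It assumes each table is a list of dict rows, which a pandas DataFrame can produce with `df.to_dict("records")`.

```python
def describe_differences(rows_a, rows_b):
    """Summarise how two tables (lists of dict rows) differ."""
    differing_rows = []
    differing_columns = set()
    # Compare only the rows both tables have; a length
    # mismatch shows up in "row_counts" instead.
    for index, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        columns = set(row_a) | set(row_b)
        changed = {c for c in columns if row_a.get(c) != row_b.get(c)}
        if changed:
            differing_rows.append(index)
            differing_columns |= changed
    return {
        "row_counts": (len(rows_a), len(rows_b)),
        "differing_rows": differing_rows,
        "differing_columns": sorted(differing_columns),
    }


# Made-up example rows, for illustration only.
table_a = [
    {"age": 39, "workclass": "State-gov", "income": "<=50K"},
    {"age": 50, "workclass": "Private", "income": ">50K"},
]
table_b = [
    {"age": 39, "workclass": "State-gov", "income": "<=50K"},
    {"age": 47, "workclass": "Private", "income": "<=50K"},
]

print(describe_differences(table_a, table_b))
```

One caveat: missing values stored as NaN compare unequal to themselves, so such cells would be reported as differences unless they are normalised first.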

limhasic commented 3 months ago

Closing this: I confirmed that sampling is reproducible in the latest version of CTGANSynthesizer.

limhasic commented 3 months ago

Sampling is reproducible on simple data, but once the number of columns exceeds 25, reproducibility is lost. When I wake up, I see the generator emitting different data.

srinify commented 3 months ago

Thanks for sharing context on your use case, @limhasic. I've opened this feature request to add reproducibility at the model fitting level, citing your use case: https://github.com/sdv-dev/SDV/issues/2022

DataCebo is a very small team and we use community interest to help us prioritize what to work on! So we hope more people will add their use cases to that issue over time.

Closing this issue out, as the software is working as intended right now.