sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Out of memory while fit #1381

Closed saswat0 closed 1 year ago

saswat0 commented 1 year ago

Environment details


Problem description

I'm trying to generate synthetic data using SDV's CTGAN, but the code terminates midway due to a memory overflow. My dataset is 195,634 rows x 24 columns and my system has 64 GiB of memory. While running fit, memory usage grows until the process is killed.

What I already tried

from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata, verbose=True)
synthesizer.fit(df)

synthetic_data = synthesizer.sample(num_rows=10)

This is my error

---------------------------------------------------------------------------
TerminatedWorkerError                     Traceback (most recent call last)
Cell In[37], line 4
      1 from sdv.single_table import CTGANSynthesizer
      3 synthesizer = CTGANSynthesizer(metadata, verbose=True)
----> 4 synthesizer.fit(df)
      6 synthetic_data = synthesizer.sample(num_rows=10)

File ~/anaconda3/envs/synthetic/lib/python3.8/site-packages/sdv/single_table/base.py:457, in BaseSynthesizer.fit(self, data)
    455 self._random_state_set = False
    456 processed_data = self._preprocess(data)
--> 457 self.fit_processed_data(processed_data)

File ~/anaconda3/envs/synthetic/lib/python3.8/site-packages/sdv/single_table/base.py:441, in BaseSynthesizer.fit_processed_data(self, processed_data)
    434 def fit_processed_data(self, processed_data):
    435     """Fit this model to the transformed data.
    436 
    437     Args:
    438         processed_data (pandas.DataFrame):
    439             The transformed data used to fit the model to.
    440     """
--> 441     self._fit(processed_data)
    442     self._fitted = True
    443     self._fitted_date = datetime.datetime.today().strftime('%Y-%m-%d')
...
    392         self = None

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

How can SDV be used for datasets of this size? Is there any provision for training the model incrementally on smaller subsets instead?
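(Editorial note: one pragmatic approximation of the "smaller subsets" idea above is simply to fit on a random sample of the rows. This is a hedged sketch, not an official SDV feature; the DataFrame and sample size here are made up.)

```python
import pandas as pd

# Hypothetical stand-in for the real 195,634-row dataset.
df = pd.DataFrame({'a': range(200_000), 'b': ['x', 'y'] * 100_000})

# Draw a reproducible random subset small enough to fit in memory.
subset = df.sample(n=50_000, random_state=42)

# The subset can then be passed to synthesizer.fit(subset) in place of df.
```

Sampling uniformly at random preserves the marginal distributions in expectation, but rare categories may be underrepresented in the subset.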

npatki commented 1 year ago

Hi @saswat0, I'm curious if you can speak more about your use case. What are you hoping to use the synthetic data for?

We can definitely look into why this is happening with CTGAN. But do note that the SDV offers multiple different synthesizers outside of just CTGAN. Depending on your use case, another synthesizer might be a better option and it would still allow you to use all of the SDV features such as constraints, conditional sampling, etc.


saswat0 commented 1 year ago

@npatki Thanks for the response

I have a dataset containing sensitive (PII) customer data, and I want to synthesise a new dataset from it to make public. The generated data must be drawn from the same distribution, and the new rows should be indistinguishable from the real ones (some ML models are trained on and perform well on the real data, and their results should stay consistent on the new data).

I used GaussianCopulaSynthesizer as per your advice, and it gave reasonable results. But since the quality of the generated data is of utmost concern, I'm leaning toward NN models rather than statistical ones.

saswat0 commented 1 year ago

I'm facing the same issue with TVAESynthesizer

npatki commented 1 year ago

Hi @saswat0, thanks for the details.

One thing you can try with CTGANSynthesizer is to preprocess all the categorical columns. You may want to try the LabelEncoder. This notebook has some more information.

from rdt.transformers.categorical import LabelEncoder

synthesizer = CTGANSynthesizer(metadata)
synthesizer.auto_assign_transformers(data)

synthesizer.update_transformers(column_name_to_transformer={
    'categorical_column_name': LabelEncoder(add_noise=True),
    'categorical_column_name_2': LabelEncoder(add_noise=True),
    ...
})

I'm not sure yet what effect this would have on the quality. Do let us know if you experiment with this!

I'd always recommend checking your quality using the SDMetrics quality report. (This blog post may be useful too.)

saswat0 commented 1 year ago

Hi @npatki, I tried this method as well, but it failed. I had some success after reducing the real dataset's size and moving to a bigger machine. Is there any provision for using a distributed setup (a Spark cluster with several nodes and shared memory)?

thaddywu commented 1 year ago

Hi @npatki, I have the same issue: SDV still seems to use one-hot encoding for categorical columns even when LabelEncoder is specified as the transformer. Thank you! ;)

npatki commented 1 year ago

Hi @saswat0, I don't believe the CTGAN model is currently setup to make use of distributed infra. I see you filed CTGAN issue 290, which we can continue to keep open to discuss this.

SDV still seems to use one-hot encoding for categorical columns even specifying LabelEncoder as the transformer

@thaddywu this is unexpected! Do you have any code or examples to suggest that the LabelEncoder is being ignored and that one hot encoding is being used instead? I tried this with the demo data and it appears that the synthesizer is correctly preprocessing the data using label encoding. We can file a separate issue to look into this.

npatki commented 1 year ago

Hi everyone, I think this original discussion has been split into several different issues that are currently being tracked, so I'm closing this off as a duplicate. Feel free to reply to any of the issues below based on your feedback.

See CTGAN #290 for multi-GPU support.

See #1450 for issues when applying the LabelEncoder to CTGAN (this includes a workaround you can use in the meantime).

See #1451 as an umbrella issue for performance improvements to the CTGAN synthesizer.