sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

ArrayMemoryError when using CTGAN (assuming numerical columns are discrete) #1433

Closed thaddywu closed 1 year ago

thaddywu commented 1 year ago

Environment Details

Error Description

I created a large table with 300,000 rows and one categorical column. When using CTGAN and specifying LabelEncoder as the transformer, SDV attempts to allocate memory for a (300000, 300000) array, even though the transformation is done before fitting. SDV seems to apply one-hot encoding to the already-transformed categorical column. From my perspective, SDV should treat transformed categorical columns as floating-point numbers in this case. Am I understanding this correctly? This issue is related to #1381.

Moreover, GaussianCopulaSynthesizer works correctly on the same data.
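
For scale: one-hot encoding 300,000 rows against 300,000 distinct categories as int64 needs exactly the 671 GiB reported in the traceback below. A quick back-of-the-envelope check:

rows = n_categories = 300_000
cell_bytes = 8                            # int64
total = rows * n_categories * cell_bytes  # 7.2e11 bytes
print(total / 2**30)                      # ~670.6 GiB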

Steps to reproduce

Code snippet

from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer
from rdt.transformers import LabelEncoder
import pandas as pd

# one categorical column with 300,000 unique values
content = ["type" + str(i) for i in range(300000)]
df = pd.DataFrame({"data": content})
print(df.head(5))

metadata = SingleTableMetadata()
metadata.add_column("data", sdtype="categorical")

synthesizer = CTGANSynthesizer(metadata, verbose=True)
# synthesizer = GaussianCopulaSynthesizer(metadata)  # this one works
synthesizer.auto_assign_transformers(df)
synthesizer.update_transformers(column_name_to_transformer={
    'data': LabelEncoder(add_noise=True)
})
print(synthesizer.get_transformers())

# preprocess up front: the column comes back as floats ...
processed_data = synthesizer.preprocess(df)
print(processed_data.head(5))

# ... yet fitting still one-hot encodes it and runs out of memory
synthesizer.fit_processed_data(processed_data)

Console Output

    data
0  type0
1  type1
2  type2
3  type3
4  type4
/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/sdv/single_table/base.py:275: UserWarning: Replacing the default transformer for column 'data' might impact the quality of your synthetic data.
  warnings.warn(
{'data': LabelEncoder(add_noise=True)}
       data
0  0.135934
1  1.643127
2  2.649922
3  3.039663
4  4.955180

Error

joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
    r = call_item()
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 620, in __call__
    return self.func(*args, **kwargs)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/ctgan/data_transformer.py", line 129, in _transform_discrete
    return ohe.transform(data).to_numpy()
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/rdt/transformers/base.py", line 52, in wrapper
    return function(self, *args, **kwargs)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/rdt/transformers/base.py", line 367, in transform
    transformed_data = self._transform(columns_data)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/rdt/transformers/categorical.py", line 388, in _transform
    return self._transform_helper(data)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/rdt/transformers/categorical.py", line 356, in _transform_helper
    array = (coded == dummies).astype(int)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 671. GiB for an array with shape (300000, 300000) and data type int64
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/ssd/thaddywu/sdv/example.py", line 21, in <module>
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/sdv/single_table/base.py", line 457, in fit_processed_data
    self._fit(processed_data)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/sdv/single_table/ctgan.py", line 117, in _fit
    self._model.fit(processed_data, discrete_columns=discrete_columns)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/ctgan/synthesizers/base.py", line 50, in wrapper
    return function(self, *args, **kwargs)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/ctgan/synthesizers/ctgan.py", line 308, in fit
    train_data = self._transformer.transform(train_data)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/ctgan/data_transformer.py", line 179, in transform
    column_data_list = self._parallel_transform(
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/ctgan/data_transformer.py", line 163, in _parallel_transform
    return Parallel(n_jobs=-1)(processes)
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/mnt/ssd/thaddywu/.local/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 671. GiB for an array with shape (300000, 300000) and data type int64

Thank you! ;)

npatki commented 1 year ago

Hi @thaddywu, thanks for this detailed example. Very useful for us!

It seems that processed_data is correctly using the assigned LabelEncoder transformer. If I inspect it, I see that the column is now fully numerical, with floats.

[screenshot: processed_data with the 'data' column shown as floats]

The problem is that SDV continues to tell the underlying CTGAN model that the data is discrete (even though it's numerical now). So CTGAN treats each floating point value (e.g. 0.709244) as its own categorical value. Not what we want!
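
You can see the mismatch directly from the objects in the reproduction script above (plain inspection, not SDV internals):

print(metadata.to_dict()['columns']['data'])  # {'sdtype': 'categorical'}
print(processed_data['data'].dtype)           # float64

Because the metadata still says categorical, CTGAN receives discrete_columns=['data'] (see the traceback's call to self._model.fit) and one-hot encodes every distinct float, hence the (300000, 300000) array.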

Let's keep this issue open until we track down a fix for the bug.

Workaround

You can still get the desired outcome by using the RDT library outside of SDV to do your own pre- and post-processing. Something like this:

from rdt import HyperTransformer
from rdt.transformers.categorical import LabelEncoder
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# preprocess the data ourselves using the RDT library
ht = HyperTransformer()
ht.detect_initial_config(df)
ht.update_transformers(column_name_to_transformer={
    'data': LabelEncoder(add_noise=True)
})
use_df = ht.fit_transform(df)

# SDV can just handle the processed, numerical data
metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'data': { 'sdtype': 'numerical' }
    }
})

synthesizer = CTGANSynthesizer(metadata, epochs=1)
synthesizer.fit(use_df)
raw_synthetic_data = synthesizer.sample(num_rows=10)

# post process using the RDT
synthetic_data = ht.reverse_transform(raw_synthetic_data)

For the SDV Team

We use the method detect_discrete_columns to identify which columns are categorical, and then pass them along in CTGAN's fit function.

detect_discrete_columns makes an assumption that no longer holds once the user has assigned transformers themselves: the output of those transformers may have changed which columns are discrete. Mainly:

Applying a categorical transformer (such as LabelEncoder) turns a discrete column into a numerical one.

I'm not sure how yet, but this should be factored in.
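
One possible direction, sketched below. This is only an illustration against assumed interfaces (a metadata.columns dict plus a mapping of user-assigned transformers), not SDV's actual implementation:

from rdt.transformers.categorical import LabelEncoder

def detect_discrete_columns_sketch(metadata, assigned_transformers):
    # Treat a column as discrete only if its sdtype is categorical or
    # boolean AND no user-assigned transformer has already converted
    # it into a numerical representation.
    discrete = []
    for column, spec in metadata.columns.items():
        if spec.get('sdtype') not in ('categorical', 'boolean'):
            continue
        transformer = assigned_transformers.get(column)
        # e.g. LabelEncoder outputs floats, so the processed column
        # should go to CTGAN as continuous, not discrete
        if isinstance(transformer, LabelEncoder):
            continue
        discrete.append(column)
    return discrete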

saswat0 commented 1 year ago

@npatki Could you please extend the workaround in your comment above to a more generalised case? Say there are n columns: m of them numerical, p categorical, and k of them hitting this error.

Using the above approach converts all columns to numerical.

npatki commented 1 year ago

Hi @saswat0, my general workaround is to use the RDT library to do your own pre- and post-processing. RDT lets you choose which transformers to apply to which columns.

You can browse details of the RDT library here: https://docs.sdv.dev/rdt
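
For instance, here is a minimal sketch of the mixed-column case described above (the column names 'amount', 'city', and 'user_id' are hypothetical, and df is assumed to be your original DataFrame): encode only the high-cardinality column yourself, and let the rest pass through untouched so CTGAN still models them natively.

from rdt import HyperTransformer
from rdt.transformers.categorical import LabelEncoder
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

ht = HyperTransformer()
ht.detect_initial_config(df)

# encode only the problematic high-cardinality column ourselves
ht.update_transformers(column_name_to_transformer={
    'user_id': LabelEncoder(add_noise=True)
})
# leave the well-behaved columns unchanged; removed columns pass
# through the HyperTransformer untouched
ht.remove_transformers(column_names=['amount', 'city'])

use_df = ht.fit_transform(df)

# metadata describes the *processed* table: 'user_id' is now numerical,
# the untouched columns keep their original sdtypes
metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'amount': {'sdtype': 'numerical'},
        'city': {'sdtype': 'categorical'},
        'user_id': {'sdtype': 'numerical'},
    }
})

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(use_df)
synthetic_data = ht.reverse_transform(synthesizer.sample(num_rows=10))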

saswat0 commented 1 year ago

Got it! Thanks for resolving this @npatki

npatki commented 1 year ago

Hi everyone, it seems like there are multiple problems being discussed here, so I've split them up into separate issues.

  1. #1451, for general performance improvements to CTGAN

  2. #1450, for the bug we noticed (CTGAN modeling is still inefficient after adding transformers)

I'm marking this issue as a duplicate in favor of the above two. If you have more thoughts, please feel free to continue the conversation in either of the above open issues. Thanks!

jalr4ever commented 3 weeks ago

When I was using CTGAN, I also encountered memory issues (https://github.com/sdv-dev/SDV/issues/2189#issuecomment-2307299551) caused by categorical columns. I then updated the column's encoder to UniformEncoder (similar to what GaussianCopula uses), but the result was the same as in this issue: it still didn't work and memory usage remained high. This is really strange. Could CTGAN process the data based on the assigned transformer instead? For example, for frequency-style encoders, could it use numerical handling rather than deciding from whether the metadata marks a column as discrete or continuous?
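
For reference, updating a column to UniformEncoder presumably looks something like this (a guess at the code in question, since the exact snippet hasn't been shared; 'data' is a placeholder column name):

from rdt.transformers.categorical import UniformEncoder

synthesizer.auto_assign_transformers(df)
synthesizer.update_transformers(column_name_to_transformer={
    'data': UniformEncoder()
})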

npatki commented 3 weeks ago

Hi @jalr4ever, that's unexpected. Would you be able to file a new issue for this? In the new issue, it would be helpful if you could also provide the code you used to update the column(s) to UniformEncoder.