Closed thaddywu closed 1 year ago
Hi @thaddywu, thanks for this detailed example. Very useful for us!
It seems that the processed_data is correctly using the assigned Label Encoder transformer. If I inspect it, I see that the column is now fully numerical, with floats.
The problem is that SDV continues to tell the underlying CTGAN model that the data is discrete (even though it's numerical now). So CTGAN ends up treating each floating point value (e.g. 0.709244) as its own categorical value. Not what we want!
Let's keep this issue open until we track a fix for the bug.
You can still get the desired outcome by using the RDT library outside of the SDV to do your own pre/post processing. Something like this:
from rdt import HyperTransformer
from rdt.transformers.categorical import LabelEncoder
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# preprocess the data ourselves using the RDT library
ht = HyperTransformer()
ht.detect_initial_config(df)
ht.update_transformers(column_name_to_transformer={
    'data': LabelEncoder(add_noise=True)
})
use_df = ht.fit_transform(df)

# SDV can just handle the processed, numerical data
metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'data': { 'sdtype': 'numerical' }
    }
})
synthesizer = CTGANSynthesizer(metadata, epochs=1)
synthesizer.fit(use_df)
raw_synthetic_data = synthesizer.sample(num_rows=10)

# post process using the RDT
synthetic_data = ht.reverse_transform(raw_synthetic_data)
We use the method detect_discrete_columns to identify which columns are categorical, and then pass them along in CTGAN's fit function.
The detect_discrete_columns method makes some assumptions that are not true:
These assumptions are not true because if the user has assigned transformers themselves, then the output of those transformers may have changed which columns are discrete. Mainly:
Applying a categorical transformer would change a discrete column into a numerical column
I'm not sure how, but this should be factored in.
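One possible direction, sketched here purely as an illustration and not as SDV's actual implementation, is to decide which columns are discrete from the processed data itself rather than from the metadata. The function name `detect_discrete_from_processed` and its plain-dict input are hypothetical; they are not part of SDV or RDT.

```python
from numbers import Number


def detect_discrete_from_processed(processed_columns):
    """Return names of columns that still hold non-numeric values.

    `processed_columns` maps a column name to the list of values produced
    by the transformers. This input shape is a hypothetical stand-in for
    the processed table, not SDV's internal representation.
    """
    discrete = []
    for name, values in processed_columns.items():
        # bool is a subclass of int in Python, so treat it as non-numeric here
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if not is_numeric:
            discrete.append(name)
    return discrete


processed = {
    'data': [0.709244, 0.12, 0.55],         # label-encoded with noise: now floats
    'city': ['Boston', 'Paris', 'Boston'],  # left untransformed: still categorical
}
print(detect_discrete_from_processed(processed))
```

With this kind of check, a column that a user-assigned transformer has already turned into floats would no longer be reported to CTGAN as discrete.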
@npatki Could you please provide a workaround in your above comment for a more generalised case? Say there are n columns, of which m are numerical and p are categorical, and k of them hit this error.
Using the above approach converts all columns to numerical.
Hi @saswat0, my workaround is to generally use the RDT library to do your own pre and post processing. RDT allows you to choose which transformers to apply to which columns.
You can browse details of the RDT library here: https://docs.sdv.dev/rdt
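For the mixed-column case, a minimal sketch of how the metadata dictionary might be assembled: only the columns you pre-encoded yourself with RDT are declared numerical, while every other column keeps its original sdtype. The helper name `build_metadata_dict` and the example column names are hypothetical. It also assumes your own preprocessing leaves the remaining columns untouched.

```python
def build_metadata_dict(sdtypes, pre_encoded):
    """Build a metadata dict where pre-encoded columns are declared numerical.

    sdtypes: column name -> original sdtype ('numerical', 'categorical', ...)
    pre_encoded: names of columns already converted to floats with RDT
    """
    unknown = set(pre_encoded) - set(sdtypes)
    if unknown:
        raise KeyError(f'Columns not present in sdtypes: {sorted(unknown)}')

    columns = {}
    for name, sdtype in sdtypes.items():
        # Only the columns we encoded ourselves change type; the rest are kept
        columns[name] = {'sdtype': 'numerical' if name in pre_encoded else sdtype}
    return {'columns': columns}


metadata_dict = build_metadata_dict(
    sdtypes={'age': 'numerical', 'city': 'categorical', 'data': 'categorical'},
    pre_encoded={'data'},
)
print(metadata_dict)
```

The resulting dictionary has the same shape as the one passed to SingleTableMetadata.load_from_dict in the workaround earlier in this thread.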
Got it! Thanks for resolving this @npatki
Hi everyone, it seems like there are multiple problems being discussed here so I've split it up into several issues.
I'm marking this issue as a duplicate in favor of the above two. If you have more thoughts, please feel free to continue the conversation in either of the above open issues. Thanks!
When I was using CTGAN, I also encountered memory issues (https://github.com/sdv-dev/SDV/issues/2189#issuecomment-2307299551 ) caused by categorical columns. I then updated its encoder to UniformEncoder (similar to GaussianCopula), but the result was the same as this issue; it still didn't work and memory usage remained high. This is really strange. In CTGAN, can data be processed from a Transformer perspective? For example, for types like Frequency, could we use numerical methods instead of distinguishing based on whether metadata is discrete or continuous?
Hi @jalr4ever, that's unexpected. Would you be able to file a new issue for this? In the new issue, it would be helpful if you could also provide the code you used to update the column(s) to UniformEncoder.
Environment Details
Error Description
I created a large table with 300,000 rows and 1 categorical column. When using CTGAN and specifying LabelEncoder as the transformer, SDV attempts to allocate memory for a (300000, 300000) array, even when the transformation is done before fitting. SDV seems to use one-hot encoding for transformed categorical columns. From my perspective, SDV should treat transformed categorical columns as floating point numbers in this case. I'm wondering whether my understanding is correct. This issue is related to #1381.
Moreover, GaussianCopula works correctly.
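To get a rough sense of why this allocation fails, here is a back-of-the-envelope sketch of the size of a dense (300000, 300000) array. The element sizes are assumptions, since the dtype of the array SDV tries to allocate is not shown in this report.

```python
def dense_matrix_bytes(n_rows, n_categories, bytes_per_value):
    """Size in bytes of a dense one-hot style matrix."""
    return n_rows * n_categories * bytes_per_value


n_rows = 300_000
# If every noisy float is unique, there is one "category" per row
n_categories = 300_000

# Element sizes below are assumed; the actual dtype is not given in the report
for label, size in [('1-byte', 1), ('float32', 4), ('float64', 8)]:
    total = dense_matrix_bytes(n_rows, n_categories, size)
    print(f'{label}: {total / 1e9:.0f} GB')
```

Even at one byte per element, the array would need about 90 GB, which is consistent with the memory error described here.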
Steps to reproduce
Code snippet
Console Output
Error
Thank you! ;)