sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Not working with Discrete_columns containing integers #24

Closed oregonpillow closed 4 years ago

oregonpillow commented 4 years ago

Description

The definition of discrete columns on the homepage is correct, stating that discrete columns can indeed be integers or strings. However, in practice I have not found CTGANSynthesizer to work with discrete_columns that contain integers.

What I Did

Using the Census demo dataset, I looked at how many unique values there are for each column.

age                73
workclass           9
fnlwgt          21648
education          16
education-num      16
marital-status      7
occupation         15
relationship        6
race                5
sex                 2
capital-gain      119
capital-loss       92
hours-per-week     94
native-country     42
income              2
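These counts can be reproduced with a one-liner on the demo DataFrame (a minimal snippet, assuming load_demo returns a pandas DataFrame, as in the sessions further down):

from ctgan import load_demo

data = load_demo()

# number of unique values per column
print(data.nunique())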

With the exception of 'fnlwgt', which is clearly continuous, it seems odd to me that integer columns like education-num, hours-per-week, capital-loss, capital-gain and even age are not added to discrete_columns too. As a very general rule, if a column contains less than, say, 5% unique values, I'd say it's pretty likely to be discrete in most cases.

Regardless, if I list any integer column within discrete_columns I get errors. For example, if I add 'education-num' to the discrete_columns list I get this error:

ValueError: could not convert string to float: ' Never-married'

This is strange since the error is not associated with 'education-num', which I just added, but with the 'marital-status' column.

Are there any examples of CTGAN working with discrete integer columns? It seems that the demo definition of discrete is any column containing strings.

csala commented 4 years ago

Hello @oregonpillow

On one side, I'm afraid I cannot reproduce the error that you mention when using integer discrete columns.

Here you have the log of an ipython session where I simply added the education-num column as you were suggesting and was able to fit and sample without issues:

In [1]: from ctgan import load_demo 
   ...:  
   ...: data = load_demo()                                                                                                            

In [2]: discrete_columns = [ 
   ...:     'education-num', 
   ...:     'workclass', 
   ...:     'education', 
   ...:     'marital-status', 
   ...:     'occupation', 
   ...:     'relationship', 
   ...:     'race', 
   ...:     'sex', 
   ...:     'native-country', 
   ...:     'income' 
   ...: ]                                                                                                                             

In [3]: from ctgan import CTGANSynthesizer 
   ...:  
   ...: ctgan = CTGANSynthesizer() 
   ...: ctgan.fit(data, discrete_columns, epochs=5)                                                                                   
Epoch 1, Loss G: 2.4979, Loss D: -0.2276
Epoch 2, Loss G: 1.9196, Loss D: -0.0148
Epoch 3, Loss G: 1.3274, Loss D: 0.0512
Epoch 4, Loss G: 0.7159, Loss D: -0.0321
Epoch 5, Loss G: 0.7812, Loss D: 0.0776

In [4]: ctgan.sample(5)                                                                                                               
Out[4]: 
       age          workclass  fnlwgt      education  ... capital-loss hours-per-week       native-country  income
0  45.7425            Private  183429        Masters  ...     -5.55031        32.2947                 Hong    >50K
1  34.1148   Self-emp-not-inc  195174        Masters  ...     -5.06892        48.4982   Dominican-Republic    >50K
2  48.1483            Private   81730     Assoc-acdm  ...      3.79141          52.36        United-States   <=50K
3  55.9176            Private  168474   Some-college  ...      -5.9418        47.9766              Germany    >50K
4  43.7323            Private  147161   Some-college  ...     -1.53701        64.9714        United-States    >50K

[5 rows x 15 columns]

Please try to reproduce this same code on your side to see if it also works for you. If it does, please let us know if you did something different, to see if there is truly an error in the CTGAN code.

On the other side, the logic behind identifying a column as discrete is not about the % of unique values within the column, but rather about whether the column has any implicit order on its values. For example, if you have an integer that represents the age of a person in years, which is clearly a numerical variable, the maximum number of unique values that you will find will be around 100 at most, no matter how many data points you have. In this case, the % of unique values depends only on the number of data points, not on the nature of the data: you can make it as low as you want just by making the data sample larger.
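As a quick illustration of that point (a standalone snippet, not related to CTGAN itself): ages are bounded, so the share of unique values can only shrink as the sample grows.

import numpy as np

for n in (1_000, 100_000, 10_000_000):
    ages = np.random.randint(0, 100, size=n)
    # absolute count stays capped at 100, the percentage keeps dropping
    print(n, np.unique(ages).size, np.unique(ages).size / n)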

Or, giving another twist to it, you can think of a numerical column as a column for which differences between rows tend to be proportional to the differences between values. For example, the salary of a person who works 40 hours a week will tend to be very similar to the salary of a person with the same attributes who works 41 hours a week, but very different from that of a person who works 10 hours a week. The same does not happen with something like, for example, the occupation: even if the occupations had been encoded as integer values (which you can sort), differences in salary would not be proportional to the differences between the integers used to encode the variable.

So, in other words, even though CTGAN can treat them as discrete, all the columns that you are mentioning are clearly numerical and should not be included in the discrete_columns list for optimal performance.

I hope this clarifies the doubts!

oregonpillow commented 4 years ago

Thanks @csala for your detailed response, and sorry about the slow reply. I knew I would get axed for thinking of discrete columns the way I did :) haha. You're quite right though, and thank you for your great examples. Your second example in particular was really helpful conceptually.

I tried re-creating the error. Something important I forgot to mention is that I've been using Google Colab, which I'm not sure makes a difference? I'm doing a fresh pip install of CTGAN before running.

I just tried again and the error is manifesting itself again. Here is a link if you want to try yourself: https://colab.research.google.com/drive/1DN3Yx8X2xcxrSWXpcXvJaXc5BBDHvvE-

[1] %%capture
!pip install ctgan

[2] !pip install ipython-autotime
%load_ext autotime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ctgan import load_demo, CTGANSynthesizer

[3] data = load_demo()

[4] discrete_columns = [
    'education-num'                
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

[5] ctgan = CTGANSynthesizer()

[6] ctgan.fit(data, discrete_columns,epochs=5)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-fa0a7a3619b5> in <module>()
----> 1 ctgan.fit(data, discrete_columns,epochs=5)

8 frames
/usr/local/lib/python3.6/dist-packages/ctgan/synthesizer.py in fit(self, train_data, discrete_columns, epochs, log_frequency)
    116 
    117         self.transformer = DataTransformer()
--> 118         self.transformer.fit(train_data, discrete_columns)
    119         train_data = self.transformer.transform(train_data)
    120 

/usr/local/lib/python3.6/dist-packages/ctgan/transformer.py in fit(self, data, discrete_columns)
     73                 meta = self._fit_discrete(column, column_data)
     74             else:
---> 75                 meta = self._fit_continuous(column, column_data)
     76 
     77             self.output_info += meta['output_info']

/usr/local/lib/python3.6/dist-packages/sklearn/utils/_testing.py in wrapper(*args, **kwargs)
    325             with warnings.catch_warnings():
    326                 warnings.simplefilter("ignore", self.category)
--> 327                 return fn(*args, **kwargs)
    328 
    329         return wrapper

/usr/local/lib/python3.6/dist-packages/ctgan/transformer.py in _fit_continuous(self, column, data)
     33             n_init=1
     34         )
---> 35         gm.fit(data)
     36         components = gm.weights_ > self.epsilon
     37         num_components = components.sum()

/usr/local/lib/python3.6/dist-packages/sklearn/mixture/_base.py in fit(self, X, y)
    190         self
    191         """
--> 192         self.fit_predict(X, y)
    193         return self
    194 

/usr/local/lib/python3.6/dist-packages/sklearn/mixture/_base.py in fit_predict(self, X, y)
    217             Component labels.
    218         """
--> 219         X = _check_X(X, self.n_components, ensure_min_samples=2)
    220         self._check_initial_parameters(X)
    221 

/usr/local/lib/python3.6/dist-packages/sklearn/mixture/_base.py in _check_X(X, n_components, n_features, ensure_min_samples)
     51     """
     52     X = check_array(X, dtype=[np.float64, np.float32],
---> 53                     ensure_min_samples=ensure_min_samples)
     54     if n_components is not None and X.shape[0] < n_components:
     55         raise ValueError('Expected n_samples >= n_components '

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: could not convert string to float: ' State-gov'
time: 5.55 s

oregonpillow commented 4 years ago

@csala, another observation I made is that in the synthetic output of the demo data, the 'capital-loss' and 'capital-gain' columns contain a lot of negative float values after fitting, since we treat them as continuous variables in the input.

But in the original data the majority of the values in 'capital-loss' and 'capital-gain' are zeros, punctuated only occasionally by a few large numbers.

Normally, integer columns that get converted to floats during fitting are easy to correct outside of CTGAN. But in this example, even if I round these negative floats, I'm still left with negative values which, in the context of these columns, do not make sense! This makes the synthetic data very easy to distinguish from the original data.

original data:

[screenshot: 'capital-gain' / 'capital-loss' in the original data, mostly zeros with occasional large values]

after fitting:

[screenshot: synthetic 'capital-gain' / 'capital-loss', including negative float values]

Is there any solution for this you know of?

csala commented 4 years ago

> I just tried again and the error is manifesting itself again. Here is a link if you want to try yourself: https://colab.research.google.com/drive/1DN3Yx8X2xcxrSWXpcXvJaXc5BBDHvvE-

Ok, I'm afraid that in this case the actual problem was just a typo :-)

Notice that after the 'education-num' string there is no comma, which makes Python implicitly concatenate 'education-num' and 'workclass' into a single string:

[4] discrete_columns = [
    'education-num'                
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native-country',
    'income'
]

So, what is actually happening is that CTGAN is looking for a discrete column called 'education-numworkclass' and skipping the 'workclass' column, which is why the raw 'workclass' string values end up being passed to the continuous transformer and trigger the could not convert string to float error.
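This is standard Python behaviour: adjacent string literals are concatenated implicitly. A quick interpreter check makes it visible:

In [1]: ['education-num' 'workclass', 'education']
Out[1]: ['education-numworkclass', 'education']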

I'll open a new issue to take care of this and validate that the input column names are all valid, instead of silently ignoring them.
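As a rough sketch (not the actual CTGAN implementation), such a check at the top of fit could look like this, assuming train_data is a pandas DataFrame:

# reject discrete column names that do not exist in the training data
invalid_columns = [column for column in discrete_columns if column not in train_data.columns]
if invalid_columns:
    raise ValueError('Invalid discrete columns: {}'.format(invalid_columns))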

oregonpillow commented 4 years ago

@csala Thanks for finding my embarrassing typo ;-) and thanks for explaining what's happening in the background with the error. That's good to know. I agree - if we could implement column name validation at the very beginning of fitting, I think this would be a good fix, and it would save others a lot of time in the future if they have lots of discrete columns to specify.

Did you get a chance to look at the other problem I described above with the capital-loss and capital-gain columns? The synthetic output of these columns should be all zeros, punctuated only occasionally by some large numbers (like in the original dataset). However, the synthetic output contains a bunch of negative values.

Is there a solution for this? (Or should I move this to a new issue?)

It seems that this is a problem with continuous variables that have many repeated values: the model gets the overall distribution of capital-loss / capital-gain roughly correct, but does so by substituting zeros with non-zero values (including negative values!).

Baukebrenninkmeijer commented 4 years ago

@oregonpillow This occurs because Gaussians are fitted to the distribution of each continuous column. A column with a lot of zeros will have a Gaussian fitted around 0, which will inherently result in some negative values.
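A small standalone illustration of that effect, using scikit-learn's GaussianMixture directly rather than CTGAN's internal transformer (the numbers here are made up):

import numpy as np
from sklearn.mixture import GaussianMixture

# mostly zeros with a few large spikes, like capital-gain
column = np.concatenate([np.zeros(990), np.full(10, 15000.0)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(column)
samples, _ = gm.sample(1000)

# the component fitted around 0 inevitably produces some values below zero
print((samples < 0).sum())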

In my own practice, I've been detecting the min and max of each continuous column during metadata extraction (transformer.py) and clipping the resulting values to that range. This limits these anomalies a bit, but it does remove some of the probabilistic variation that this technique produced: previously, a synthetic data point could have a higher capital-gain than anyone in the real data, whereas with this limitation that option is gone.
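A rough sketch of the same idea applied as post-processing after sampling, using the data and ctgan objects from the session above (hypothetical code, not the transformer.py change itself):

# record per-column bounds observed in the real data
bounds = {column: (data[column].min(), data[column].max())
          for column in ['capital-gain', 'capital-loss']}

# clip the synthetic values back into the observed range
synthetic = ctgan.sample(1000)
for column, (low, high) in bounds.items():
    synthetic[column] = synthetic[column].clip(low, high)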

zadamg commented 4 years ago

Adding to this conversation - using CTGAN gives me negative numbers where they should not be (cost totals, video counts, etc.). I don't want to sound dumb, but... how can I make this usable with all these negative numbers? The training data is naturally skewed with many 0s (truly, one of the reasons I need synthetic data). It may be nice to add some sort of keyword arg to force positive outcomes somehow? I'd love to use this package but can't work with the negatives.


csala commented 4 years ago

Hi @zgirson thanks for reporting this.

We are working on it from the SDV side: https://github.com/sdv-dev/SDV/issues/200

SDV already provides an interim workaround which is explained on the issue itself, so please give it a try and feel free to follow up there if you have any doubt.