Closed oregonpillow closed 4 years ago
Hello @oregonpillow
On the one hand, I'm afraid I cannot reproduce the error that you mention when using integer discrete columns.
Here is the log of an IPython session where I simply added the education-num
column as you suggested, and was able to fit and sample without issues:
In [1]: from ctgan import load_demo
...:
...: data = load_demo()
In [2]: discrete_columns = [
...: 'education-num',
...: 'workclass',
...: 'education',
...: 'marital-status',
...: 'occupation',
...: 'relationship',
...: 'race',
...: 'sex',
...: 'native-country',
...: 'income'
...: ]
In [3]: from ctgan import CTGANSynthesizer
...:
...: ctgan = CTGANSynthesizer()
...: ctgan.fit(data, discrete_columns, epochs=5)
Epoch 1, Loss G: 2.4979, Loss D: -0.2276
Epoch 2, Loss G: 1.9196, Loss D: -0.0148
Epoch 3, Loss G: 1.3274, Loss D: 0.0512
Epoch 4, Loss G: 0.7159, Loss D: -0.0321
Epoch 5, Loss G: 0.7812, Loss D: 0.0776
In [4]: ctgan.sample(5)
Out[4]:
age workclass fnlwgt education ... capital-loss hours-per-week native-country income
0 45.7425 Private 183429 Masters ... -5.55031 32.2947 Hong >50K
1 34.1148 Self-emp-not-inc 195174 Masters ... -5.06892 48.4982 Dominican-Republic >50K
2 48.1483 Private 81730 Assoc-acdm ... 3.79141 52.36 United-States <=50K
3 55.9176 Private 168474 Some-college ... -5.9418 47.9766 Germany >50K
4 43.7323 Private 147161 Some-college ... -1.53701 64.9714 United-States >50K
[5 rows x 15 columns]
Please try to reproduce this same code on your side to see if it also works for you. If it does, please let us know what you did differently, so we can see whether there is truly an error in the CTGAN code.
On the other hand, the logic behind identifying a column as discrete is not about the % of unique values within the column, but rather about whether the column has any implicit order on its values. For example, if you have an integer that represents the age of a person in years, which is clearly a numerical variable, the maximum number of unique values that you will find will be around 100 at most, no matter how many data points you have. It's obvious, in this case, that the % of unique values depends only on the number of data points, and not on the nature of the data: you can make it as low as you want just by making the data sample larger.
Or, giving it another twist, you can think of a numerical column as one for which differences between rows tend to be proportional to the differences between values. For example, the salary of a person who works 40 hours a week will tend to be very similar to the salary of a person with the same attributes who works 41 hours a week, but very different from that of a person who works 10 hours a week. The same does not happen with something like, for example, the occupation: even if the occupations had been encoded as integer values (which you can sort), differences in salary would not be proportional to the differences between the integers used to encode the variable.
So, in other words, even though CTGAN supports treating them as discrete, all the columns that you mention are clearly numerical and should not be included in the discrete_columns
list for optimal performance.
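As a quick illustration of the first point (a sketch added here, not part of the original reply), the share of unique values in a numerical column such as age shrinks as the sample grows, even though the nature of the column never changes:

```python
import numpy as np
import pandas as pd

# Simulated 'age' column: a numerical variable with at most ~74 distinct
# values (17..90), regardless of how many rows we draw.
rng = np.random.default_rng(0)
for n in (100, 1_000, 100_000):
    ages = pd.Series(rng.integers(17, 91, size=n))
    pct = 100 * ages.nunique() / n
    print(f"n={n:>6}: {ages.nunique():>3} unique values ({pct:.2f}% of rows)")
```

The percentage drops by orders of magnitude as n grows, which is why a "% unique" threshold cannot distinguish numerical from discrete columns.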
I hope this clarifies the doubts!
Thanks @csala for your detailed response, and sorry about the slow reply. I knew I would get axed for thinking of discrete columns like I did :) haha. You're quite right though, and thank you for your great examples. Your second example in particular was really helpful conceptually.
I tried re-creating the error. Something important I forgot to mention is that I've been using Google Colab, which I'm not sure makes a difference? I'm doing a fresh pip install of CTGAN before running.
I just tried again and the error is manifesting itself again. Here is a link if you want to try yourself: https://colab.research.google.com/drive/1DN3Yx8X2xcxrSWXpcXvJaXc5BBDHvvE-
[1] %%capture
!pip install ctgan
[2] !pip install ipython-autotime
%load_ext autotime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
[3] from ctgan import load_demo
data = load_demo()
[4] discrete_columns = [
'education-num'
'workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country',
'income'
]
[5] from ctgan import CTGANSynthesizer
ctgan = CTGANSynthesizer()
[6] ctgan.fit(data, discrete_columns,epochs=5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-fa0a7a3619b5> in <module>()
----> 1 ctgan.fit(data, discrete_columns,epochs=5)
/usr/local/lib/python3.6/dist-packages/ctgan/synthesizer.py in fit(self, train_data, discrete_columns, epochs, log_frequency)
116
117 self.transformer = DataTransformer()
--> 118 self.transformer.fit(train_data, discrete_columns)
119 train_data = self.transformer.transform(train_data)
120
/usr/local/lib/python3.6/dist-packages/ctgan/transformer.py in fit(self, data, discrete_columns)
73 meta = self._fit_discrete(column, column_data)
74 else:
---> 75 meta = self._fit_continuous(column, column_data)
76
77 self.output_info += meta['output_info']
/usr/local/lib/python3.6/dist-packages/sklearn/utils/_testing.py in wrapper(*args, **kwargs)
325 with warnings.catch_warnings():
326 warnings.simplefilter("ignore", self.category)
--> 327 return fn(*args, **kwargs)
328
329 return wrapper
/usr/local/lib/python3.6/dist-packages/ctgan/transformer.py in _fit_continuous(self, column, data)
33 n_init=1
34 )
---> 35 gm.fit(data)
36 components = gm.weights_ > self.epsilon
37 num_components = components.sum()
/usr/local/lib/python3.6/dist-packages/sklearn/mixture/_base.py in fit(self, X, y)
190 self
191 """
--> 192 self.fit_predict(X, y)
193 return self
194
/usr/local/lib/python3.6/dist-packages/sklearn/mixture/_base.py in fit_predict(self, X, y)
217 Component labels.
218 """
--> 219 X = _check_X(X, self.n_components, ensure_min_samples=2)
220 self._check_initial_parameters(X)
221
/usr/local/lib/python3.6/dist-packages/sklearn/mixture/_base.py in _check_X(X, n_components, n_features, ensure_min_samples)
51 """
52 X = check_array(X, dtype=[np.float64, np.float32],
---> 53 ensure_min_samples=ensure_min_samples)
54 if n_components is not None and X.shape[0] < n_components:
55 raise ValueError('Expected n_samples >= n_components '
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", copy=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dtype)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: ' State-gov'
time: 5.55 s
@csala, another observation: in the synthetic output of the demo data, the 'capital-loss' and 'capital-gain' columns contain a lot of negative float values after fitting, since we treat them as continuous variables in the input.
But in the original data the majority of the values in 'capital-loss' and 'capital-gain' are zeros, punctuated only occasionally by a few large numbers.
Normally, integer columns that get converted to floats during fitting are easy to correct outside of CTGAN. But in this example, even if I round these negative floats, I'm still left with negative values, which in the context of these columns do not make sense! This makes the synthetic data very easy to distinguish from the original data.
original data:
after fitting:
Is there any solution for this you know of?
I just tried again and the error is manifesting itself again. Here is a link if you want to try yourself: https://colab.research.google.com/drive/1DN3Yx8X2xcxrSWXpcXvJaXc5BBDHvvE-
Ok, I'm afraid that in this case the actual problem was just a typo :-)
Notice that after the 'education-num' string there is no comma, which makes Python interpret 'education-num' and 'workclass' as two parts of the same string:
[4] discrete_columns = [ 'education-num' 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income' ]
So, what is actually happening is that CTGAN is looking for a discrete column called 'education-numworkclass' and skipping the 'workclass' column, treating it as continuous, which is why raw workclass string values end up being passed to the GaussianMixture model.
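For anyone who wants to see the mechanism in isolation: Python implicitly concatenates adjacent string literals at parse time, so the missing comma silently merges the two entries instead of raising any error (a minimal standalone demo, not the original notebook):

```python
# Adjacent string literals are concatenated at parse time, so the list
# below has 2 elements, not 3 -- and no error is ever raised.
discrete_columns = [
    'education-num'   # <-- missing comma
    'workclass',
    'education',
]
print(discrete_columns)       # ['education-numworkclass', 'education']
print(len(discrete_columns))  # 2
```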
I'll open a new issue to validate that the provided column names actually exist in the data, instead of silently ignoring them.
@csala Thanks for finding my embarrassing typo ;-) and thanks for explaining what's happening in the background with the error. That's good to know. I agree: if we could implement column-name validation at the very beginning of fitting, I think this would be a good fix! It would save others a lot of time in the future if they have many discrete columns to specify.
Did you get a chance to look at the other problem example I gave above with the capital-loss, capital-gain columns? The synthetic output of these columns should be all zeros punctuated only occasionally with some large numbers (like in the original dataset). However, the synthetic output has a bunch of negative values in.
Is there a solution for this? (or should i move this to a new issue?)
It seems that this is a problem with continuous variables that have many repeating values; the model gets the overall distribution of capital-loss / capital-gain right, but does so by substituting zeros with non-zero values (including negative ones!)
@oregonpillow This occurs because, for the continuous columns, Gaussians are fitted to the distribution. A column with a lot of zeros will have a Gaussian fitted around 0, which inherently results in negative values.
In my practice, I've been detecting the min and max of continuous columns during metadata extraction (transformer.py) and clipping the sampled values to that range. This limits these anomalies a bit, but it also removes some of the probabilistic variation this technique produced: previously, a synthetic data point could have a higher capital-gain than anyone in the real data, but with this limitation that option is gone.
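A post-processing sketch of that clipping idea (the function name and placement outside of CTGAN are my own; the change described above would live inside transformer.py):

```python
import pandas as pd

def clip_to_observed_range(real, synthetic, columns):
    """Clip synthetic continuous columns to the min/max seen in the real data.

    Removes impossible values (e.g. negative capital-gain), at the cost of
    never sampling outside the range observed during training.
    """
    out = synthetic.copy()
    for col in columns:
        out[col] = out[col].clip(real[col].min(), real[col].max())
    return out

# Hypothetical example: negative synthetic capital-gain values are pulled up
# to the observed minimum (0), and overshoots are pulled down to the maximum.
real = pd.DataFrame({'capital-gain': [0, 0, 0, 5000]})
synth = pd.DataFrame({'capital-gain': [-120.5, 300.0, 9999.0]})
print(clip_to_observed_range(real, synth, ['capital-gain']))
```

Note that clipping piles all out-of-range samples onto the two boundary values, so the zeros-heavy shape of the original column is only partially recovered.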
Adding to this conversation: using CTGAN gives me negative numbers where there should not be any (cost totals, video counts, etc.). I don't want to sound dumb but... how can I make this usable with all these negative numbers? The train data is naturally skewed with many 0s (truly, one of the reasons I need synthetic data). It may be nice to add some sort of keyword arg to force positive outcomes somehow? I'd love to use this package but can't work with the negatives.
Hi @zgirson thanks for reporting this.
We are working on it from the SDV side: https://github.com/sdv-dev/SDV/issues/200
SDV already provides an interim workaround which is explained on the issue itself, so please give it a try and feel free to follow up there if you have any doubt.
Description
The definition of discrete columns on the homepage is correct, stating that discrete columns can indeed be integers or strings. However, in practice I have not found the CTGANSynthesizer to work with discrete_columns that contain integers.
What I Did
Using the Census demo dataset, I looked at how many unique values there are for each column:
age                73
workclass           9
fnlwgt          21648
education          16
education-num      16
marital-status      7
occupation         15
relationship        6
race                5
sex                 2
capital-gain      119
capital-loss       92
hours-per-week     94
native-country     42
income              2
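For reference, counts like these can be obtained with pandas' nunique (shown here on a toy stand-in frame; on the real census data you would call it on the result of ctgan's load_demo()):

```python
import pandas as pd

# Toy stand-in for the census demo table; on the real data you would use
# `from ctgan import load_demo; data = load_demo()` instead.
data = pd.DataFrame({
    'age': [39, 50, 38, 53, 28],
    'workclass': [' State-gov', ' Private', ' Private', ' Private', ' Private'],
    'sex': [' Male', ' Male', ' Male', ' Male', ' Female'],
})
print(data.nunique())  # number of distinct values per column
```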
With the exception of 'fnlwgt', which is clearly continuous, it seems odd to me that integer columns like education-num, hours-per-week, capital-loss, capital-gain and even age are not added to discrete_columns too. As a very general rule, if a column contains less than, say, 5% unique values, I'd say it's pretty likely to be discrete in most cases.
Regardless, if I list any integer column within discrete_columns I get errors. For example, if I add 'education-num' to the discrete_columns list I get an error. This is strange, since the error is not associated with 'education-num', which I just added, but with the 'marital-status' column.
Are there any examples of CTGAN working with discrete integer columns? It seems that the demo's definition of discrete is any column containing strings.