sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Passing integer columns gives error in numerical_formatter #1335

Closed Scherzan closed 1 year ago

Scherzan commented 1 year ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

Fitting CTGANSynthesizer on data that contains a column with the pandas dtype 'Int64' raises a TypeError in numerical_formatter.py (line 57), at the statement roundable_data = data[~(np.isinf(data) | pd.isna(data))]. The full traceback is below. I would expect data of type 'Int64' to be supported, because the documentation states support for 'Int64' and it is an integer format that allows NaN values. Converting the column with data.astype('int64') raises IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer. Using data.astype('float') works fine.
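For readers unfamiliar with the three dtypes mentioned above, here is a minimal pandas-only sketch of how each cast treats a missing value. It does not involve SDV at all, and the variable names are illustrative assumptions rather than anything from the report.

```python
import numpy as np
import pandas as pd

# The NaN forces pandas to store this column as float64.
values = pd.Series([25, 30, np.nan, 40, 45])

# Casting to float keeps the missing value as NaN.
as_float = values.astype("float")

# Casting to the nullable 'Int64' extension dtype stores the gap as pd.NA.
as_nullable = values.astype("Int64")

# Casting to NumPy 'int64' cannot represent the missing value and raises.
# IntCastingNaNError is a subclass of ValueError, so ValueError is caught here.
try:
    values.astype("int64")
    int64_cast_failed = False
except ValueError:
    int64_cast_failed = True

print(as_float.dtype, as_nullable.dtype, int64_cast_failed)
```

Only the nullable 'Int64' variant is the one that triggers the error reported in this issue.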

Steps to reproduce

Run the code below with sdv-beta installed.

import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

data = pd.DataFrame({'Categoricalvalues': ['John', 'Deep', 'Julia', 'Kate', 'Sandy'],
                     'Integervalues': [25, 30, np.nan, 40, 45],
                     'Floatvalues': [1.0, 2.0, 5.0, 3.0, 9.0]})
print(data.dtypes)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=data)

# Cast to the pandas nullable integer dtype; this is what triggers the error.
data['Integervalues'] = data['Integervalues'].astype('Int64')  # also tried: 'int64', 'float'
metadata.update_column(
    column_name='Integervalues',
    sdtype='numerical',
    computer_representation='Int64')

metadata.update_column(
    column_name='Floatvalues',
    sdtype='numerical',
    computer_representation='Float')

model = CTGANSynthesizer(metadata)
model.fit(data)

Error:
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[37], line 21
     15 metadata.update_column(
     16 column_name='Floatvalues',
     17 sdtype='numerical',
     18 computer_representation='Float')
     20 model = CTGANSynthesizer(metadata)
---> 21 model.fit(data)

File ~/anaconda3/envs/pycon_demo/lib/python3.10/site-packages/sdv/single_table/base.py:456, in BaseSynthesizer.fit(self, data)
    454 self._data_processor.reset_sampling()
    455 self._random_state_set = False
--> 456 processed_data = self._preprocess(data)
    457 self.fit_processed_data(processed_data)

File ~/anaconda3/envs/pycon_demo/lib/python3.10/site-packages/sdv/single_table/base.py:403, in BaseSynthesizer._preprocess(self, data)
    401 def _preprocess(self, data):
    402     self.validate(data)
--> 403     self._data_processor.fit(data)
    404     return self._data_processor.transform(data)

File ~/anaconda3/envs/pycon_demo/lib/python3.10/site-packages/sdv/data_processing/data_processor.py:603, in DataProcessor.fit(self, data)
    596 """Fit this metadata to the given data.
    597 
...
---> 57 roundable_data = data[~(np.isinf(data) | pd.isna(data))]
     59 # Doesn't contain numbers
     60 if len(roundable_data) == 0:

TypeError: ufunc 'isinf' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

npatki commented 1 year ago

Hi @Scherzan, nice to meet you and thanks for filing the issue with the detailed information.

I believe this issue may be a duplicate of #1154, as we do not support the pandas Int64 type.

As long as you specify Int64 in the metadata, you should not need to manually convert the dataframe yourself. The SDV will know that the values should be whole numbers represented by 64 bits.
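If a dataframe already contains pandas nullable integer columns, one way to follow this advice is to cast them back to plain float64 before fitting and leave the whole-number requirement to the metadata. The sketch below is pandas-only; `nullable_ints_to_float` is a hypothetical helper written for illustration and is not part of SDV.

```python
import numpy as np
import pandas as pd


def nullable_ints_to_float(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which pandas nullable integer columns are float64.

    Hypothetical helper, not an SDV function. pd.NA becomes np.nan.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        is_nullable_int = (
            pd.api.types.is_extension_array_dtype(dtype)
            and pd.api.types.is_integer_dtype(dtype)
        )
        if is_nullable_int:
            out[name] = out[name].astype("float64")
    return out


data = pd.DataFrame({
    "Integervalues": pd.array([25, 30, None, 40, 45], dtype="Int64"),
    "Floatvalues": [1.0, 2.0, 5.0, 3.0, 9.0],
})
clean = nullable_ints_to_float(data)
print(clean.dtypes)
```

The resulting `clean` dataframe could then be passed to the synthesizer's `fit` call from the report, with `computer_representation='Int64'` still set in the metadata for that column.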

Note that the computer_representation parameter in the metadata does not refer to pandas dtypes. It refers to how many bits are used to store the data, to ensure that there are no overflow errors.
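To make the bit-width idea concrete, here is a small sketch that checks whether a set of values fits in a given signed integer width. The label-to-type mapping and the `fits` helper are assumptions made for illustration; they do not show how SDV itself implements the check.

```python
import numpy as np

# Assumed mapping from representation labels to NumPy integer types (illustrative only).
REPRESENTATIONS = {
    "Int8": np.int8,
    "Int16": np.int16,
    "Int32": np.int32,
    "Int64": np.int64,
}


def fits(values, representation):
    """Return True if every non-missing value lies in the range of the given width."""
    info = np.iinfo(REPRESENTATIONS[representation])
    present = [v for v in values if not np.isnan(v)]
    return all(info.min <= v <= info.max for v in present)


ages = [25, 30, float("nan"), 40, 45]
print(fits(ages, "Int8"))        # all values lie within -128..127
print(fits([25, 300], "Int8"))   # 300 exceeds the 8-bit signed maximum of 127
print(fits([25, 300], "Int16"))  # 300 lies within -32768..32767
```

Missing values are skipped in the check, which reflects the point above: the representation describes the storage range of the numbers, not a pandas dtype.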

Scherzan commented 1 year ago

Hi @npatki, it's great to meet you too! Thank you for your friendly response. Hopefully next time I will check the open issues more thoroughly. Thank you for taking the time to respond so quickly and helpfully. Have a wonderful day!

npatki commented 1 year ago

Hi @Scherzan, no problem at all! Always here to help or clarify any questions you may have. Let us know if you run into any other issues.

FYI you can also join our Slack Community and post there if that's easier.