sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Support of pandas dtypes (needed for integers with missing values) #1154

Open nuldertien opened 1 year ago

nuldertien commented 1 year ago

Problem Description

I have a column in my dataset that contains integers and NaN values. To handle whole numbers (no decimals) together with missing values, I currently convert the column to the 'Int64' dtype, specifically pd.Int64Dtype(). However, after training an SDV model on data with this dtype, sampling fails with the error "Cannot interpret 'Int64Dtype()' as a data type".
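The quoted message has the same wording NumPy uses when it is handed a pandas extension dtype. The sketch below assumes only pandas and NumPy are installed; where exactly this happens inside SDV is not stated in this thread, and `numpy_can_interpret` is a hypothetical helper written for illustration.

```python
import numpy as np
import pandas as pd


def numpy_can_interpret(dtype):
    """Return (True, None) if np.dtype accepts `dtype`, else (False, error message)."""
    try:
        np.dtype(dtype)
    except TypeError as exc:
        return False, str(exc)
    return True, None


# A plain NumPy dtype name is accepted.
ok_float, _ = numpy_can_interpret("float64")

# The pandas nullable integer dtype is an extension dtype, not a NumPy dtype.
ok_nullable, message = numpy_can_interpret(pd.Int64Dtype())

print(ok_float, ok_nullable, message)
```

Any code path that passes a column's dtype straight to NumPy would hit this, which may explain why training succeeds but sampling fails.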

Expected behavior

Be able to support pandas dtypes such that I am able to train and sample on this kind of data.

Additional context

I transformed the column with .astype('Int64'), more specifically with round(pd.to_numeric(dataframe['column1'], errors='coerce')).astype('Int64'). The result looks like {'column1': [123500, 56832, <NA>]}, where the type() of each value is [np.int64, np.int64, pandas._libs.missing.NAType]. The metadata used is provided below.

"fields": { "column1": { "type": "numerical", "subtype": "integer" } }
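For anyone who wants to reproduce the column described above, here is a minimal sketch using only pandas. The raw input values are hypothetical (the original dataset is not shown in this issue); the third value is deliberately unparseable so that it becomes missing.

```python
import pandas as pd

# Hypothetical raw data: two whole numbers and one value that cannot be parsed.
dataframe = pd.DataFrame({"column1": ["123500", "56832", "not a number"]})

# errors="coerce" turns unparseable entries into NaN, giving a float64 column.
numeric = pd.to_numeric(dataframe["column1"], errors="coerce")

# Round, then cast to the nullable integer dtype so NaN becomes pd.NA.
dataframe["column1"] = numeric.round().astype("Int64")

print(dataframe["column1"].dtype)
print([type(value) for value in dataframe["column1"]])
```

The resulting column holds integers alongside `pd.NA`, which is the combination that a plain NumPy `int64` column cannot represent.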

npatki commented 1 year ago

Thanks for filing @nuldertien -- we'll keep this issue open for tracking purposes and communicating progress.

For anyone seeing this issue for the first time, here is a suggestion in the meantime:

The SDV is smart enough to recognize that all values in the column are whole numbers. So even if you leave the column as float64 for now, any decimals you see should always end in .0. While this is not ideal in terms of data representation, it should hopefully still give you usable synthetic data.
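A rough sketch of that workaround in pandas: keep the column as float64 for training, then convert the sampled output back to a nullable integer afterwards. The cast-back step is an addition for illustration, not something suggested in this thread; no SDV calls are shown, the `synthetic` frame is a hand-written stand-in for sampled output, and `to_nullable_int` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Real column: whole numbers plus a missing value, left as float64 for training.
real = pd.DataFrame({"column1": [123500.0, 56832.0, np.nan]})

# Stand-in for sampled output (hypothetical values, not produced by SDV).
synthetic = pd.DataFrame({"column1": [98765.0, np.nan, 60211.0]})


def to_nullable_int(series):
    """Round a float column and cast it to pandas' nullable Int64 dtype."""
    return series.round().astype("Int64")


# Post-process the sampled column so it matches the desired representation.
synthetic["column1"] = to_nullable_int(synthetic["column1"])

print(synthetic["column1"].dtype)
print(synthetic)
```

Rounding before the cast guards against pandas refusing to convert floats that are not exactly whole numbers.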

ryantimjohn commented 1 month ago

I just got this same error, so I'd like to point out that this is still an ongoing issue.