sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Extend metadata file with custom values representing missing data #751

Open MLjungg opened 2 years ago

MLjungg commented 2 years ago

Problem Description

Although it is bad practice, it is common to use a "placeholder value" to represent a missing value. For example, in a continuous column, a value such as "999" or "-1" may be used to represent missing data.

The current implementation of SDV can handle null values, but it only identifies missing values via data.isnull() in the NullTransformer. Hence, the user needs to convert placeholder values such as the ones mentioned above to real nulls before passing the data to SDV.
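To illustrate the gap, here is a minimal sketch using plain pandas. The column name `age` and the toy values are assumptions made up for this example. It shows that isnull() does not flag placeholder values, and what the manual conversion looks like today:

```python
import numpy as np
import pandas as pd

# Toy data (assumed for illustration): 999 and -1 stand in for missing entries.
data = pd.DataFrame({"age": [34, 999, 51, -1, 28]})

# isnull() finds nothing, because the placeholders are ordinary numbers.
missing_before = int(data["age"].isnull().sum())

# Manual workaround: convert the placeholders to NaN before passing data to SDV.
cleaned = data.copy()
cleaned["age"] = cleaned["age"].replace([999, -1], np.nan)
missing_after = int(cleaned["age"].isnull().sum())

print(missing_before, missing_after)
```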

Expected behavior

It would be convenient if SDV could extend its metadata file to allow custom null values per column. This information could then be used to convert the custom null values to np.nan before the data reaches the NullTransformer.
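A rough sketch of what this could look like, written as a standalone pandas helper. The `custom_null_values` mapping and the `apply_custom_nulls` function are hypothetical names for illustration only; neither is an existing SDV metadata field or API:

```python
import numpy as np
import pandas as pd

# Hypothetical per-column specification (not an existing SDV metadata field).
custom_null_values = {"age": [999, -1], "income": [-1]}


def apply_custom_nulls(data, null_values):
    """Return a copy of `data` with each column's placeholders replaced by NaN."""
    out = data.copy()
    for column, placeholders in null_values.items():
        if column in out.columns:
            # mask() sets entries to NaN wherever the condition is True.
            out[column] = out[column].mask(out[column].isin(placeholders))
    return out


# Toy data (assumed): 999 is a placeholder in "age" but a real value in "income".
data = pd.DataFrame({
    "age": [34, 999, -1],
    "income": [-1, 50000, 999],
    "city": ["a", "b", "c"],
})
converted = apply_custom_nulls(data, custom_null_values)
print(converted.isnull().sum())
```

Keeping the mapping per column means the same number can be a placeholder in one column and a legitimate value in another.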

npatki commented 2 years ago

Thanks for the request @MLjungg. A few follow-up questions that will help us prioritize the feature:

  1. In your experience, is there typically a single placeholder value, or can there be multiple values that mean different things? E.g., 999 and -1 can both be present but have different meanings.
  2. How are you planning to use the synthetic data? Is it your expectation that the same custom missing values will appear in the synthetic data too?