sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

I want the ID column length should match the given regex pattern #2229

Open Veeresh1996 opened 1 week ago

Veeresh1996 commented 1 week ago

Problem description

I am using HMASynthesizer for multi-table data, and I am able to generate data with the trained model. However, for the columns that I have marked as IDs, the length of the generated values does not match the real data, even though I have specified a regex pattern. For example, one of the ID columns contains 6-digit values, but the generated values have varying lengths. Real data ID value: 300164. Generated value: 2690.

What I already tried

This is the metadata for that specific field: "patnum": { "sdtype": "id", "regex_format": "^\d{6}$" }. Could you please look into it ASAP? Please let me know if you need any other info.

Thanks in advance

npatki commented 1 week ago

Hi @Veeresh1996, SDV is designed to ensure that the synthetic data matches (a) the regex format that you provide and (b) the original data type of the real data. In your case, it seems like the two are in conflict with each other: the regex describes 6-digit strings, but it appears to me that the original data type is an integer.

The regex may correctly produce strings such as "002690", but when converted to an integer, this becomes 2690 (no longer 6 characters). So the regex is not really compatible with the data type. To fix this issue, you would have to address the root cause of the mismatch.
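To make the mismatch concrete, here is a minimal sketch in plain Python. It does not call SDV. It only illustrates the string-to-integer conversion described above, assuming the real column is stored as integers. The helper name `matches_format` is made up for this illustration.

```python
import re

# The regex_format from the metadata in this issue.
PATTERN = re.compile(r"^\d{6}$")


def matches_format(value) -> bool:
    """Return True if the text form of `value` is exactly six digits.

    Hypothetical helper for illustration only, not part of SDV.
    """
    return PATTERN.fullmatch(str(value)) is not None


generated = "002690"         # a string that the regex can legitimately produce
as_integer = int(generated)  # what happens if the column's data type is integer

print(matches_format(generated))   # True
print(as_integer)                  # 2690
print(matches_format(as_integer))  # False: the leading zeros are gone

# Keeping the ID as text preserves its width. Zero-padding restores it here.
restored = str(as_integer).zfill(6)
print(restored)                    # 002690
```

If the real data lives in a pandas DataFrame, one possible way to remove the conflict is to convert the ID column to strings (for example with `astype(str)`) before fitting, so that the data type agrees with the regex. Whether that suits your tables, in particular any keys that reference this column, is something to check against your own data.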

Veeresh1996 commented 1 week ago

Hey Neha, thanks, the solution that you provided works for me. I have two follow-up questions:

  1. Is it possible to generate duplicate values in ID columns? For example, I want six-digit values, and it is OK for a value to repeat within the same field.
  2. I have null values in one of my ID columns (it is not a primary key or foreign key, just a column of unique values). I want to generate the same kind of data, with unique values and nulls in that field. How can I achieve that?