sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Generate synthetic data for the selected columns #1539

Closed stackprep9 closed 1 year ago

stackprep9 commented 1 year ago

Environment details

  - SDV version: 1.2.1
  - Python version: 3.8.8
  - Operating system: macOS

General question:

Can we generate synthetic data for only selected columns, so that those columns are synthetic and the remaining columns stay the same as the input data?

If so, can you share any reference docs for that?

majidliaquat commented 1 year ago

Hi @stackprep9 ,

What if you try splitting your data and passing only those columns to generate synthetic data?

Please take a look at the Colab link for a demo: SDV Question 1539

If this is not what you asked for, let me know.

Thanks, Majid

npatki commented 1 year ago

Hi @stackprep9, nice to meet you. If I understand correctly, there are a few columns that you want to keep as-is and you want to create new, synthetic values for the remaining columns? Could you help me better understand the usage (or share a few more details about your project)?

The way to do this is:

  1. Create a synthesizer and train it on the full dataset. Such a synthesizer will learn patterns between all columns of your dataset.
  2. After you train this synthesizer, use the sample_remaining_columns method. You can pass in the columns you already know (from the real data or elsewhere). The SDV Synthesizer will use the learned patterns to predict what the remaining columns should be.
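The two steps above could be sketched roughly as follows. This is a minimal sketch, not an official recipe: it assumes a single-table dataset and SDV 1.x, and it picks `GaussianCopulaSynthesizer` only as an example synthesizer. The helper names `build_fitted_synthesizer` and `resample_unknown_columns` are hypothetical and not part of SDV.

```python
import pandas as pd


def build_fitted_synthesizer(real_data):
    """Step 1: train a synthesizer on the full dataset (all columns)."""
    # Imported here so the helper below can be reused with any fitted
    # synthesizer that exposes sample_remaining_columns.
    from sdv.metadata import SingleTableMetadata
    from sdv.single_table import GaussianCopulaSynthesizer

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(real_data)

    # Example choice of synthesizer (an assumption, not a recommendation).
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(real_data)
    return synthesizer


def resample_unknown_columns(synthesizer, real_data, columns_to_keep):
    """Step 2: keep some real columns and let the synthesizer fill in the rest."""
    # One row of known values per output row we want back.
    known_columns = real_data[columns_to_keep].reset_index(drop=True)
    return synthesizer.sample_remaining_columns(known_columns=known_columns)
```

With a DataFrame `real_data` and, say, hypothetical columns `age` and `city` that should stay as-is, usage would look like `resample_unknown_columns(build_fitted_synthesizer(real_data), real_data, ["age", "city"])`. Because the synthesizer was trained on every column, the generated values can follow the patterns it learned between the kept columns and the rest.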

FYI @majidliaquat, I took a look at the Colab notebook. Unfortunately, I'm not sure if this approach would work.

Let me know if you have any questions!

majidliaquat commented 1 year ago

Hi @npatki, yes, I got it: the generated data was not correlated because we only generated a part of the data. But I followed that method based on how I understood the question from @stackprep9:

Can we generate synthetic data for only selected columns, so that those columns are synthetic and the remaining columns stay the same as the input data?

Thanks for the clarification.

npatki commented 1 year ago

No problem! Thanks for sharing the notebook too. Very useful to see what you're experimenting with :)

npatki commented 1 year ago

Hi everyone, since this issue has been inactive for a while and we've provided some suggestions, I'm closing this off as answered. If there's anything more to discuss, feel free to reply and I can always reopen the issue.