sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Adjustable Target Feature Distribution #2039

Closed fatihcihant closed 3 months ago

fatihcihant commented 4 months ago

Problem description

I'm using the SDV library to generate synthetic data from recommender-system datasets (Avazu, Criteo). My target feature is 'label'. With most synthesizers the 'label' distribution in the synthetic data matches the real data, but with some (TVAESynthesizer) it doesn't. I'd like to make this distribution adjustable. For example, if the rate of '1' for the target feature is 0.30 in the real dataset, I want to set it to 0.45 in the synthetic dataset. I checked the docs but couldn't find any parameter for doing that.

srinify commented 3 months ago

Hi @fatihcihant just to make sure I'm understanding your use case, I have a few quick questions!

"label distribution on synthetic data is same on real data, but in some algorithms doesnt match"

Are you saying that the distribution of values for the label column is very similar when comparing real data and synthetic data, except for when using TVAE Synthesizer?

One thing to double check is whether this column's sdtype in the metadata is set to categorical or numerical. Categorical is the way to go here because SDV will try to retain the same distribution of categories in the synthetic data for this column!
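A minimal sketch of that check, assuming SDV 1.x, where a `SingleTableMetadata` object exposes a `columns` dict and an `update_column` method. The helper name `ensure_categorical` is made up for illustration and is not part of SDV.

```python
def ensure_categorical(metadata, column_name="label"):
    """Switch a column's sdtype to 'categorical' if it is anything else.

    Assumes `metadata` behaves like sdv.metadata.SingleTableMetadata:
    a `columns` dict keyed by column name, plus `update_column`.
    Returns the sdtype the column had before the call.
    """
    previous = metadata.columns[column_name].get("sdtype")
    if previous != "categorical":
        metadata.update_column(column_name=column_name, sdtype="categorical")
    return previous


# Usage sketch (needs sdv installed and a pandas DataFrame named `real_data`):
# from sdv.metadata import SingleTableMetadata
# metadata = SingleTableMetadata()
# metadata.detect_from_dataframe(real_data)
# ensure_categorical(metadata, "label")
```

Auto-detection may treat a 0/1 column as numerical, so it is worth running a check like this before fitting the synthesizer.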

"for example on real dataset target feature rate for '1' is 0.30, i wanna adjust to 0.45 on synthetic dataset."

Are you saying that the value 1 occurs 30% of the time in the real data but you want it to occur 45% of the time in the synthetic data? Generally, SDV tries to model & learn the patterns in your real data. So if you want 45% (or 0.45) of the values in this column to be 1, I would recommend boosting your real data (for example, by oversampling the rows where label is 1) so that pattern is present for SDV to learn.
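One way to do that boosting is to duplicate positive rows until they reach the target share. The sketch below is plain Python over a list of row dicts; the function names, the `label` key, and the 0/1 encoding are assumptions for illustration. With a pandas DataFrame, the same idea could use `DataFrame.sample(n=..., replace=True)` on the positive rows.

```python
import math
import random


def positives_needed(n_pos, n_neg, target_rate):
    """Extra positive rows required so positives make up `target_rate`."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    # Solve pos / (pos + n_neg) >= target_rate for pos.
    required_pos = math.ceil(target_rate * n_neg / (1 - target_rate))
    return max(0, required_pos - n_pos)


def oversample(rows, label_key="label", positive=1, target_rate=0.45, seed=0):
    """Return rows plus randomly duplicated positives (drawn with replacement)."""
    positives = [r for r in rows if r[label_key] == positive]
    negatives = [r for r in rows if r[label_key] != positive]
    if not positives:
        raise ValueError("no positive rows to duplicate")
    extra = positives_needed(len(positives), len(negatives), target_rate)
    rng = random.Random(seed)
    return list(rows) + rng.choices(positives, k=extra)


# 300 positives out of 1000 rows (30%) boosted to roughly 45%.
real_rows = [{"label": 1}] * 300 + [{"label": 0}] * 700
boosted = oversample(real_rows, target_rate=0.45)
```

Duplicating rows only shifts the label balance. It adds no new information about the positive class, so the synthesizer still learns positives from the same original examples.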

Alternatively, if you have very specific values for some rows you want in your synthetic data and you want SDV to "fill in the rest" -- you can explore using conditional sampling.
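A rough sketch of how conditional sampling could produce a chosen label rate, assuming SDV 1.x (`Condition` from `sdv.sampling` and the synthesizer's `sample_from_conditions` method) and a binary label stored as the integers 0 and 1. The helper names are hypothetical.

```python
def label_row_counts(total_rows, positive_rate):
    """Split total_rows into (n_positive, n_negative) for the requested rate."""
    if not 0 <= positive_rate <= 1:
        raise ValueError("positive_rate must be between 0 and 1")
    n_pos = round(total_rows * positive_rate)
    return n_pos, total_rows - n_pos


def sample_with_label_rate(synthesizer, total_rows=1000, positive_rate=0.45,
                           column="label"):
    """Sample from an already fitted SDV synthesizer with a fixed label split."""
    # Imported lazily so label_row_counts works without SDV installed.
    from sdv.sampling import Condition

    n_pos, n_neg = label_row_counts(total_rows, positive_rate)
    conditions = [
        Condition(num_rows=n, column_values={column: value})
        for n, value in ((n_pos, 1), (n_neg, 0))
        if n > 0  # skip empty conditions
    ]
    return synthesizer.sample_from_conditions(conditions=conditions)


# Usage sketch (needs a synthesizer that has been fitted on the real data):
# synthetic = sample_with_label_rate(synthesizer, 1000, 0.45)
```

The condition values must match the values and dtype of the label column in the real data. The returned rows may come back grouped by condition, so shuffle them if row order matters downstream.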

Also, as a side note, SDV is tested and designed to learn from raw, enterprise data, not data that's already heavily post-processed, featurized, etc. I just wanted to share that bit of design philosophy with you as well!

Let me know if these suggestions help or if I misunderstood your use case!

srinify commented 3 months ago

Hi there @fatihcihant I haven't heard from you in a while so I'm going to move forward with closing this issue out. Feel free to comment here and tag me or open a new issue if you still have more questions here.