Data generation for imbalanced dataset.

sdv-dev / SDV

Synthetic data generation for tabular data

https://docs.sdv.dev/sdv


Data generation for imbalanced dataset. #1339

Closed Sanchita333 closed 1 year ago

Sanchita333 commented 1 year ago

Hi, we are generating synthetic tabular data for an imbalanced dataset using SDV. We use conditional sampling to generate rows for the minority class, and then append the generated rows to the original data to balance the classes: new_data = model.sample(num_rows=560, conditions={'Response': 1})
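The balancing step described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `balance_with_synthetic`, `sample_minority`, and `fake_sampler` are hypothetical names, rows are plain dicts rather than a DataFrame, and the sampler is passed in as a callable so that any conditional sampling call (such as the one quoted above) could be wrapped and plugged in.

```python
from collections import Counter


def balance_with_synthetic(real_rows, label, minority, sample_minority):
    """Append synthetic minority rows until the minority matches the largest class.

    real_rows: list of dicts, one per row.
    sample_minority: hypothetical callable that takes a row count and returns
        that many synthetic minority-class rows. With SDV it would wrap a
        conditional sampling call (an assumption, not shown here).
    """
    counts = Counter(row[label] for row in real_rows)
    needed = max(counts.values()) - counts[minority]
    if needed <= 0:
        return list(real_rows)  # already balanced, nothing to add
    synthetic = sample_minority(needed)
    # Guard against a sampler that returns rows of the wrong class.
    synthetic = [row for row in synthetic if row[label] == minority]
    return list(real_rows) + synthetic


# Stand-in sampler for demonstration only; a real one would call the model.
def fake_sampler(n):
    return [{"Age": 40, "Response": 1} for _ in range(n)]


real = [{"Age": 30, "Response": 0}] * 5 + [{"Age": 50, "Response": 1}] * 2
balanced = balance_with_synthetic(real, "Response", 1, fake_sampler)
print(Counter(row["Response"] for row in balanced))
```

If this approach is used, the augmentation should be applied to the training split only, so that the test split still contains nothing but real rows.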

We have also tried upsampling techniques such as SMOTE and compared the results. We observed that the classification report for the data balanced with upsampling is better than the one for the data balanced with SDV-generated rows.
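For a comparison like this to be fair, each variant (imbalanced, SDV, SMOTE) would need to be scored on the same held-out set of real rows, with attention to the minority class rather than overall accuracy. The sketch below computes the minority-class figures that a classification report contains; `per_class_report` is a hypothetical helper and the labels are toy values, not results from this thread. In practice, scikit-learn's `classification_report` produces the same quantities.

```python
def per_class_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 and support for one class, treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Toy labels for illustration only: 6 majority rows, 4 minority rows.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
report = per_class_report(y_true, y_pred, positive=1)
print(report)
```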

I have a few questions:
1. Upsampling techniques are much simpler and give good results, so why use synthetic data?
2. Is there any other way to improve the data generated with SDV?
3. Is there any other way to evaluate the results?

npatki commented 1 year ago

Hi @Sanchita333, nice to meet you.

The ability of synthetic data to improve label balancing is highly dependent on the details of your particular dataset and the methods that you are using to generate synthetic data. You may find this blog post relevant to your project.

As for improving the quality of synthetic data, I think this also depends on which model you are using and how it performs when you look at the different columns. The SDMetrics library has some great resources for helping you evaluate real vs. synthetic data -- by generating reports, visualizing the data, applying metrics, etc.
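As a rough illustration of what a column-level comparison between real and synthetic data can look like, the sketch below computes the complement of the total variation distance between the category frequencies of one column (1.0 means identical frequencies, 0.0 means no overlap). This is a pure-Python sketch of the idea, not SDMetrics's own implementation; `tv_complement` and the toy columns are hypothetical.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """Return 1 minus the total variation distance between category frequencies."""
    if not real_values or not synthetic_values:
        raise ValueError("both columns must contain at least one value")
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories that are missing on one side.
    distance = 0.5 * sum(
        abs(real_freq[c] / n_real - synth_freq[c] / n_synth) for c in categories
    )
    return 1.0 - distance


# Toy categorical column for illustration only.
real_col = ["Yes"] * 8 + ["No"] * 2
synth_col = ["Yes"] * 5 + ["No"] * 5
score = tv_complement(real_col, synth_col)
print(round(score, 3))
```

Running a check like this per column can point to the specific columns where a synthesizer drifts from the real data.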

If you can describe more about your dataset and project setup, we may be able to provide more guidance. For example, what kinds of methods are you using to do the final classifications?

Sanchita333 commented 1 year ago

Hi, I am using the dataset at https://www.kaggle.com/code/nageshsingh/modeling-imbalanced-insurance-data/input. Below is a comparison of the two techniques (synthetic data generation and SMOTE).

Case 1 (imbalanced dataset):

Case 2 (using SDV with CTGAN):

Case 3 (using SMOTE):

npatki commented 1 year ago

Hi @Sanchita333, this is interesting! Indeed, both SMOTE and synthetic data may be useful for this case.

Is there a reason why you are using CTGAN? It's hard to know exactly what's going on inside of a GAN -- and I wonder if it's over- or under-fitting certain values. If you wish to explore further, I'd encourage the following:

Try varying the number of epochs in CTGAN
Try a parametric model such as GaussianCopula
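One way to organize the two suggestions above is a small harness that builds each candidate configuration, scores it with the same downstream metric, and ranks the results. The sketch below is hypothetical throughout: `compare_candidates`, `StubModel`, and `stub_fit_and_score` are illustrative names, and the scores are arbitrary placeholders rather than measurements. The comments show how SDV synthesizers might be plugged in, assuming the SDV 1.x class names `CTGANSynthesizer` and `GaussianCopulaSynthesizer`; older SDV versions use different names.

```python
def compare_candidates(candidates, fit_and_score):
    """Score every candidate with the same function and rank best first.

    candidates: dict mapping a name to a zero-argument factory that builds
        an untrained model.
    fit_and_score: callable(model) -> float, higher is better. In a real run
        it would fit the model, sample minority rows, retrain the classifier
        and return, for example, minority-class F1 on real held-out data.
    """
    results = {name: fit_and_score(factory()) for name, factory in candidates.items()}
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# With SDV 1.x (an assumption), the factories might look like:
#   "ctgan_100": lambda: CTGANSynthesizer(metadata, epochs=100)
#   "ctgan_500": lambda: CTGANSynthesizer(metadata, epochs=500)
#   "copula":    lambda: GaussianCopulaSynthesizer(metadata)

class StubModel:
    """Stand-in for a synthesizer; carries an arbitrary placeholder score."""
    def __init__(self, placeholder_score):
        self.placeholder_score = placeholder_score


def stub_fit_and_score(model):
    return model.placeholder_score


candidates = {
    "config_a": lambda: StubModel(0.2),
    "config_b": lambda: StubModel(0.9),
    "config_c": lambda: StubModel(0.5),
}
ranking = compare_candidates(candidates, stub_fit_and_score)
for name, value in ranking:
    print(name, value)
```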

npatki commented 1 year ago

Hi @Sanchita333, I'm closing this issue as it's been a few weeks since we last discussed the question. Please feel free to reply if there are more follow-ups -- we can reopen the issue to continue.