sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

What is the limit of the learning of CTGAN and CopulaGAN #219

Closed JulienGervai closed 4 years ago

JulienGervai commented 4 years ago

Description

Hello, first of all thanks for the great framework you are providing. I would like to know what the limits of CTGAN and CopulaGAN are. For example, CopulaGAN will learn (or try to learn) the best distributions for the variables in the dataset. But what about the links between the variables, which can be a lot more complex?

To give an example, let's take a dataframe containing some people with their age, job, salary etc. CopulaGAN will learn that the age follows, let's say, a Gaussian, and I think I understand (from your explanations and some tests) that it will learn that, given a job, the salary can be higher (in general the salary of traders will certainly be higher than that of nurses, for example). My question is: at what point does it become difficult for the model to learn the links between variables?

A second question I have in mind: have you tested CopulaGAN in terms of machine learning performance, like I saw you did with SDGym?

Thank you !

JulienGervai commented 4 years ago

To illustrate my first post, here is a picture where we can see that the distribution is learnt well. It represents the charges given some information about some people. In this case the plot covers the whole dataset:

Capture

But if we now look only at the smokers, we get:

Capture2

Here we can see that the distribution is not learnt as well.

Maybe I am making a mistake, or maybe the model is not trained enough (500 epochs here). Please let me know if you can answer my questions.
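The comparison described above (charges over the whole dataset versus charges for smokers only) can also be put into numbers rather than judged by eye. Below is a minimal sketch, not taken from SDV: the helper names `ks_statistic` and `charges` are hypothetical, and the rows are made-up toy data, not the dataset from the plots. It computes a two-sample Kolmogorov-Smirnov statistic once on the full `charges` column and once on the `smoker == "yes"` subset, to show that a marginal can match while a conditional distribution does not.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    # The gap can only change at observed values, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def charges(rows, smoker=None):
    """Return the 'charges' values, optionally restricted to one 'smoker' value."""
    return [r["charges"] for r in rows if smoker is None or r["smoker"] == smoker]


# Made-up "real" rows: smokers have clearly higher charges than non-smokers.
real = [{"smoker": "yes", "charges": 30 + i} for i in range(10)]
real += [{"smoker": "no", "charges": 5 + i} for i in range(10)]

# Made-up "synthetic" rows: same charges values, but the smoker label is
# reassigned by position, so the link between the two columns is lost.
synthetic = [
    {"smoker": "yes" if i % 2 == 0 else "no", "charges": r["charges"]}
    for i, r in enumerate(real)
]

marginal_ks = ks_statistic(charges(real), charges(synthetic))
smoker_ks = ks_statistic(charges(real, "yes"), charges(synthetic, "yes"))

print(marginal_ks)  # 0.0: the overall distribution of charges matches exactly
print(smoker_ks)    # 0.5: the smokers-only distribution does not
```

With real data, the same two calls could be run on the original table and on the sampled table after 500 and after 2000 epochs, to check whether the conditional statistic actually shrinks as training gets longer.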

JulienGervai commented 4 years ago

I am sorry, my example was not a good one, because if I train the model for longer (2000 epochs) I get:

vizSmoker