sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Implementing GAN modeler #105

Closed zuberek closed 3 years ago

zuberek commented 5 years ago

Description

I would be very interested in using modern GANs, for example your TGAN, as a modeler. TGAN can handle both categorical and continuous data, so it seems like a good fit. Is there a straightforward way to use it with SDV, or to update SDV to support it?

I would be interested in doing it myself, but I'm afraid I would need some guidance. SDV is built on the CPA idea, which I imagine is not applicable in a GAN scenario. Would generating the fields directly make sense? Or can the distributions and covariances still be used?

Really sorry if these questions are not applicable to the repo.

csala commented 5 years ago

Hi @zuberek

Adding GAN modeling is definitely one of the biggest items on our roadmap right now. But, as you mentioned, CPA does not seem to be the optimal approach when it comes to GANs, so we are currently in the middle of (offline) discussions about an alternative approach. This will certainly be a major development, and it needs to be carefully thought through before jumping into coding.

Knowing that there is further interest in this and that you'd like to get involved, we'll try to prioritize it and to share feedback from these offline discussions here soon, in order to make them more visible and allow you and possibly other people to contribute to them.

So, thanks for opening the issue, and welcome on board!

zuberek commented 5 years ago

Hi @csala, sorry, I sent an email to the address on your GitHub profile, but it seems I didn't reach you. How can I contact you?

zuberek commented 4 years ago

Hiya, I keep coming back to this problem.

I believe a conditional GAN (where conditions are attached to the noise fed into the Generator and the whole GAN is trained with them) is able to capture the needed cross-table relations. Each table would have a different GAN model. After training, when the tables are synthesized sequentially, values from the first table can be fed as conditions for the generation of the next table.
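
To make the sampling side of this idea concrete, here is a minimal sketch using NumPy only. It is not SDV or CTGAN code. The generators are untrained, randomly initialised two-layer networks, and `make_generator`, the layer sizes, and the fixed number of children per parent are all assumptions made for illustration. Training (the discriminator and the adversarial loss) is omitted entirely; the sketch only shows how a condition is concatenated to the noise and how parent rows are passed on as conditions for the child table.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_generator(noise_dim, cond_dim, out_dim, hidden=16):
    """Build an untrained two-layer generator (hypothetical helper).

    A real model would learn w1 and w2 adversarially; here they are random.
    """
    w1 = rng.normal(scale=0.1, size=(noise_dim + cond_dim, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, out_dim))

    def generate(cond):
        # One noise vector per requested row.
        noise = rng.normal(size=(cond.shape[0], noise_dim))
        # Conditioning step: attach the condition to the noise input.
        z = np.concatenate([noise, cond], axis=1)
        return np.tanh(z @ w1) @ w2

    return generate


NOISE_DIM = 8  # assumed noise size

# One generator per table. The parent table has no parent of its own,
# so its condition has zero width.
parent_gen = make_generator(NOISE_DIM, cond_dim=0, out_dim=3)
child_gen = make_generator(NOISE_DIM, cond_dim=3, out_dim=2)

n_parents, children_per_parent = 5, 2  # assumed fixed fan-out

# Step 1: synthesize the parent table.
parent_rows = parent_gen(np.empty((n_parents, 0)))

# Step 2: repeat each synthetic parent row once per child and feed it
# as the condition when synthesizing the child table.
child_cond = np.repeat(parent_rows, children_per_parent, axis=0)
child_rows = child_gen(child_cond)

# Foreign key: index of the parent row each child row was conditioned on.
parent_index = np.repeat(np.arange(n_parents), children_per_parent)
```

In a real setup the number of children per parent would itself have to be modeled, and categorical columns would need an encoding before being used as conditions.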

It would take longer to train all the models, but I think it might work. It would also have all the issues of normal GANs, but it's a first step, after which CTGAN or similar can be integrated. I'm trying it out on my own. Would you be able to give me some feedback, or be interested in implementing the idea here together?

csala commented 3 years ago

This was already implemented with the introduction of CTGAN and derivatives, so closing this for now.