jamestgodwin / synthetic_data

MADWUD hackathon

CTGAN issue #3

Closed jamestgodwin closed 3 years ago

jamestgodwin commented 3 years ago

@Bergam0t, I've noticed that i_ds1.dataset.head() at the beginning produces a data frame that includes the target variable along with a final column, Unnamed: 13.

Do your CTGAN functions deal with this? Including the target variable may be the source of the data leakage, and the final column (which we can ignore) may be what is causing the slightly weird results. The output is creating unwanted values for the Unnamed: 13 column, which is what helped me pick up the issue! Have you had a look at Mike's code?

https://github.com/jamestgodwin/synthetic_data_pilot/blob/main/01_wisconsin/03b_CTGAN_log_regression.ipynb

Hope this helps!

Bergam0t commented 3 years ago

That's a very good point about the final column - I hadn't twigged exactly what that was. Will drop that and rerun.
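For reference, a minimal sketch of one way to drop that column with pandas. The CSV text and the helper name drop_unnamed_columns are made up for illustration and are not from the repo. The sketch assumes the extra column comes from a trailing comma on each line of the CSV, which is a common cause of an all-empty "Unnamed: N" column but has not been confirmed for this dataset.

```python
import io

import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that pandas auto-named 'Unnamed: N'.

    Hypothetical helper for illustration; not part of the repo or of SDV.
    """
    keep = ~df.columns.astype(str).str.startswith("Unnamed:")
    return df.loc[:, keep].copy()


# Made-up CSV with a trailing comma on every line, which makes
# pandas.read_csv create an extra, empty "Unnamed: 3" column.
csv_text = "radius,texture,diagnosis,\n1.0,2.0,M,\n3.0,4.0,B,\n"
raw = pd.read_csv(io.StringIO(csv_text))

cleaned = drop_unnamed_columns(raw)
```

Doing this before the data frame is handed to the synthesiser should stop it from generating values for the empty column.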

Have also now added a notebook which is just Mike's version tweaked very slightly to work with the new dataset.

I think I've now dealt with my leakage issue by making the train/test split of the real data occur when the SDVInputDataset object is initialised, so a consistent chunk of data is used for generating all the models and the same test data (which hasn't been included when fitting) is looked at throughout. However, that only works for the metrics that have been informing my testing... I might need to look at making it more joined-up with classifier.ipynb.
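Roughly, the idea looks like the sketch below. This is not the actual SDVInputDataset code: the class name SplitOnInitDataset, its attributes, and the toy data are hypothetical stand-ins. It assumes the data frame has a unique index, and it uses plain pandas sampling for the split rather than whatever the notebook really does.

```python
import pandas as pd


class SplitOnInitDataset:
    """Hypothetical stand-in showing a train/test split fixed at initialisation."""

    def __init__(self, dataset: pd.DataFrame, test_frac: float = 0.25,
                 random_state: int = 42) -> None:
        self.dataset = dataset
        # Hold out the test rows once, with a fixed seed, so every model
        # is fitted on the same training rows.
        self.test = dataset.sample(frac=test_frac, random_state=random_state)
        # Everything not held out is available for fitting the synthesiser.
        # Relies on the index being unique.
        self.train = dataset.drop(index=self.test.index)


# Toy data standing in for the real dataset.
real = pd.DataFrame({"feature": range(8), "target": [0, 1] * 4})

ds = SplitOnInitDataset(real)
ds_again = SplitOnInitDataset(real)
```

Because the split happens once in the constructor, anything that fits on ds.train and evaluates on ds.test sees the same rows each time, and the test rows never reach the model during fitting.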

And regarding keeping the target column in, I think this is OK for the SDV package (https://sdv.dev/SDV/user_guides/single_table/ctgan.html), but I could be wrong...

jamestgodwin commented 3 years ago

I think you're right about the target column. I wasn't totally sure, but what you've sent makes sense!

Don't worry about things not being too joined up with classifier.ipynb, as all I need is the one synthetic dataset you think is best, to compare against the other generation methods!