[Closed] jamestgodwin closed this issue 3 years ago
That's a very good point about the final column - I hadn't twigged exactly what that was. Will drop that and rerun.
Have also now added a notebook which is just Mike's version tweaked very slightly to work with the new dataset.
I think I've now dealt with my leakage issue by making the test/train split of the real data happen when the SDVInputDataset object is initialised. That way a consistent chunk of data is used for fitting all the models, and the same test data (which is never included in fitting) is used for evaluation throughout. However, that only holds for the metrics that have been informing my testing... I might need to look at making it more joined-up with classifier.ipynb.
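For what it's worth, here's a minimal sketch of the split-at-initialisation idea (this is not the repo's actual `SDVInputDataset` class, just a hypothetical illustration of holding out a fixed test set once so every synthesiser fits on the same rows):

```python
import pandas as pd

class SDVInputDataset:
    """Hypothetical sketch: hold out a fixed test split at initialisation
    so every generative model is fitted on exactly the same training rows."""

    def __init__(self, df: pd.DataFrame, test_frac: float = 0.25, seed: int = 42):
        # Sample the test set once, with a fixed seed, so the split is
        # identical across runs and across generation methods.
        self.test = df.sample(frac=test_frac, random_state=seed)
        # Training data is everything not in the held-out test set.
        self.train = df.drop(self.test.index)

# Example: 100 rows -> 75 train / 25 test, always the same rows.
df = pd.DataFrame({"x": range(100), "target": [i % 2 for i in range(100)]})
ds = SDVInputDataset(df)
```

Fitting CTGAN (or any other method) only on `ds.train` and evaluating on `ds.test` should then be leak-free by construction.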
And re: keeping the target column in, I think this is OK for the SDV package (https://sdv.dev/SDV/user_guides/single_table/ctgan.html), but I could be wrong...
I think you're right about the target column, I wasn't totally sure but what you've sent makes sense!
Don't worry about things not being too joined-up with classifier.ipynb; all I need is the one synthetic dataset you think is best, to compare against the other generation methods!
@Bergam0t I've noticed that running

`i_ds1.dataset.head()`

at the beginning produces a data frame that includes the target variable along with a final column, `Unnamed: 13`. Do your CTGAN functions deal with these? The target variable may be the source of the data leakage, and the final column (which we can ignore) may be what's causing the slightly weird results. The output is creating unwanted values in the `Unnamed: 13` column, which is what helped pick up the issue! Have you had a look at Mike's code?
https://github.com/jamestgodwin/synthetic_data_pilot/blob/main/01_wisconsin/03b_CTGAN_log_regression.ipynb
Hope this helps!