No label column ('default payment next month') while synthesizing data using diffuser model

sattarov / FinDiff

Implementation of the paper: "FinDiff: Diffusion Models for Financial Tabular Data Generation"


No label column ('default payment next month') while synthesizing data using diffuser model #4

Open pnimeesha opened 1 month ago

pnimeesha commented 1 month ago

Hi,

I ran the sample code from the Google Colab notebook. The samples generated by the diffusion model do not have labels. For the credit card data used in the Colab code, the label column is 'default payment next month'. How can I run the machine learning efficacy evaluation (called utility in the paper; the code for it is not in the Colab) when the evaluation models you mention are all supervised (Random Forest, Decision Trees, Logistic Regression, AdaBoost, and Naive Bayes)? I wrote my own code for the utility metric and, while testing it, realised that the label column is missing from the synthetic data. Can you please let me know how this can be done without labels in the synthetic data?

Thanks in advance!

sattarov commented 1 month ago

Hi,

The labels are part of the training and sampling process, as shown in the Google Colab notebook. Below is the sampling step where the label is fed into the model. [Screenshot of the sampling step, 2024-06-09]

These labels correspond to the generated samples, so you only need to concatenate the generated samples with these labels if you want them in a single dataframe. Hope that helps.

Best regards, Timur
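The concatenation described in the reply above could look roughly like the sketch below. It is not taken from the FinDiff notebook: the helper `attach_labels`, the variable names, and the three example feature columns are hypothetical stand-ins, and it assumes the generated samples and the labels passed to the sampler are row-aligned arrays of equal length.

```python
import numpy as np
import pandas as pd

LABEL_COL = "default payment next month"


def attach_labels(samples, labels, feature_names, label_col=LABEL_COL):
    """Combine generated feature rows with the labels used during sampling.

    Hypothetical helper: assumes row i of `samples` was generated
    conditioned on `labels[i]`.
    """
    samples = np.asarray(samples)
    labels = np.asarray(labels).reshape(-1)
    if samples.shape[0] != labels.shape[0]:
        raise ValueError("samples and labels must have the same number of rows")
    df = pd.DataFrame(samples, columns=feature_names)
    df[label_col] = labels  # append the label as the last column
    return df


# Stand-in data; in practice these would come from the sampling step.
rng = np.random.default_rng(0)
feature_names = ["LIMIT_BAL", "AGE", "BILL_AMT1"]  # illustrative subset only
generated = rng.normal(size=(6, len(feature_names)))
sampling_labels = np.array([0, 1, 0, 0, 1, 1])

synthetic_df = attach_labels(generated, sampling_labels, feature_names)
print(synthetic_df.head())
```

If the sampler returns PyTorch tensors, converting them with `.detach().cpu().numpy()` before calling the helper should work. The resulting dataframe has the same layout as a labelled real dataset, so it can be passed to a supervised classifier for a train-on-synthetic, test-on-real utility check.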

pnimeesha commented 1 week ago

Hi,

Thank you so much for the clarification! I have a couple more questions:

1. For the Philadelphia data, you selected only a few of the existing columns. Can you please let me know the reasoning behind your column selection (including the label) in this case?
2. Can you please share the hyperparameters you used for the credit card data and the Philadelphia data? It would be really helpful.
3. What hyperparameters did you use to run the TVAE model for both the credit card and Philadelphia data? The paper reports very good scores for TVAE as well, but my TVAE scores are much lower than those in the paper.

Thank you!