glederrey / DATGAN

Directed Acyclic Tabular GAN (DATGAN) for integrating expert knowledge in synthetic tabular data generation
GNU General Public License v3.0

Performance Metrics #2

Closed (erenarkangil closed 9 months ago)

erenarkangil commented 2 years ago

Hi there,

Thank you for sharing this repo. Your methodology is really impressive. I am a PhD-level researcher at UQ, currently trying to use this algorithm for population upsampling. However, I am stuck on interpreting the SRMSE results:

Since I am upsampling the population, the difference between the prediction and the sample data is high. If I generated a synthetic sample of the same size as the real data, this wouldn't be a problem. I could not find this information in the paper. What would be the strategy for comparing an upsampled dataset?

Alternatively, I could use percentages, but I did not feel comfortable doing so because SRMSE is already normalized by the mean, and I am not sure it would make sense to use percentages in that case.

Kind Regards

glederrey commented 2 years ago

Hi,

Thanks for your message. I'm really happy that people are trying my methodology for their own project. =)

So, I used SRMSE (and other metrics) because they are the usual tests in the transportation literature. My goal was just to add a more systematic/robust way of using these metrics. But I'm not a fan of them. For example, what does an SRMSE of 0.1 mean compared to a value of 1? I have no idea. So, it's quite tough to compare. That's why, in the end, I ranked the models based on the results without comparing the differences in values.
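On the sample-size question specifically: one way to keep SRMSE comparable when the synthetic sample is larger than the real one is to compute it on relative frequencies instead of raw counts. Below is a minimal sketch, assuming a single categorical column and one common formulation of SRMSE (the RMSE over bin frequencies divided by the mean real frequency). The function name `srmse` and the toy data are illustrative and are not taken from the DATGAN code.

```python
import math
from collections import Counter


def srmse(real, synthetic):
    """SRMSE between the category frequencies of two samples.

    Works on relative frequencies, so the samples may differ in size.
    (Assumed formulation: RMSE over bins / mean real frequency.)
    """
    real_counts = Counter(real)
    syn_counts = Counter(synthetic)
    bins = sorted(set(real_counts) | set(syn_counts))

    # Relative frequency of each bin in each sample.
    p = [real_counts[b] / len(real) for b in bins]
    q = [syn_counts[b] / len(synthetic) for b in bins]

    rmse = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(bins))
    mean_real = sum(p) / len(bins)  # equals 1 / len(bins)
    return rmse / mean_real


# Toy data (made up for illustration).
real = ["car"] * 60 + ["bus"] * 30 + ["bike"] * 10
upsampled = ["car"] * 600 + ["bus"] * 300 + ["bike"] * 100  # 10x larger, same shares
shifted = ["car"] * 500 + ["bus"] * 300 + ["bike"] * 200    # 10x larger, different shares

print(srmse(real, upsampled))
print(srmse(real, shifted))
```

With this formulation, a synthetic sample that is ten times larger but has the same category shares gives an SRMSE of zero, so only differences in the distribution contribute to the score. It does not solve the interpretation problem mentioned above: the value is still hard to read on its own.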

In your case, it seems that you have a well-defined application for this methodology. Thus, I think you would benefit from using a metric that reflects this application. We had a long discussion about synthetic data assessment during my PhD defense, and the conclusion was that we should use application-based metrics. For example, I worked with a student to develop a way to assess synthetic data when it is used to augment a dataset.

The idea is to check how adding synthetic data to real data affects the accuracy of an ML model. So, we chose an ML method (in this case, logistic regression, which is a bad choice since it does not necessarily need more data to perform better). First, we trained the model using only real data. Then, we augmented the dataset such that it contained 10% synthetic data, then 20%, etc., until we reached 50% synthetic data and 50% real data. Once you have this, you can see how the accuracy evolves with the proportion of synthetic data. An example of a result is shown below:

image
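The augmentation procedure described above can be sketched as follows. This is only an illustration under assumptions: the names `augmentation_curve` and `train_and_score` are hypothetical, the data is a made-up one-feature toy problem, and the nearest-class-mean classifier is a stand-in that you would replace with the ML method relevant to your application.

```python
import random


def augmentation_curve(real_train, synthetic, test, train_and_score,
                       fractions=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    """Score a model as synthetic rows are added to a fixed real training set.

    Each fraction (must be < 1) is the share of synthetic rows in the
    augmented set. `train_and_score(train_rows, test_rows)` is a
    user-supplied callable that fits a model and returns its test accuracy.
    """
    rng = random.Random(seed)
    results = []
    for frac in fractions:
        # Keep all real rows; add n_syn rows so that n_syn / total == frac.
        # Requires n_syn <= len(synthetic).
        n_syn = round(len(real_train) * frac / (1.0 - frac))
        extra = rng.sample(synthetic, n_syn)
        results.append((frac, train_and_score(real_train + extra, test)))
    return results


def centroid_accuracy(train, test):
    """Toy stand-in classifier: nearest class mean on a single feature."""
    sums = {}
    for x, y in train:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    means = {y: total / count for y, (total, count) in sums.items()}
    hits = sum(min(means, key=lambda y: abs(means[y] - x)) == label
               for x, label in test)
    return hits / len(test)


def make_rows(n, rng, shift=0.0):
    """Two Gaussian classes on one feature; `shift` mimics generator bias."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * label + shift, 1.0), label))
    return rows


rng = random.Random(42)
real_train = make_rows(200, rng)
test = make_rows(500, rng)                   # held-out real data
synthetic = make_rows(400, rng, shift=0.3)   # pretend generator output

curve = augmentation_curve(real_train, synthetic, test, centroid_accuracy)
for frac, acc in curve:
    print(f"{frac:.0%} synthetic: accuracy {acc:.3f}")
```

The test set contains only real rows in this sketch, so the curve shows how the synthetic rows affect performance on real data. Plotting accuracy against the synthetic fraction gives the kind of graph described above.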

We also did a similar test where we did not augment the dataset but simply replaced the real data with synthetic data as shown below:

image
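The replacement variant differs from the augmentation test only in how the training set is built: the total size stays fixed and a share of the real rows is swapped for synthetic ones. Here is a minimal sketch of that step, assuming rows are stored in plain lists. The function name `replace_with_synthetic` and the fractions used are hypothetical choices, not values from the original experiment.

```python
import random


def replace_with_synthetic(real_train, synthetic, fraction, seed=0):
    """Return a training set of the same size as `real_train` in which
    `fraction` of the rows have been swapped for synthetic ones."""
    rng = random.Random(seed)
    n_syn = round(len(real_train) * fraction)
    # Keep a random subset of the real rows and fill the gap with
    # randomly chosen synthetic rows (requires n_syn <= len(synthetic)).
    kept = rng.sample(real_train, len(real_train) - n_syn)
    return kept + rng.sample(synthetic, n_syn)


# Rows are tagged with their source so the mix can be inspected.
real_train = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(100)]

for fraction in (0.0, 0.25, 0.5, 1.0):
    mixed = replace_with_synthetic(real_train, synthetic, fraction)
    n_syn = sum(source == "synthetic" for source, _ in mixed)
    print(f"fraction={fraction}: {len(mixed)} rows, {n_syn} synthetic")
```

Each mixed set would then be used to train the chosen model and scored on held-out real data, giving accuracy as a function of the replaced share.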

Using these two graphs, the conclusions we reached were:

The DATGAN models in these two graphs were very early versions of the current model. I haven't run the same experiment with the final model. But in the end, it shows that in the context of data augmentation, you can have a visual representation of your results without using SRMSE. Of course, this was early work and would need to be much more robust and better defined (use a different ML algorithm, test multiple variables, etc.).

So, in the end, my message for you is to come up with your own assessment method that reflects the application for which you're using synthetic data. The assessment needs to be in a controlled environment but it should be "similar" to the final application.

I hope that what I tried to explain is clear and that it might give you some ideas for your case. =) Don't hesitate if you have more questions. We could even have a Zoom meeting to discuss it in more detail. =)