sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Multi table DiagnosticReport sets `synthetic_sample_size` too low for `NewRowSynthesis` #320

Closed echatzikyriakidis closed 1 year ago

echatzikyriakidis commented 1 year ago

`real_data` is actually a dict, and the code here counts the number of keys (tables) instead of the number of rows.

https://github.com/sdv-dev/SDMetrics/blob/d5af0d1d8135f7a3d06eb610474a3bcfa8268b8e/sdmetrics/reports/multi_table/diagnostic_report.py#L92
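A minimal sketch of the symptom, not the actual library code: it assumes the linked line does something equivalent to `min(len(real_data), 1000)`. The table names and sizes are made up, and plain lists of rows stand in for pandas DataFrames (`len()` returns the row count for both).

```python
# Hypothetical multi-table data: table name -> rows.
# Lists stand in for DataFrames; len() gives the row count either way.
real_data = {
    "users": [{"id": i} for i in range(5000)],
    "sessions": [{"id": i} for i in range(300)],
}

# Assumed shape of the buggy computation: len() of a dict counts its keys,
# so this yields the number of tables rather than a number of rows.
buggy_sample_size = min(len(real_data), 1000)

# The row counts that the computation presumably meant to look at.
row_counts = {name: len(table) for name, table in real_data.items()}

print(buggy_sample_size)
print(row_counts)
```

With two tables, the sample size collapses to 2, however many rows each table holds.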

npatki commented 1 year ago

@echatzikyriakidis thanks for filing. I'll label this as a bug and update the title to reflect the issue that this is causing.

Fixing this may require a small refactor. The intention is to set the sample size for each table in the multi-table schema to the minimum of that table's row count and 1000. Since tables may have different sizes, this value may differ from table to table.
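The intended behavior could be sketched roughly as follows. This is only an illustration of the rule described here, not the eventual fix: `per_table_sample_sizes` is a hypothetical helper name, the example tables are made up, and lists of rows stand in for pandas DataFrames.

```python
def per_table_sample_sizes(real_data, cap=1000):
    """Return a sample size per table: min(row count, cap).

    `real_data` maps table names to table-like objects that support len()
    (for example pandas DataFrames, or lists of rows as used below).
    """
    return {name: min(len(table), cap) for name, table in real_data.items()}


# Hypothetical multi-table data with tables of different sizes.
real_data = {
    "users": [{"id": i} for i in range(5000)],
    "sessions": [{"id": i} for i in range(300)],
}

sizes = per_table_sample_sizes(real_data)
print(sizes)
```

Each table gets its own value, so a large table is capped at 1000 while a smaller table keeps its full row count.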

echatzikyriakidis commented 1 year ago

Hi @npatki,

Thank you for your fast reply. Yes, indeed, the sample size needs to be recalculated for each table in the relational schema, and probably a different instance of `NewRowSynthesis` needs to be created for each? I hope this doesn't require a lot of refactoring and effort.

How frequently do you push changes in a new release? Do you think you can fix this soon? For now, I am using the mentioned hack to work around the problem, but it is not a clean solution.

Thank you!