@echatzikyriakidis thanks for filing. I'll label this as a bug and update the title to reflect the issue that this is causing.
Fixing this may require a small refactor. The intention is to set the sample size for each table in the multi-table schema to the minimum of that table's row count in the real data and 1000. Since tables may have different sizes, this value may differ from table to table.
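A minimal sketch of that intent, assuming real_data is a dict mapping table names to tables. In SDMetrics the tables are pandas DataFrames; plain lists of rows stand in here so the snippet has no dependencies, and len() returns the row count for both. The names per_table_sample_size and MAX_SAMPLE_SIZE are hypothetical and not part of SDMetrics.

```python
MAX_SAMPLE_SIZE = 1000  # the cap mentioned above; the constant name is hypothetical


def per_table_sample_size(real_data, cap=MAX_SAMPLE_SIZE):
    """Return {table_name: min(row count, cap)} for every table in the schema."""
    return {name: min(len(table), cap) for name, table in real_data.items()}


# Stand-in tables of different sizes (lists of rows instead of DataFrames).
real_data = {
    "users": [{"id": i} for i in range(250)],
    "sessions": [{"id": i} for i in range(4000)],
}

sizes = per_table_sample_size(real_data)
print(sizes)
```

Each table gets its own value, so a small table is sampled in full while a large one is capped.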
Hi @npatki,
Thank you for the fast reply. Yes, indeed, the sample size needs to be recalculated for each table in the relational schema, and presumably a separate instance of NewRowSynthesis needs to be created for each one? I hope this doesn't require a lot of refactoring effort.
How frequently do you push changes in a new release? Do you think you can fix this soon? For now I am using the hack I mentioned to work around the problem, but it is not a clean solution.
Thank you!
real_data is actually a dict, and here the code counts the number of keys instead of rows: https://github.com/sdv-dev/SDMetrics/blob/d5af0d1d8135f7a3d06eb610474a3bcfa8268b8e/sdmetrics/reports/multi_table/diagnostic_report.py#L92
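To illustrate the difference, here is a small sketch. It assumes the size is derived as min(len(real_data), 1000), which is my reading of the linked line and not a verbatim copy of it. Lists of rows stand in for the DataFrames, and the table names are made up.

```python
# Multi-table data: a dict of table name -> table (lists stand in for DataFrames).
real_data = {
    "users": [{"id": i} for i in range(250)],
    "sessions": [{"id": i} for i in range(4000)],
}

# len() on the dict counts its keys, i.e. the number of tables, not rows.
tables_counted = len(real_data)
buggy_size = min(tables_counted, 1000)

# Row counts have to be read from each table individually.
rows_per_table = {name: len(table) for name, table in real_data.items()}

print(buggy_size)
print(rows_per_table)
```

With two tables the computed size is 2, regardless of how many rows each table holds.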