sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Multitable Demo Datasets have additional columns like add_numerical, aggregations and nb_rows_in_{related table} #1776

Closed martinjurkovic closed 8 months ago

martinjurkovic commented 9 months ago

Environment details

Problem description

Looking at the columns in the multi-table demo datasets, all of the datasets containing _v1 in the name (which is almost all of them) have additional columns in their tables. The code and the AWS bucket are listed at the bottom of the issue.

For example, in the dataset Accidents_v1, the table upravna_enota has the following columns in addition to the original data: add_numerical,sum(x),sum(y),sum(x_wgs84),sum(y_wgs84),max(x),max(y),max(x_wgs84),max(y_wgs84),min(x),min(y),min(x_wgs84),min(y_wgs84),sum(starost),sum(vozniski_staz_LL),sum(vozniski_staz_MM),sum(alkotest),sum(strokovni_pregled),max(starost),max(vozniski_staz_LL),max(vozniski_staz_MM),max(alkotest),max(strokovni_pregled),min(starost),nb_rows_in_nesreca,nb_rows_in_oseba

I believe these are additional features: aggregations of related table columns (mean, max, min, sum), row counts (nb_rows_in_{related table}) and a sum of numerical values in a column (add_numerical). The _v1 datasets contain some of these features, but not always all of them. The columns for these features also have entries in the corresponding metadata files for the datasets.

If they are additional features, I believe they shouldn't be in the demo datasets. The HMA model tries to model them, and the evaluation report takes them into account when calculating its scores. I don't think that makes sense, since these columns are not in the original datasets and are just derived from other columns.

Therefore my question is whether these columns were intentionally added to the datasets and should be modelled and evaluated, or whether they should be removed and calculated only when a specific use case needs them.
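
To illustrate what removing them could look like, here is a minimal sketch that separates the suspected derived columns from the rest by name. The pattern is an assumption based only on the column names listed above, `split_columns` is a hypothetical helper (not part of SDV), and `some_original_column` is a placeholder name:

```python
import re

# Assumed naming patterns, inferred from the column names quoted in this issue:
# add_numerical, sum(...)/max(...)/min(...)/mean(...), and nb_rows_in_<table>.
DERIVED_COLUMN_PATTERN = re.compile(
    r"add_numerical|(sum|max|min|mean)\(.+\)|nb_rows_in_.+"
)


def split_columns(columns):
    """Return (original, derived) lists of column names, preserving order."""
    original, derived = [], []
    for name in columns:
        if DERIVED_COLUMN_PATTERN.fullmatch(name):
            derived.append(name)
        else:
            original.append(name)
    return original, derived


# "some_original_column" is a placeholder, not a real column of the dataset.
columns = [
    "some_original_column",
    "add_numerical",
    "sum(x)",
    "max(starost)",
    "nb_rows_in_nesreca",
]
original, derived = split_columns(columns)
print("keep:", original)
print("drop:", derived)
```

The matching entries would also have to be removed from the metadata, which this sketch does not attempt.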

What I already tried

The bucket URL I am referencing is: https://github.com/sdv-dev/SDV/blob/74baae90eb64abf52a5ea3e55b2017ef849fec6d/sdv/datasets/demo.py#L23

The code I am using:

from sdv.datasets.demo import get_available_demos, download_demo

sdv_relational_datasets = get_available_demos('multi_table')

npatki commented 9 months ago

Hi @martinjurkovic,

You are correct in that the additional columns in these datasets contain aggregate statistics of related columns in other tables (mean, min, sum, etc.). Additionally, I believe that the add_numerical column is not an original column of the data -- but rather some random numerical noise.

I can confirm that such columns are probably not the best example for multi-table synthesizers. Our ML-based models are best for data that contain statistical patterns/trends. Any column that can be completely re-created from a pre-determined formula is not the best candidate for ML.
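
As a rough illustration of "re-created from a pre-determined formula", the sketch below recomputes a row count and a sum for each parent row from a child table. This is an assumption about how such columns could be derived, not the code that produced the demo datasets; the toy tables and the helper `add_child_aggregates` are hypothetical:

```python
from collections import defaultdict


def add_child_aggregates(parent_rows, child_rows, key, value_col, child_name):
    """Attach nb_rows_in_<child> and sum(<value_col>) to each parent row."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for row in child_rows:
        counts[row[key]] += 1
        sums[row[key]] += row[value_col]
    return [
        {
            **parent,
            f"nb_rows_in_{child_name}": counts.get(parent[key], 0),
            f"sum({value_col})": sums.get(parent[key], 0.0),
        }
        for parent in parent_rows
    ]


# Hypothetical toy data; names and values are illustrative only.
parent_rows = [{"parent_id": 1}, {"parent_id": 2}, {"parent_id": 3}]
child_rows = [
    {"child_id": 10, "parent_id": 1, "x": 5.0},
    {"child_id": 11, "parent_id": 1, "x": 2.5},
    {"child_id": 12, "parent_id": 3, "x": 1.0},
]

result = add_child_aggregates(parent_rows, child_rows, "parent_id", "x", "child")
for row in result:
    print(row)
```

Because the values follow exactly from the child table, they can be computed on the synthetic data afterwards instead of being learned by the model.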

I've created a feature request at #1788 for tracking this in the longer term.

(For some context, SDV started out as a research project many years ago, and I believe a few of the demo datasets are remnants of that original research.)

npatki commented 8 months ago

I'm going to close this issue off since we've answered the question. I'm marking this as a duplicate of #1788, where we can continue the discussion. Please feel free to reply there if you have any other feedback or followups. Thanks.