Closed martinjurkovic closed 8 months ago
Hi @martinjurkovic,
You are correct in that the additional columns in these datasets contain aggregate statistics of related columns in other tables (mean, min, sum, etc.). Additionally, I believe that the add_numerical column is not an original column of the data, but rather some random numerical noise.
I can confirm that such columns are probably not the best example for multi-table synthesizers. Our ML-based models are best for data that contain statistical patterns/trends. Any column that can be completely re-created from a pre-determined formula is not the best candidate for ML.
I've created a feature request at #1788 for tracking this in the longer term.
(For some context, SDV started out as a research project many years ago, and I believe a few of the demo datasets are remnants of that original research.)
I'm going to close this issue off since we've answered the question. I'm marking this as a duplicate of #1788, where we can continue the discussion. Please feel free to reply there if you have any other feedback or followups. Thanks.
Environment details
Problem description
Looking at the columns in the multi-table demo datasets, all of the datasets containing _v1 in the name (which is almost all of them) have additional columns in their tables. The code and the AWS bucket are listed at the bottom of the issue.
Taken from the dataset Accidents_v1, table upravna_enota, there are the following columns in addition to the original data: add_numerical, sum(x), sum(y), sum(x_wgs84), sum(y_wgs84), max(x), max(y), max(x_wgs84), max(y_wgs84), min(x), min(y), min(x_wgs84), min(y_wgs84), sum(starost), sum(vozniski_staz_LL), sum(vozniski_staz_MM), sum(alkotest), sum(strokovni_pregled), max(starost), max(vozniski_staz_LL), max(vozniski_staz_MM), max(alkotest), max(strokovni_pregled), min(starost), nb_rows_in_nesreca, nb_rows_in_oseba
I believe these are additional features: aggregations of related table columns (mean, max, min, sum), row counts (nb_rows_in_{related table}), and a sum of numerical values in a column (add_numerical). Some of these, but not always all of them, are present in the _v1 datasets. Columns for these features also have an entry in the corresponding metadata files for the datasets.
If they are additional features, I believe they shouldn't be in the demo datasets, because the HMA model then tries to model them and the evaluation report also takes them into account. I believe this does not make sense, since these columns are not actually in the original datasets and are just derived from other columns.
Therefore my question is whether these columns were intentionally added to the datasets and should be modelled and evaluated, or whether they should be removed and calculated only if a specific use case needs them.
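For anyone who wants to separate these columns out in the meantime, here is a minimal sketch of how the derived columns could be picked out by name. The patterns are an assumption based only on the column names listed above (aggregation(column), nb_rows_in_{related table}, add_numerical) and may not cover every _v1 dataset. The helper names is_derived_column and split_columns, and the column some_original_column, are hypothetical and not part of SDV.

```python
import re

# Assumed naming patterns, inferred only from the column names quoted in this issue.
DERIVED_PATTERNS = [
    re.compile(r"^(sum|min|max|mean)\(.+\)$"),  # aggregations such as sum(x)
    re.compile(r"^nb_rows_in_.+$"),             # row counts of a related table
    re.compile(r"^add_numerical$"),             # extra numerical column
]


def is_derived_column(name):
    """Return True if the column name matches one of the assumed patterns."""
    return any(pattern.match(name) for pattern in DERIVED_PATTERNS)


def split_columns(columns):
    """Split column names into (original, derived), preserving order."""
    original = [c for c in columns if not is_derived_column(c)]
    derived = [c for c in columns if is_derived_column(c)]
    return original, derived


# "some_original_column" is a placeholder, not a real column of upravna_enota.
example_columns = [
    "some_original_column",
    "add_numerical",
    "sum(x)",
    "max(starost)",
    "nb_rows_in_nesreca",
]
original, derived = split_columns(example_columns)
print("original:", original)
print("derived:", derived)
```

With pandas, the derived list could then be passed to DataFrame.drop(columns=derived) for each table. The matching entries would also need to be removed from the dataset metadata, since the metadata files list these columns too.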
What I already tried
The bucket URL I am referencing is: https://github.com/sdv-dev/SDV/blob/74baae90eb64abf52a5ea3e55b2017ef849fec6d/sdv/datasets/demo.py#L23
The code I am using: