sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Demo datasets do not have referential integrity (`Carcinogenesis_v1`, `Toxicology_v1`) #1779

Closed npatki closed 2 months ago

npatki commented 5 months ago

Problem Description

To model multi-table data, SDV expects every foreign key value in a child table to match a primary key value in its parent table. In other words, the dataset must have referential integrity, with no orphan child rows.
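As a rough illustration of what "orphan children" means here, the sketch below checks one parent/child relationship in plain Python. The table contents, the column names and the `find_orphans` helper are all hypothetical and are not taken from the demo datasets or from SDV.

```python
def find_orphans(parent_rows, child_rows, primary_key, foreign_key):
    """Return the child rows whose foreign key matches no parent primary key."""
    known_keys = {row[primary_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in known_keys]


# Hypothetical toy tables: one parent table and one child table.
parents = [{"parent_id": "p1"}, {"parent_id": "p2"}]
children = [
    {"child_id": "c1", "parent_id": "p1"},
    {"child_id": "c2", "parent_id": "p3"},  # "p3" does not exist in parents
]

orphans = find_orphans(parents, children, "parent_id", "parent_id")
print(orphans)
```

A dataset has referential integrity for this relationship only when the returned list is empty.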

Of all the demo datasets, Carcinogenesis_v1 and Toxicology_v1 do not have referential integrity and so cannot be modeled by any of the multi-table synthesizers.

Detailed Output

The metadata itself is valid, but the referential integrity is broken. Below is the output of calling metadata.validate_data(data) on these datasets.

output.txt

Fix

TBD. We can either remove these datasets from the demo altogether or find a subsample of rows that maintains referential integrity.

npatki commented 2 months ago

This was fixed as part of SDV #1788.

We used the drop_unknown_references feature to remove any unknown foreign key values. Both datasets now have referential integrity and work with the SDV synthesizers.
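For anyone curious what removing unknown foreign key values involves, here is a minimal plain-Python sketch of the idea. It is not SDV's implementation of drop_unknown_references; the `drop_orphans` helper, the relationship tuples and the toy tables are hypothetical. It repeats the pass until nothing changes, on the assumption that dropping rows from one table can orphan rows further down the hierarchy.

```python
def drop_orphans(tables, relationships):
    """Drop child rows whose foreign key has no match, until nothing changes.

    tables: dict mapping table name to a list of row dicts.
    relationships: list of (parent, primary_key, child, foreign_key) tuples.
    """
    cleaned = {name: list(rows) for name, rows in tables.items()}
    changed = True
    while changed:
        changed = False
        for parent, primary_key, child, foreign_key in relationships:
            known_keys = {row[primary_key] for row in cleaned[parent]}
            kept = [row for row in cleaned[child] if row[foreign_key] in known_keys]
            if len(kept) != len(cleaned[child]):
                cleaned[child] = kept
                changed = True
    return cleaned


# Hypothetical three-level hierarchy: parents -> children -> grandchildren.
tables = {
    "parents": [{"id": 1}],
    "children": [{"id": 10, "parent_id": 1}, {"id": 11, "parent_id": 2}],
    "grandchildren": [{"id": 100, "child_id": 10}, {"id": 101, "child_id": 11}],
}
relationships = [
    ("children", "id", "grandchildren", "child_id"),
    ("parents", "id", "children", "parent_id"),
]

cleaned = drop_orphans(tables, relationships)
print(cleaned)
```

In this toy data, child 11 points at a missing parent, so it is dropped, which in turn orphans grandchild 101 and removes it on the next pass.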