sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Support nullable foreign keys in HMA #2063

Closed rwedge closed 1 month ago

rwedge commented 3 months ago

Problem Description

HMASynthesizer does not currently support null values in foreign key columns. Adding the ability to handle null values for foreign keys would expand the range of datasets HMA can model.

Expected behavior

HMASynthesizer can fit on data that contains null values in foreign key columns, and the sampled data reflects the presence of those nulls.
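One way to check this expectation is to compare the share of null foreign keys in the real and sampled child tables. The sketch below is illustrative only: it uses plain lists of dicts with `None` for a missing key, and the helper names `null_fraction` and `nulls_are_reflected` and the tolerance are assumptions, not part of SDV.

```python
def null_fraction(rows, foreign_key):
    """Return the share of rows whose foreign key is missing (None)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row[foreign_key] is None)
    return missing / len(rows)


def nulls_are_reflected(real_rows, sampled_rows, foreign_key, tolerance=0.1):
    """Hypothetical check: null shares in real and sampled data are close."""
    real = null_fraction(real_rows, foreign_key)
    sampled = null_fraction(sampled_rows, foreign_key)
    return abs(real - sampled) <= tolerance


# Toy child tables: 2 of 5 real rows and 1 of 3 sampled rows have no parent.
real_rows = [
    {"user_id": 1}, {"user_id": 1}, {"user_id": 2},
    {"user_id": None}, {"user_id": None},
]
sampled_rows = [{"user_id": 7}, {"user_id": None}, {"user_id": 8}]
```

With real data the columns would typically be pandas Series with `NaN` values; the list-of-dicts form is used here only to keep the example self-contained.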

Additional context

Changes to the fit process: When generating extension columns for a child table, treat null as a valid foreign key value and calculate the extension row values for a "null parent". Store these null parent extension values separately from the parent table, but keep them retrievable for sampling.
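The fit-side idea can be sketched roughly as follows. This is not HMA's implementation: the per-group summary (row count and mean of one numeric column) is a simplified stand-in for the parameters HMA would actually store, and `build_extensions` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def build_extensions(child_rows, foreign_key, value_column):
    """Summarize child rows per foreign key value, treating None as a valid key.

    Returns (extensions for real parents, extension for the null parent).
    The null parent extension is None when no child row has a missing key.
    """
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[foreign_key]].append(row[value_column])

    extensions = {
        key: {"num_rows": len(values), "mean": mean(values)}
        for key, values in groups.items()
    }
    # Keep the null parent's values out of the parent-table extensions,
    # but return them so they stay available at sampling time.
    null_extension = extensions.pop(None, None)
    return extensions, null_extension


child_rows = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 1, "amount": 30.0},
    {"user_id": 2, "amount": 5.0},
    {"user_id": None, "amount": 8.0},
    {"user_id": None, "amount": 12.0},
]
parent_extensions, null_parent_extension = build_extensions(
    child_rows, "user_id", "amount"
)
```

Here `parent_extensions` holds one entry per real parent id, while `null_parent_extension` is stored on its own because there is no parent row to attach it to.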

Changes to sampling: When creating a child table, HMA should reserve some rows to be generated by the null parent's child synthesizer, in proportion to the percentage of null foreign keys in the relationship used to create the child table (perhaps this could be handled in _enforce_table_size). When finding parent_ids for the other foreign key columns of a child table, treat the stored null parent extension row as one more parent candidate and create a corresponding synthesizer to get likelihoods from.
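A minimal sketch of the two sampling steps, under stated assumptions: `split_child_rows` and `choose_parent` are hypothetical helpers rather than SDV functions, the null parent is represented by the key `None`, and the likelihoods are taken as given instead of being computed by a child synthesizer.

```python
import random


def split_child_rows(total_rows, null_fraction):
    """Split a child table's row budget into (rows with a parent, null-parent rows)."""
    null_rows = round(total_rows * null_fraction)
    return total_rows - null_rows, null_rows


def choose_parent(likelihoods, rng):
    """Pick a parent id, with the null parent (key None) as one more candidate.

    `likelihoods` maps each candidate parent id to a non-negative weight.
    """
    candidates = list(likelihoods)
    weights = [likelihoods[candidate] for candidate in candidates]
    if sum(weights) == 0:
        # No candidate is favored, so fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
with_parent, without_parent = split_child_rows(10, 0.2)
picked = choose_parent({1: 0.0, 2: 0.0, None: 1.0}, rng)
```

In this toy call, two of the ten rows are set aside for the null parent, and `picked` is `None` because the null parent is the only candidate with a non-zero weight.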