sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Unique constraints weren't applied properly when applied to a relational table #1102

Open lynneq opened 1 year ago

lynneq commented 1 year ago

Error Description

When a Unique constraint is applied to account_number in a child table, the account_number values in the sampled data should be unique, but they are not.

Steps to reproduce

Reusing the example from https://sdv.dev/SDV/user_guides/relational/constraints.html:

from sdv.relational import HMA1
from sdv import load_demo, Metadata
from sdv.constraints import FixedCombinations, Unique

tables = load_demo()
print(tables)
tables['transactions']['account_number'] = [
    "016319083411",
    "016319083412",
    "016319083413",
    "016319083414",
    "016319083415",
    "016319083416",
    "016319083417",
    "016319083418",
    "016319083419",
    "016319083420",
]
metadata = Metadata()

metadata.add_table(
    name='users',
    data=tables['users'],
    primary_key='user_id'
)
constraint = FixedCombinations(column_names=['device', 'os'])
metadata.add_table(
    name='sessions',
    data=tables['sessions'],
    primary_key='session_id',
    parent='users',
    foreign_key='user_id',
    constraints=[constraint],
)

unique_constraint = Unique(column_names=['account_number'])
metadata.add_table(
    name='transactions',
    data=tables['transactions'],
    primary_key='transaction_id',
    parent='sessions',
    foreign_key='session_id',
    constraints=[unique_constraint]
)

metadata.get_table_meta('transactions')

model = HMA1(metadata)
model.fit(tables)
new_data = model.sample()

Result of transactions looks like this:

0,0,2019-01-26 17:50:48,99.40000,False,016319083420
1,1,2019-01-18 14:53:59,82.40000,False,016319083420
2,2,2019-01-14 22:07:44,85.30000,False,016319083411
3,3,2019-01-29 12:10:48,103.10000,False,016319083420
4,4,2019-01-02 01:40:07,118.50000,True,016319083411
5,5,2019-01-13 20:03:52,86.00000,False,016319083411
6,6,2019-01-04 09:33:45,71.60000,False,016319083411
7,7,2019-01-03 16:38:37,84.80000,False,016319083411
8,8,2019-01-17 19:38:09,101.40000,False,016319083420
9,9,2019-01-13 08:32:19,123.80000,True,016319083411

016319083411 and 016319083420 each appear multiple times, and none of the other eight account numbers from the training data appear at all.
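For anyone who wants to check this without reading the rows by eye, here is a minimal sketch that counts repeated values. It uses only the standard library and the account numbers copied from the result above; `find_duplicates` is a hypothetical helper name, not part of the SDV.

```python
from collections import Counter

# Account numbers copied, in row order, from the sampled transactions above.
sampled_account_numbers = [
    "016319083420", "016319083420", "016319083411", "016319083420",
    "016319083411", "016319083411", "016319083411", "016319083411",
    "016319083420", "016319083411",
]


def find_duplicates(values):
    """Return {value: count} for every value that occurs more than once.

    Hypothetical helper for illustration only; not an SDV function.
    """
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n > 1}


duplicates = find_duplicates(sampled_account_numbers)
print(duplicates)  # a non-empty dict means the column is not unique
```

On the real output, the same check can presumably be done directly on the sampled table with pandas, for example `new_data['transactions']['account_number'].duplicated().any()`.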

Note: This doesn't happen if the unique constraint is on a parent table.

npatki commented 1 year ago

Thanks for filing @lynneq -- I can replicate this issue. It seems to be in the constraints logic because I can even replicate it if the column is numerical instead of categorical.

We will soon be refactoring some parts of the constraints code in the SDV, so I suspect it will resolve whatever's causing this. I'm keeping this open so that we can keep an eye on it.

npatki commented 1 year ago

Update: It seems like even with the new refactor, Unique is still having issues due to how it's designed. We'll continue to keep this issue open to provide future updates or workarounds.

One possible workaround: the new SDV 1.0 (Beta!) release introduces the concept of alternate keys. You can specify a column as an alternate key, and the SDV will make sure that the values in that column are unique.

(Note that the key must be of type "text" or another PII type -- see new descriptions for more details.)