sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Wrong number of child rows in some multi-parent scenarios #535

Open pvk-developer opened 3 years ago

pvk-developer commented 3 years ago

Error Description

In a multi-parent scenario, the first parent has the correct number of children, but the remaining parents may end up with a different number of children.

Example

In this example we expect both parents to have the same number of child rows, but this is not the case.

import pandas as pd
import sdv

parent_a = pd.DataFrame({
    'parent_id': range(5),
    'value': range(5)
})

parent_b = pd.DataFrame({
    'parent_id': range(5),
    'value': range(5)
})

child = pd.DataFrame({
    'parent_a': range(5),
    'parent_b': range(5),
    'value': range(5)
})

tables = {
    'parent_a': parent_a,
    'parent_b': parent_b,
    'child': child
}

metadata = sdv.Metadata()
metadata.add_table('parent_a', parent_a, primary_key='parent_id')
metadata.add_table('parent_b', parent_b, primary_key='parent_id')
metadata.add_table('child', child)
metadata.add_relationship('parent_a', 'child', 'parent_a')
metadata.add_relationship('parent_b', 'child', 'parent_b')

model = sdv.SDV()
model.fit(metadata, tables)

sampled = model.sample(num_rows=10)

print(len(sampled['child']['parent_a'].unique()))  # this is 10.
print(len(sampled['child']['parent_b'].unique()))  # this is less than 10.

katxiao commented 2 years ago

Hi @pvk-developer -- Currently, the num_rows parameter only guarantees that sampled['parent_a'] and sampled['parent_b'] both have 10 rows, which appears to be the case. It is not necessarily expected that every parent_b value appears in the child table.

Do you have a use case where you would like to specify the number of child rows per parent?
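
To make the discussion concrete, one way to inspect how child rows are distributed across each parent is to count rows per foreign key value and keep the parents that received none. The sketch below uses only pandas; `children_per_parent` is a hypothetical helper (not part of SDV), and the `sampled` dictionary is hand-written stand-in data rather than real SDV output.

```python
import pandas as pd


def children_per_parent(parent, child, primary_key, foreign_key):
    """Count child rows per parent key, keeping parents with zero children.

    Hypothetical helper for illustration; not part of the SDV API.
    """
    counts = child[foreign_key].value_counts()
    # Reindex on the parent keys so parents without children show up as 0.
    return counts.reindex(parent[primary_key], fill_value=0).rename('num_children')


# Hand-written stand-in for a sampled result (assumption: not actual SDV output).
sampled = {
    'parent_a': pd.DataFrame({'parent_id': range(4)}),
    'parent_b': pd.DataFrame({'parent_id': range(4)}),
    'child': pd.DataFrame({
        'parent_a': [0, 1, 2, 3],
        'parent_b': [0, 0, 2, 2],
    }),
}

counts_a = children_per_parent(sampled['parent_a'], sampled['child'], 'parent_id', 'parent_a')
counts_b = children_per_parent(sampled['parent_b'], sampled['child'], 'parent_id', 'parent_b')

print(counts_a.tolist())       # [1, 1, 1, 1]
print(counts_b.tolist())       # [2, 0, 2, 0]
print((counts_b == 0).sum())   # 2 parents in parent_b have no children
```

Running the same helper on the real `sampled` tables from the example would show whether some `parent_b` rows receive zero children while every `parent_a` row receives at least one.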