sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv
Other
2.39k stars 317 forks source link

Key column contains repeating values. #2309

Open jalr4ever opened 10 hours ago

jalr4ever commented 10 hours ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

When I was using SDV for multi-table simulation today and specified a relationship, SDV reported an error saying the parent table's ID has duplicate values: "Error: Key column 'cust_no' contains repeating values." Does this mean SDV doesn't support duplicate values in the parent table's columns?

However, in real-world business tables, both the main table and the child table may have duplicate column values.

For example, in a multi-table simulation like the one below.

import pandas as pd

# parent
data_main = {
    'id': [1, 2, 2, 3, 4],
    'value': ['A', 'B', 'C', 'D', 'E']
}
df_main = pd.DataFrame(data_main)

# child
data_b = {
    'id': [2, 2, 3, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
}
df_b = pd.DataFrame(data_b)
df_merged = df_main.merge(df_b, on='id', how='left')

What I already tried

<Replace with a description of what you already tried and what is the behavior that you observe. If possible, also add below the exact code that you are running.>

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.
jalr4ever commented 10 hours ago

Or, in my current scenario, it's actually two tables where a regular column from the main table is joined with a regular column from the secondary table.It's not the ID column.

srinify commented 5 hours ago

Hi there @jalr4ever 👋

The SDV doesn't support repeating values for the primary key column in a table.

In SDV's metadata implementation:

If you have an ID column that's set as the primary key, you can update the metadata for it to just act as a regular ID column. (But if you have a multi-table setup, keep in mind that you will need to set another column as the primary key in the parent table.)

Most databases require that primary keys be unique, so I'm wondering if your column that contains repeated values is better suited to be set to the non-primary key ID sdtype? Or is it possible that cust_no is actually a foreign key to another parent table?

Or, in my current scenario, it's actually two tables where a regular column from the main table is joined with a regular column from the secondary table.It's not the ID column.

Can you tell more about the logic or pattern that's linking these 2 columns together? In addition, do you mind sharing more about the attributes of these 2 columns:

We're actively researching more complex relationships between tables, so I'd love to learn more about this table pattern here to see if there's an overlap!