Key column contains repeating values.

jalr4ever commented 10 hours ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

SDV version:
Python version:
Operating System:

Problem description

When I was using SDV for multi-table simulation today and specified a relationship, SDV reported an error saying the parent table's ID has duplicate values: "Error: Key column 'cust_no' contains repeating values." Does this mean SDV doesn't support duplicate values in the parent table's columns?

However, in real-world business tables, both the main table and the child table may have duplicate column values.

For example, in a multi-table simulation like the one below.

import pandas as pd

# parent
data_main = {
    'id': [1, 2, 2, 3, 4],
    'value': ['A', 'B', 'C', 'D', 'E']
}
df_main = pd.DataFrame(data_main)

# child
data_b = {
    'id': [2, 2, 3, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
}
df_b = pd.DataFrame(data_b)
df_merged = df_main.merge(df_b, on='id', how='left')

What I already tried

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

jalr4ever commented 10 hours ago

Or, in my current scenario, it's actually two tables where a regular column from the main table is joined with a regular column from the secondary table.It's not the ID column.

srinify commented 5 hours ago

Hi there @jalr4ever 👋

The SDV doesn't support repeating values for the primary key column in a table.

In SDV's metadata implementation:

An ID column can be the primary key for a table (e.g. id column for a guests table) or a non-primary key for a table (e.g. a foreign key column like hotel_id in a guests table)
You can have multiple ID columns that are not the table's primary key but only one that is the primary key, for a given table
The ID column that is set as the primary key for a table must contain a unique value for each row

If you have an ID column that's set as the primary key, you can update the metadata for it to just act as a regular ID column. (But if you have a multi-table setup, keep in mind that you will need to set another column as the primary key in the parent table.)

Most databases require that primary keys be unique, so I'm wondering if your column that contains repeated values is better suited to be set to the non-primary key ID sdtype? Or is it possible that cust_no is actually a foreign key to another parent table?

Or, in my current scenario, it's actually two tables where a regular column from the main table is joined with a regular column from the secondary table.It's not the ID column.

Can you tell more about the logic or pattern that's linking these 2 columns together? In addition, do you mind sharing more about the attributes of these 2 columns:

Do both of the columns contain repeating values, or just the one in the parent table?
Which of the 2, if any, contain missing values
Can the column from the parent table also be joined with other table(s) in your data?
Do both columns represent the same concept in your domain?

We're actively researching more complex relationships between tables, so I'd love to learn more about this table pattern here to see if there's an overlap!

sdv-dev / SDV