sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv
Other
2.21k stars 288 forks source link

Constraints across multiple tables #1122

Open nelsonrogers opened 1 year ago

nelsonrogers commented 1 year ago

It would be cool to be able to set constraints taking values in different tables

Say you have an orders table with a status field and a delivery_date. Say the statusfield contains order_status_id from an order_status table.

Imagine the order_status table as follows: 1 : Processed 2 : Shipped 3 : Delivered

If the order has not yet been delivered, the status field should show 1 or 2.

In this situation, it would be good to have a constraint like a FixedCombination that would check that when a delivery_dateexists, the foreign key in the status field corresponds to Delivered in the order_status table.

npatki commented 1 year ago

Hi @nelsonrogers thanks for filing this issue! We'll keep this open for tracking and use it to update any progress.

A few follow-ups about your scenario:

  1. Are there are any other columns present in the order_status table?
  2. When you create synthetic data, do you wish to create brand new, synthetic order_status rows as well?

I think if order_status is just a reference table (that you do not want to synthesize), there may be other options we can look into such as denormalization.

flintliu commented 1 year ago

I am facing the "constraints across multiple tables" problem as well but with a different scenario.

Say you have a table called task, and child table called sub_task, for each task, it may have one or multiple sub tasks.

Say both task and sub_task table contains column is_finish, the possible values are 1 and 0, and is_finish in task cannot be 1 if it's any sub task is not finished.

montasIET commented 1 year ago

I am facing the "constraints across multiple tables" problem as well but with a different scenario.

Say you have a table called task, and child table called sub_task, for each task, it may have one or multiple sub tasks.

Say both task and sub_task table contains column is_finish, the possible values are 1 and 0, and is_finish in task cannot be 1 if it's any sub task is not finished.

I have a similar problem! Hope this feature will be added shortly

nelsonrogers commented 1 year ago

I'm sorry for not replying earlier, I must have missed the email.

It doesn't have other columns in this case as it was just my testimony data, but in other similar cases, I have come across the same issue.

I do not want to create a new table. To avoid recreating a different table, I have been using Unique as a workaround