sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv
Other
2.3k stars 303 forks source link

Foreign key references multiple tables #2136

Closed ldhlong closed 1 month ago

ldhlong commented 2 months ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

I am experiencing a problem with multitable synthesis. I have two separate parent tables with id’s that I have defined as my primary keys. I have a child table that contains id’s from both parent tables. Both parent tables combined contain all of the foreign keys as all foreign keys must exist on the primary key. There is no option to combine the two parent tables and I am importing all tables as csv files.

parent table name: parent_table1 child table name: child_table parent primary key: id child foreign key: id

parent table name: parent_table2 child table name: child_table primary key:id child foreign key: id

SDV says that “foreign key column id contains unknown references” twice, one for each table when I try to run synthesizer.fit.

What I already tried

I’ve tried to define the relationships back to back within SDV as shown above. I’ve also tried to combine the two parent tables and was told that is not possible.

npatki commented 2 months ago

Hi @ldhlong nice to meet you. This is an interesting dataset you are working with. Is it one of your own, or are you testing it out on some publicly available data?

It will be helpful to clarify the schema. Based on you description above, I think what you're saying is that in the child_table, there is a single foreign key column called id that can refer to either parent_table1 or parent_table2. Is that correct? My question is: How do you know whether a particular foreign key value (in the child table id column) is referring to parent_table1 or parent_table2?

Currently, SDV does not natively support the case where a single foreign key column can refer to multiple different parents. We assume that the entire column must refer to the same table. That is likely what's causing the error.

That being said, there are some workarounds you can apply that we can help with. If you could confirm my understanding above is correct, I can provide some options. (BTW how did you know it is not possible to combine the two parent tables? Who told you?)

ldhlong commented 2 months ago

@npatki Thank you for your quick reply. You are correct that in the child table there is a single foreign key column id that can refer to either parent table 1 or parent table 2. I know which child rows of data connect to which parent table because it’s provided in the account type column. (And the rows are color coded according to each parent as well :-) )

Deeply interested in any work arounds you can provide.

Additionally, it was explained to me that the parent tables only have this id column in common, so they aren’t able to be joined (according to the individual that provided that tables to me.) I even asked if we could join the tables and fill the empty values resulting with null values. This was also confirmed not possible.

npatki commented 2 months ago

Thanks for confirming @ldhlong. This is very helpful info. We've seen this type of pattern before, and I'm going to update the title to reflect the details.

Let's assume you have a child table that looks something like this (the account type controls whether the foreign key goes to parent_table1 or parent_table2).

Account Type ID (Foreign Key)
Parent1 dkcoq-13
Parent2 locps-10
Parent2 ofjwp-93
Parent1 pqqwo-03
... ...

Workaround

The simplest solution would be to split the child table into 2 tables.

Create a child_table1 that just has rows for the first account type. Here, every single value in the id column refers only to parent_table1. Account Type ID (Foreign Key)
Parent1 dkcoq-13
Parent1 pqqwo-03
... ...

And also create a child_table2 for the other account type. Then in your metadata, you would need to supply relationships for each child table.

parent table name: parent_table1
child table name: child_table1
parent primary key: id
child foreign key: id

parent table name: parent_table2
child table name: child_table2
primary key: id
child foreign key: id

Let me know if that makes sense.

Important Note: I would recommend upgrading to the latest SDV version 1.15.0 in order to see the best results for this. After splitting the child table, your schema would be disconnected, so only newer version of SDV will work on it.

ldhlong commented 2 months ago

Is there no other solution that doesn’t involve splitting the child table or manipulating the tables? It’s fine if there isn’t, but I want to be completely sure.

This is because I have 5 tables total and there is another relationship on that child table. If I do split it into two, I fear that would complicate things and put me over the 5 table limit.

npatki commented 2 months ago

Hi @ldhlong, all the workarounds I can think of will require you to manipulate the tables and metadata in some way. This is because the data pattern you are describing is not natively supported by the SDV right now. Though once we add native support, you will no longer have to perform any workarounds (manipulate tables, update metadata, etc.).

This is because I have 5 tables total and there is another relationship on that child table. If I do split it into two, I fear that would complicate things and put me over the 5 table limit.

I think just adding 1-2 more tables shouldn't affect it too badly. HMASynthesizer will warn you if you're going over the 5 table limit, but it will not prevent you from continuing to fit and sample from the data. You may just have to wait slightly longer for the fitting process to finish.

Though if scalability becomes an issue, we have designed the HSASynthesizer precisely to solve it. At that point, you can always contact us to purchase SDV Enterprise.

Hope that helps!

npatki commented 1 month ago

Hi @ldhlong I have now spun up a new feature request in #2153 that we can use to track this data pattern. In the meantime, I'm closing off this issue since we have discussed a workaround for it.

If you have any thoughts about this particular data pattern, please feel free to reply to #2153. Otherwise, for new questions or feature requests, please don't hesitate to file a new issue. Thanks.