Open limhasic opened 1 week ago
Hi @limhasic 👋
I'm a bit confused about exactly what you're asking -- do you mind clarifying a bit further? I understood your question to be -- "Why keep tables laid out in a multi-table pattern when I can just combine them into a single table and use SDV instead that way?" If this is incorrect, let me know!
Here are the key differences:
Single Table: Works best when you have a single identifier column (e.g. `user_id`) that can uniquely link and identify the entities in your data. If you have other columns with identifier-like properties (e.g. `post_id`) in the same dataset, then single table models will not learn the relationships between your primary identifier column (`user_id`) and your secondary one (`post_id`). Your synthetic data may have rows containing `user_id` and `post_id` value pairs that don't exist in your real data.
Multi Table: Supports cases where you have multiple identifier / id columns in your data that have a relational link between them. With Multi Table, you can specify the relationships between identifier columns and SDV will learn to model them more effectively. For example, SDV will maintain referential integrity when generating synthetic data (e.g. the combinations of `user_id` and `post_id` will match the same ones in your real data).
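To make the two failure modes above concrete, here is a minimal sketch in plain Python (no SDV calls) of how you might check synthetic output yourself. The helper names `unseen_pairs` and `orphaned_foreign_keys` and all of the toy rows are hypothetical, made up for illustration; they are not part of SDV.

```python
def unseen_pairs(real_rows, synthetic_rows, columns):
    """Return value combinations that occur in synthetic_rows but never in real_rows."""
    real = {tuple(row[c] for c in columns) for row in real_rows}
    synthetic = {tuple(row[c] for c in columns) for row in synthetic_rows}
    return sorted(synthetic - real)


def orphaned_foreign_keys(parent_rows, child_rows, parent_key, foreign_key):
    """Return foreign-key values in child_rows that match no row in parent_rows."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parent_ids)


# Hypothetical toy data: one flattened table holding both id columns.
real_flat = [
    {"user_id": 1, "post_id": 10},
    {"user_id": 1, "post_id": 11},
    {"user_id": 2, "post_id": 12},
]
synthetic_flat = [
    {"user_id": 1, "post_id": 10},
    {"user_id": 2, "post_id": 11},  # this pairing never occurs in real_flat
]
new_pairs = unseen_pairs(real_flat, synthetic_flat, ["user_id", "post_id"])
print(new_pairs)

# Hypothetical toy data: the same entities kept as two related tables.
users = [{"user_id": 1}, {"user_id": 2}]
posts = [
    {"post_id": 10, "user_id": 1},
    {"post_id": 11, "user_id": 3},  # user 3 is missing from users
]
orphans = orphaned_foreign_keys(users, posts, "user_id", "user_id")
print(orphans)
```

The first check flags the kind of invalid pairing a single table model can produce. The second is a basic referential integrity check: an empty result means every child row points at an existing parent row.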
Hi @limhasic,
To add to this, we always recommend using data that is as close as possible to its original source. The more you modify the data (splitting, joining, etc.), the more logic/dependencies you will be introducing into your dataset. As a result, it becomes much more difficult for SDV synthesizers to learn these patterns out of the box, because they must reverse-engineer all the changes that were introduced.
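As a rough illustration of the kind of dependency a join introduces, the plain-Python sketch below flattens a parent and a child table and then checks whether a parent attribute still has one value per `user_id`. The helpers `join_tables` and `inconsistent_keys`, the `country` column, and every row of data are hypothetical and only meant to show the idea; none of this is SDV code.

```python
def join_tables(parents, children, key):
    """Flatten a parent/child pair into one table (inner join on key)."""
    parents_by_key = {row[key]: row for row in parents}
    return [
        {**parents_by_key[child[key]], **child}
        for child in children
        if child[key] in parents_by_key
    ]


def inconsistent_keys(rows, key, column):
    """Return key values that appear with more than one value of column."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], set()).add(row[column])
    return sorted(k for k, values in seen.items() if len(values) > 1)


# Hypothetical source tables.
users = [{"user_id": 1, "country": "KR"}, {"user_id": 2, "country": "US"}]
posts = [
    {"post_id": 10, "user_id": 1},
    {"post_id": 11, "user_id": 1},
    {"post_id": 12, "user_id": 2},
]

# After the join, country is repeated on every post row of the same user.
flat = join_tables(users, posts, "user_id")
print(inconsistent_keys(flat, "user_id", "country"))

# Hypothetical synthetic rows from a model that did not pick up that rule.
synthetic_flat = [
    {"user_id": 1, "country": "KR", "post_id": 10},
    {"user_id": 1, "country": "US", "post_id": 11},
]
print(inconsistent_keys(synthetic_flat, "user_id", "country"))
```

In the joined table, "country is fixed per user" is an implicit rule that a synthesizer would have to rediscover from the repeated values. In the original two-table layout the same fact is structural, because each user row appears exactly once.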
Hope that helps, and as @srinify mentioned, it would be helpful if you can provide an example to help us clarify the question further. Thanks.
After multi-table synthesis and joining all tables, existing single table synthesis
What is the difference between combining and separating tables?
Could you please explain this in detail?