sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

What is the difference between combining and separating tables? #2092

Open limhasic opened 1 week ago

limhasic commented 1 week ago

What is the difference between combining tables and keeping them separate? Specifically, how does multi-table synthesis compare with joining all the tables and then using the existing single-table synthesis?

  1. Use the Multi Table Metadata API
  2. Split after joining the tables, and use the Single Table Metadata API

Could you please explain the difference?

srinify commented 5 days ago

Hi @limhasic 👋

I'm a bit confused about exactly what you're asking -- do you mind clarifying a bit further? I understood your question to be -- "Why keep tables laid out in a multi-table pattern when I can just combine them into a single table and use SDV that way instead?" If this is incorrect, let me know!

Here are the key differences:

Single Table: Works best when you have a single identifier column (e.g. user_id) that uniquely identifies the entities in your data. If you have other columns with identifier-like properties (e.g. post_id) in the same dataset, then single table models will not learn the relationships between your primary identifier column (user_id) and your secondary one (post_id). Your synthetic data may have rows containing user_id and post_id value pairs that don't exist in your real data.

Multi Table: Supports cases where you have multiple identifier / id columns in your data that have a relational link between them. With Multi Table, you can specify the relationships between identifier columns and SDV will learn to model them more effectively. For example, SDV will maintain referential integrity when generating synthetic data (e.g. the combinations of user_id and post_id will match the same ones in your real data).
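To make the two failure modes above concrete, here is a minimal sketch in plain Python. It does not use the SDV API. The tables, the `synthetic_posts` rows, and the helper names `orphan_foreign_keys` and `unseen_pairs` are all hypothetical, chosen only to show what "value pairs that don't exist in your real data" and "referential integrity" mean in practice.

```python
# Toy tables (hypothetical data, not taken from this issue).
users = [
    {"user_id": "u1", "country": "KR"},
    {"user_id": "u2", "country": "US"},
]
posts = [
    {"post_id": "p1", "user_id": "u1"},
    {"post_id": "p2", "user_id": "u1"},
    {"post_id": "p3", "user_id": "u2"},
]


def orphan_foreign_keys(parent_rows, child_rows, primary_key, foreign_key):
    """Return child foreign-key values that have no matching parent row."""
    parent_ids = {row[primary_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parent_ids)


def unseen_pairs(real_rows, synthetic_rows, columns):
    """Return synthetic value combinations that never occur in the real rows."""
    real = {tuple(row[c] for c in columns) for row in real_rows}
    synthetic = {tuple(row[c] for c in columns) for row in synthetic_rows}
    return sorted(synthetic - real)


# The real tables are consistent: every post points at an existing user.
real_orphans = orphan_foreign_keys(users, posts, "user_id", "user_id")

# Hand-written stand-in for output in which the two id columns were
# generated without any knowledge of how they relate (an assumption
# made for illustration, not actual SDV output).
synthetic_posts = [
    {"post_id": "p1", "user_id": "u2"},  # real ids, but a pairing never seen
    {"post_id": "p2", "user_id": "u9"},  # user_id with no parent row at all
]
synthetic_orphans = orphan_foreign_keys(users, synthetic_posts, "user_id", "user_id")
new_pairs = unseen_pairs(posts, synthetic_posts, ["user_id", "post_id"])

print(real_orphans)
print(synthetic_orphans)
print(new_pairs)
```

A check like `orphan_foreign_keys` is one way to compare the output of the two approaches on your own data: an empty result means every child row still refers to a parent that exists.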

npatki commented 5 days ago

Hi @limhasic,

To add to this, we always recommend using SDV with data that is as close as possible to its original source. The more you modify the data (splitting, joining, etc.), the more logic/dependencies you introduce into your dataset. As a result, it becomes much more difficult for SDV synthesizers to learn the data out-of-the-box, because they must reverse-engineer all the changes that were introduced.
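As a rough illustration of the kind of dependency a join introduces, the plain-Python sketch below joins a toy users table onto a toy posts table. It does not use the SDV API. The data and the helper names `join_on` and `dependency_holds` are hypothetical. After the join, each user's attributes are repeated on every one of their posts, so the combined table carries a hidden rule (one `user_id` always maps to one `country`) that a single-table model would have to rediscover.

```python
# Toy tables (hypothetical data, not taken from this issue).
users = [
    {"user_id": "u1", "country": "KR"},
    {"user_id": "u2", "country": "US"},
]
posts = [
    {"post_id": "p1", "user_id": "u1"},
    {"post_id": "p2", "user_id": "u1"},
    {"post_id": "p3", "user_id": "u2"},
]


def join_on(parent_rows, child_rows, key):
    """Inner-join each child row to its parent row on a shared key column."""
    parents = {row[key]: row for row in parent_rows}
    return [{**parents[row[key]], **row} for row in child_rows if row[key] in parents]


def dependency_holds(rows, determinant, dependent):
    """Check that each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[determinant], row[dependent]) != row[dependent]:
            return False
    return True


joined = join_on(users, posts, "user_id")

# The parent attribute is now duplicated once per child row.
repeated_countries = [row["country"] for row in joined]

# In the joined table the rule "user_id determines country" holds.
rule_in_joined = dependency_holds(joined, "user_id", "country")

# A single hypothetical synthetic row that ignores the rule breaks it.
broken = joined + [{"user_id": "u1", "country": "US", "post_id": "p4"}]
rule_in_broken = dependency_holds(broken, "user_id", "country")

print(repeated_countries)
print(rule_in_joined, rule_in_broken)
```

Keeping the tables separate avoids this: `country` is stored once per user, so there is no repeated value to keep consistent.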

Hope that helps, and as @srinify mentioned, it would be helpful if you can provide an example to help us clarify the question further. Thanks.