sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Multitable Child Table Size #2154

Closed ldhlong closed 2 months ago

ldhlong commented 3 months ago

Environment details

SDV version: 1.11.0
Python version: 3.11.8
Operating System: Windows 11

Problem description

I’m interested in specifying the number of rows in a child (foreign key) table in multi-table synthesis. I know the scale determines the size of every parent table, but is there a way to specify the size you would like the child table to be?

What I already tried

I’ve tried increasing the scale, but I can’t find a relationship between the scale value and the size of the child table. The size of the child table almost seems random. How does SDV determine the size of the child table in this case?

npatki commented 3 months ago

Hi @ldhlong,

The short answer is that the HMASynthesizer will algorithmically determine the size of the child table based on the cardinality (branching factor) observed in the real data. For example, let's say that in the real data every parent row had either 4 or 5 children (40% of parent rows had 4 children and 60% had 5 children). In theory, HMA will be able to learn that pattern and emulate it. Of course, there is a bit of randomness here (i.e. maybe in the synthetic data, about 45% of parent rows get 4 children each instead of 40%).
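To make the 40%/60% example concrete, here is a rough, standard-library-only sketch of why the child table size follows from the parent count and the learned cardinality rather than being set directly. It is not HMA's actual algorithm: the function name `simulate_child_table_size` is hypothetical, and it assumes that the number of synthetic parent rows is simply the real parent count multiplied by `scale`, with each parent's child count drawn independently from the observed distribution.

```python
import random


def simulate_child_table_size(children_per_parent, scale, seed=None):
    """Toy illustration (not HMA itself) of how child table size emerges.

    children_per_parent: one integer per real parent row, giving how many
    child rows reference that parent.
    scale: multiplier applied to the number of parent rows (assumption).
    """
    rng = random.Random(seed)
    n_synthetic_parents = round(len(children_per_parent) * scale)
    # Each synthetic parent draws its child count from the empirical
    # distribution observed in the real data.
    draws = rng.choices(children_per_parent, k=n_synthetic_parents)
    return n_synthetic_parents, sum(draws)


# Real data from the example: 40% of parents have 4 children, 60% have 5.
real_counts = [4] * 40 + [5] * 60

n_parents, n_children = simulate_child_table_size(real_counts, scale=2.0, seed=0)

# The child table size is random, but it clusters around
# (number of synthetic parents) x (mean children per parent).
expected_children = n_parents * sum(real_counts) / len(real_counts)
print(n_parents, n_children, expected_children)
```

Under these assumptions the child row count varies from run to run but stays close to the expected value, which is why it can look random without being unrelated to the scale.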

To see this in action, I would recommend visualizing the cardinality. In this plot, the synthetic data is meant to look similar to the real data. Please share your results if you see otherwise.
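If a plot is not convenient, the same comparison can be done numerically. The sketch below is a minimal, standard-library-only way to compute the cardinality distribution of one parent/child relationship; the helper name `cardinality_distribution` and the toy key values are hypothetical and are not part of SDV. Running it once on the real tables and once on the synthetic tables gives two distributions that should look similar.

```python
from collections import Counter


def cardinality_distribution(parent_ids, child_foreign_keys):
    """Return {number_of_children: fraction_of_parent_rows}.

    parent_ids: primary key values of the parent table.
    child_foreign_keys: foreign key values of the child table.
    """
    if not parent_ids:
        raise ValueError("parent_ids must not be empty")
    children_per_parent = Counter(child_foreign_keys)
    # Counter returns 0 for missing keys, so parents with no children
    # are counted in the 0 bucket.
    histogram = Counter(children_per_parent[pid] for pid in parent_ids)
    total = len(parent_ids)
    return {n: count / total for n, count in sorted(histogram.items())}


# Toy example: parent "c" has no children at all.
real_parents = ["a", "b", "c", "d"]
real_children_fk = ["a", "a", "b", "b", "b", "d"]

real_dist = cardinality_distribution(real_parents, real_children_fk)
print(real_dist)
```

With pandas tables, the two arguments would be the parent's primary key column and the child's foreign key column, each converted to a list.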

Unfortunately, HMA's algorithm won't easily allow for fine-grained control over the exact number of rows in child tables. May I ask what use case is creating these requirements for you? And what kind of scale are you looking for: are you trying to scale up or scale down the original data?

ldhlong commented 3 months ago

The goal was to upscale. Thank you.

npatki commented 3 months ago

Thanks @ldhlong. So the goal was to upscale, but you were not receiving enough rows in the child table when doing so? It would be helpful if you could create and share the cardinality visualizations so that we can take a look and make sure this is not a bug.

npatki commented 2 months ago

Hi @ldhlong, do you still have any questions about this? I'm going to close the issue since we've answered your initial question and there hasn't been any activity. But please feel free to reply, and we can always reopen it to continue the discussion.

For other topics or help with troubleshooting, you can always file a new issue as well. Thanks.