sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

GaussianCopula generates Duplicate Samples #2265

Open MiladRadInDash opened 1 week ago

MiladRadInDash commented 1 week ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

I need to generate multiple synthetic tables in parallel and concatenate them together. When I try this with concurrency or with a simple for loop, most of the time I get back near-identical samples, which defeats the purpose.
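Roughly what I am doing, as a simplified sketch (the file name and chunk sizes are illustrative; 'synthesizer.pkl' stands for a previously fitted and saved GaussianCopulaSynthesizer):

import pandas as pd
from concurrent.futures import ProcessPoolExecutor
from sdv.single_table import GaussianCopulaSynthesizer

def sample_chunk(path, num_rows):
    # every worker loads the same fitted synthesizer, i.e. the same starting state
    synthesizer = GaussianCopulaSynthesizer.load(path)
    return synthesizer.sample(num_rows=num_rows)

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        chunks = list(pool.map(sample_chunk, ['synthesizer.pkl'] * 4, [250] * 4))
    combined = pd.concat(chunks, ignore_index=True)  # the chunks come back (nearly) identical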

What I already tried

I have tried changing the number of samples in each batch, but still have had no luck.

npatki commented 1 week ago

Hi @MiladRadInDash, nice to meet you.

Currently, all of our publicly-available synthesizers are designed to generate data in a deterministic way. This is why we have methods such as reset_sampling, which allow you to get back to the 0-state (the state right after a synthesizer is fit).

from sdv.single_table import GaussianCopulaSynthesizer

# `metadata` is an SDV SingleTableMetadata describing `data` (both assumed to already exist)
my_synthesizer = GaussianCopulaSynthesizer(metadata)
my_synthesizer.fit(data)

synthetic1 = my_synthesizer.sample(num_rows=100)
synthetic2 = my_synthesizer.sample(num_rows=100)  # continues the sequence, so it differs from synthetic1

my_synthesizer.reset_sampling()  # reset to the original state, right after fitting

synthetic3 = my_synthesizer.sample(num_rows=100)  # identical to synthetic1
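One thing that follows from this sequential behavior (just a sketch, not an official recommendation): because every call continues the sequence, you can draw all the rows you need in a single call and split the result afterwards, and the chunks will be separate draws rather than repeats of each other.

big_sample = my_synthesizer.sample(num_rows=400)  # one sequential draw of all the rows
chunks = [big_sample.iloc[i:i + 100] for i in range(0, 400, 100)]  # four 100-row chunks, each a different part of the sequence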

I understand this may not be entirely useful when concurrency is desired. Some users in the past have had success manually unsetting the seed; as an example, see https://github.com/sdv-dev/SDV/issues/1483.

But we can also consider this as a feature request to support natively.

To help us prioritize, it would be useful to know more about why you need to generate the synthetic data in parallel. Which synthesizer are you using? Are you exploring parallelization because you are finding sampling too slow, or is there another reason it would be useful?