sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Setting random state ahead of sampling to vary synthesised data #2177

Closed uros-r closed 2 weeks ago

uros-r commented 1 month ago

Problem description

Hi - I'm looking for a way to seed the generation of synthetic data to produce different samples repeatedly.

Looks like this used to be supported in past versions of the library, and may still be, but I can't get it to work with the current version. May well be missing something obvious.

What I already tried

Using the getting started example with SDV 1.15:

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
synthetic_data = synthesizer.sample(num_rows=500)

display(synthetic_data.head(2))

I've looked at the docs and tried setting the global np.random seed and the torch seed (as recommended in a now-dated issue).

Also tried setting FIXED_RNG_SEED in base.py to a different value.

In all cases, synthetic_data remains identical.

Appreciate any help.

srinify commented 1 month ago

Hi there @uros-r, to help me provide the best guidance, could you share more about your use case for controlling randomization?

Every time you sample from the same GaussianCopulaSynthesizer, you'll get new, random synthetic data. If you run the following code after your code example, s1 and s2 will have different values, as you are probably already aware!

synthesizer.fit(data=real_data)
s1 = synthesizer.sample(num_rows=500)
s2 = synthesizer.sample(num_rows=500)

You can call synthesizer.reset_sampling() to reset the randomization state to what it was right after the synthesizer was fit. If you run the following code, s1 and s2 will be the same data.

synthesizer.fit(data=real_data)
s1 = synthesizer.sample(num_rows=500)
synthesizer.reset_sampling()
s2 = synthesizer.sample(num_rows=500)

You get a decent amount of control using these two methods, but more context on your use case would help!
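
One way these two methods might be combined into repeatable variations is to reset the sampler and then discard a fixed number of draws before keeping one. The sketch below is illustrative only: sample_version is a hypothetical helper, not part of SDV, and ToySynthesizer is a stand-in that mimics just the behaviour described above (successive sample() calls differ, reset_sampling() rewinds to the post-fit state) so that the example runs without SDV installed. It assumes that a real synthesizer replays the same sequence of samples after each reset when num_rows is unchanged, which is worth verifying before relying on it.

```python
import random


class ToySynthesizer:
    """Stand-in for a fitted synthesizer (NOT SDV code).

    It only mimics the behaviour described in this thread: each sample()
    call advances an internal random state, and reset_sampling() rewinds
    that state to where it was right after fitting.
    """

    _FIT_SEED = 73251  # arbitrary value chosen for this toy

    def __init__(self):
        self._rng = random.Random(self._FIT_SEED)

    def sample(self, num_rows):
        return [self._rng.random() for _ in range(num_rows)]

    def reset_sampling(self):
        self._rng = random.Random(self._FIT_SEED)


def sample_version(synthesizer, version, num_rows):
    """Hypothetical helper: return the `version`-th sample after a reset.

    version=0 is the first sample after reset_sampling(), version=1 the
    second, and so on. Earlier draws are generated and thrown away, so
    the cost grows linearly with `version`.
    """
    if version < 0:
        raise ValueError("version must be non-negative")
    synthesizer.reset_sampling()
    for _ in range(version):
        synthesizer.sample(num_rows=num_rows)  # discard earlier draws
    return synthesizer.sample(num_rows=num_rows)


synth = ToySynthesizer()
v0 = sample_version(synth, version=0, num_rows=5)
v1 = sample_version(synth, version=1, num_rows=5)
v1_again = sample_version(synth, version=1, num_rows=5)
```

Here v0 and v1 are different draws, while v1 and v1_again match because both start from the same reset state. With SDV, a fitted GaussianCopulaSynthesizer would take the place of ToySynthesizer, since the helper only calls reset_sampling() and sample(num_rows=...).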

srinify commented 3 weeks ago

Hi there @uros-r just following up :)

uros-r commented 2 weeks ago

Hey @srinify - thanks for the suggestions, much appreciated.

Our use case is a web-based tool that lets users interactively generate one or more anonymised versions of a given source dataset.

For this, we came up with the workaround of using server-side session state to create and reuse synthesizer objects. This allows multiple calls to .sample() that produce different results, as you suggested.
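
A minimal sketch of this kind of session-state workaround might look like the following. All names here (SynthesizerSessions, ToySynthesizer, the session ids) are hypothetical and chosen for illustration; the toy class stands in for a fitted SDV synthesizer so the example runs on its own, and a real web framework would supply its own session mechanism.

```python
import random


class ToySynthesizer:
    """Stand-in for a fitted SDV synthesizer (NOT SDV code)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def sample(self, num_rows):
        return [self._rng.random() for _ in range(num_rows)]


class SynthesizerSessions:
    """Hypothetical per-session store of synthesizer objects."""

    def __init__(self, factory):
        self._factory = factory   # callable that builds (and fits) a synthesizer
        self._by_session = {}     # session id -> synthesizer

    def sample(self, session_id, num_rows):
        # Create the synthesizer on first use, then keep reusing it so
        # its sampling state carries over between requests.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._factory()
        return self._by_session[session_id].sample(num_rows=num_rows)

    def end_session(self, session_id):
        # Drop the stored synthesizer; a later request starts afresh.
        self._by_session.pop(session_id, None)


sessions = SynthesizerSessions(factory=ToySynthesizer)
first = sessions.sample("user-a", num_rows=3)
second = sessions.sample("user-a", num_rows=3)
other = sessions.sample("user-b", num_rows=3)
```

Within one session, first and second differ because the same object keeps advancing its state. In this toy, a new session ("user-b") starts from the same initial state, so other equals first; whether that is desirable depends on the application.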