sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

PARSynthesizer is not learning rounding scheme for numerical columns #2274

Closed npatki closed 1 week ago

npatki commented 3 weeks ago

Environment Details

Error Description

First observed in #2241: If I have a numerical column in sequential data that follows a particular rounding scheme, I would expect every SDV synthesizer to learn that scheme and produce synthetic data rounded the same way. This is not the case for PARSynthesizer.

Steps to reproduce

In the example below, the numerical column col_A is always rounded to 2 decimal places. Observe that the synthetic data produced by PARSynthesizer does not follow that scheme.

import pandas as pd
import numpy as np

from sdv.metadata import Metadata
from sdv.sequential import PARSynthesizer

data = pd.DataFrame(data={
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
    'col_A': [5000.23, 4500.23, 4300.45, 2300.11, 3212.31, np.nan, 3456.34, 7890.12, 8201.00, 9810.12]
})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'sequence_key': 'id',
            'columns': {
                'id': { 'sdtype': 'id' },
                'col_A': { 'sdtype': 'numerical'}
            }
        },
    }
})

synthesizer = PARSynthesizer(metadata, epochs=1)
synthesizer.fit(data)
synthesizer.sample(num_sequences=2)
[image: screenshot of the sampled output]
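Since the screenshot is not reproduced here, the rounding can also be checked programmatically. The following is a minimal sketch, not part of SDV: the helpers `decimal_places` and `max_decimal_places` are hypothetical names introduced for illustration, and they use only the Python standard library. They count the digits after the decimal point in each value's shortest float representation and ignore missing values.

```python
import math
from decimal import Decimal


def decimal_places(value):
    """Count digits after the decimal point in the shortest repr of a float.

    Hypothetical helper for illustration; not an SDV function.
    """
    # normalize() strips trailing zeros, so 8201.0 counts as 0 decimal places.
    exponent = Decimal(repr(float(value))).normalize().as_tuple().exponent
    return max(0, -exponent)


def max_decimal_places(values):
    """Largest number of decimal places among the finite values."""
    places = [decimal_places(v) for v in values if math.isfinite(v)]
    return max(places, default=0)


# Same values as col_A in the reproduction above.
real_col_a = [5000.23, 4500.23, 4300.45, 2300.11, 3212.31,
              float('nan'), 3456.34, 7890.12, 8201.00, 9810.12]

print(max_decimal_places(real_col_a))  # prints 2
```

Passing the sampled column to the same function (for example `max_decimal_places(synthetic['col_A'])`, assuming `synthetic` holds the result of `synthesizer.sample(...)`) should also return 2 if the rounding scheme was learned.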

Additional Context

Observe also that other synthesizers, such as GaussianCopulaSynthesizer, correctly learn the rounding scheme and produce synthetic data that is correctly formatted.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthesizer.sample(num_rows=5)
[image: screenshot of the sampled output]
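Until the rounding is learned by the synthesizer itself, one possible stopgap is to round the sampled values after the fact. The sketch below is an assumption on my part and not the fix applied in SDV: `round_to_scheme` is a hypothetical helper, and the number of digits (2 here) is supplied by hand rather than learned. The unrounded input value is made up for illustration and is not actual PARSynthesizer output.

```python
import math


def round_to_scheme(values, digits):
    """Round each finite value to `digits` decimal places, keeping NaN as-is.

    Hypothetical post-processing helper; not an SDV function.
    """
    return [round(v, digits) if math.isfinite(v) else v for v in values]


# Made-up unrounded values standing in for a sampled column.
unrounded = [4512.8374651, float('nan'), 3300.1]
rounded = round_to_scheme(unrounded, digits=2)
print(rounded[0])  # prints 4512.84
```

With a pandas DataFrame the same effect can be had with `synthetic['col_A'].round(2)`, again assuming `synthetic` is the sampled output.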