sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Context column cannot be a sequence key: Need better error message for this case #2097

Open npatki opened 5 days ago

Environment Details

Error Description

For sequential data, it should not be possible for the sequence key to be the same as a context column. This is because the sequence key is an identifier for each sequence, whereas a context column is just another column that happens to never vary within a sequence. There is no need to declare a sequence key as a context column because a sequence key is already guaranteed not to vary within a sequence -- rather, it is defining what a sequence is.
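To make the distinction concrete, here is a minimal sketch in plain Python (no SDV involved) that checks which columns stay constant within each sequence. The helper name `constant_within_sequences` is hypothetical and the rows mirror the reproduction data in this issue, with the datetime column left out for brevity. The sequence key passes the check trivially, which is why declaring it as a context column adds no information.

```python
from collections import defaultdict

# Rows as (A, C, D) tuples; 'A' is the sequence key.
columns = ['A', 'C', 'D']
rows = [
    (0, 12, 'Yes'), (0, 13, 'Yes'), (0, 34, 'Yes'),
    (1, 10, 'No'), (1, 45, 'No'), (1, 21, 'No'),
]


def constant_within_sequences(rows, columns, sequence_key, column):
    """Return True if `column` takes a single value inside every sequence."""
    key_idx = columns.index(sequence_key)
    col_idx = columns.index(column)
    values_per_sequence = defaultdict(set)
    for row in rows:
        values_per_sequence[row[key_idx]].add(row[col_idx])
    return all(len(values) == 1 for values in values_per_sequence.values())


is_constant = {
    col: constant_within_sequences(rows, columns, 'A', col) for col in columns
}
print(is_constant)  # {'A': True, 'C': False, 'D': True}
```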

Yet the code allows me to instantiate a PARSynthesizer with the sequence key listed as a context column. The problem only surfaces when I try to fit it, and the error I get is about transformers, which does not point to the actual cause.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'A': { 'sdtype': 'id' },
        'B': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d' },
        'C': { 'sdtype': 'numerical' },
        'D': { 'sdtype': 'categorical' }
    },
    'sequence_key': 'A'
})

data = pd.DataFrame(data={
    'A': [0, 0, 0, 1, 1, 1],
    'B': ['2020-03-02', '2020-03-04', '2020-03-05', '2020-03-01', '2020-03-03', '2020-03-06'],
    'C': [12, 13, 34, 10, 45, 21],
    'D': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']
})

synth = PARSynthesizer(metadata, context_columns=['A'])
synth.fit(data)

Error:

/usr/local/lib/python3.10/dist-packages/sdv/sequential/par.py in update_transformers(self, column_name_to_transformer)
    298         """
    299         if set(column_name_to_transformer).intersection(set(self.context_columns)):
--> 300             raise SynthesizerInputError(
    301                 'Transformers for context columns are not allowed to be updated.')
    302 

SynthesizerInputError: Transformers for context columns are not allowed to be updated.

Expected Behavior

I should not be allowed to even instantiate a PARSynthesizer if the sequence key is among the context columns. This should immediately raise an error explaining that it is not allowed.

synth = PARSynthesizer(metadata, context_columns=['A'])
SynthesizerInputError: The sequence key ('A') cannot be a context column. To proceed, please remove the sequence key from the 'context_columns' parameter.
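One possible shape for the check, as a rough sketch rather than SDV's actual implementation: the helper `validate_context_columns` is hypothetical, and `SynthesizerInputError` is redefined locally as a stand-in so the snippet runs without SDV installed. In the library, a check like this would presumably run when the synthesizer is constructed, using the sequence key from the metadata.

```python
class SynthesizerInputError(Exception):
    """Local stand-in for SDV's error class (assumption: keeps the sketch standalone)."""


def validate_context_columns(sequence_key, context_columns):
    """Hypothetical helper: reject a sequence key listed among the context columns."""
    if sequence_key is not None and sequence_key in (context_columns or []):
        raise SynthesizerInputError(
            f"The sequence key ('{sequence_key}') cannot be a context column. "
            "To proceed, please remove the sequence key from the "
            "'context_columns' parameter."
        )


# A regular context column is accepted without complaint.
validate_context_columns('A', ['D'])

# The sequence key as a context column is rejected up front.
try:
    validate_context_columns('A', ['A'])
except SynthesizerInputError as error:
    message = str(error)
    print(message)
```

Failing at instantiation keeps the error next to the argument that caused it, instead of deferring it to an unrelated transformer check during fit.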