Transformers for context columns are not allowed to be updated

sdv-dev / SDV

Synthetic data generation for tabular data

https://docs.sdv.dev/sdv

Transformers for context columns are not allowed to be updated #2111

Closed Pavamana15 closed 2 months ago

Pavamana15 commented 2 months ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version:
Python version:
Operating System:

Error Description

I used one of the multi-sequence datasets available online to generate a synthetic dataset, but I am getting the following errors.

Steps to reproduce

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

npatki commented 2 months ago

Hi @Pavamana15 thanks for filing this issue and providing more details. I think the error message is a bit misleading. The root cause of the issue is that you are supplying the same column (Name) as both the sequence key and a context column.

For multi-sequence data, a context column cannot be the same column as the sequence key. The sequence key identifies each sequence, so by definition it never varies within a sequence. A context column, by contrast, is typically a different column (not the sequence key) that also remains constant within a sequence. Removing the context column should fix your issue.
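To make the two rules above concrete, here is a minimal sketch of a pre-flight check in plain Python. It is not part of SDV: the helper `check_context_columns` is hypothetical, and every column name other than `Name` (`Sector`, `Date`, `Price`) is made up for illustration. It flags a context column that duplicates the sequence key, and any context column that varies within a sequence.

```python
from collections import defaultdict


def check_context_columns(rows, sequence_key, context_columns):
    """Hypothetical helper (not an SDV function).

    rows: list of dicts, one per row of the multi-sequence table.
    Returns a list of human-readable problems; an empty list means OK.
    """
    problems = []

    # Rule 1: the sequence key cannot double as a context column.
    if sequence_key in context_columns:
        problems.append(
            f"'{sequence_key}' is the sequence key and cannot also be a context column"
        )

    # Rule 2: every other context column must hold one value per sequence.
    seen = defaultdict(set)  # (sequence id, column) -> distinct values seen
    for row in rows:
        for column in context_columns:
            if column != sequence_key:
                seen[(row[sequence_key], column)].add(row[column])
    varying = sorted({col for (_, col), values in seen.items() if len(values) > 1})
    for column in varying:
        problems.append(f"'{column}' varies within a sequence")

    return problems


# Toy data: only the 'Name' column comes from the issue; the rest is assumed.
rows = [
    {"Name": "A", "Sector": "Tech", "Date": "2024-01-01", "Price": 10.0},
    {"Name": "A", "Sector": "Tech", "Date": "2024-01-02", "Price": 10.5},
    {"Name": "B", "Sector": "Energy", "Date": "2024-01-01", "Price": 20.0},
    {"Name": "B", "Sector": "Energy", "Date": "2024-01-02", "Price": 19.0},
]

# Same mistake as in this issue: the sequence key reused as a context column.
bad = check_context_columns(rows, sequence_key="Name", context_columns=["Name"])

# A column that is constant within each sequence passes both checks.
good = check_context_columns(rows, sequence_key="Name", context_columns=["Sector"])
```

If the list comes back empty, the same column names should be safe to pass as `context_columns` to the synthesizer; if not, drop the offending columns first.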

Resources:

Duplicate issue found in #2097. We will provide better messaging for this.
This tutorial is a good resource for understanding multi-sequence data

Pavamana15 commented 2 months ago

Thank you so much @npatki. The error is resolved now, but it is taking a long time to run.

Pavamana15 commented 2 months ago

@npatki I was able to generate synthetic data using PARSynthesizer with fewer epochs (30). However, the synthetic rows are not in sequential order: they should be ordered in time, as the rows in the original data were. Why is this happening, or what mistake am I making?

Pavamana15 commented 2 months ago

@npatki Can you also explain how to evaluate the quality of synthetic data?

npatki commented 2 months ago

Hi @Pavamana15 here on GitHub, we typically file a separate issue for each topic you'd like to discuss. This helps keep the GitHub organized for other users who may have similar issues, and for tracking these in the future. With this in mind, will you please file new issues for the two new topics of performance testing and quality? Appreciate your help in keeping our GitHub organized.

I will close out this initial issue because it seems like the original problem (error that you were seeing) has been resolved. Thanks.