sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Transformers for context columns are not allowed to be updated #2111

Closed Pavamana15 closed 2 months ago

Pavamana15 commented 2 months ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

I used one of the available multi-sequence data sets online to generate a synthetic dataset. But I am getting the following errors.

Steps to reproduce

[Attached images: DATACEBO, DATACEBO_1]
Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

npatki commented 2 months ago

Hi @Pavamana15, thanks for filing this issue and providing more details. I think the error message is a bit misleading. The root cause is that you are supplying the same column (Name) as both the sequence key and a context column.

For multi-sequence data, a context column cannot be the same as the sequence key. The sequence key identifies each sequence, so by definition it never varies within a sequence. A context column, however, is typically another column (not the sequence key) that remains constant within a sequence. Removing Name from the context columns should fix your issue.
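To make the rule concrete, here is a minimal sketch in plain Python of the two checks described above: the sequence key must not be listed as a context column, and each context column should hold a single value within every sequence. The helper `check_context_columns` is hypothetical and is not part of SDV, and the `Sector` and `Price` columns are made up for illustration; only `Name` comes from this issue.

```python
from collections import defaultdict


def check_context_columns(rows, sequence_key, context_columns):
    """Hypothetical helper (not part of SDV): validate context columns.

    Raises ValueError if the sequence key is listed as a context column.
    Returns the context columns whose value changes within some sequence.
    """
    if sequence_key in context_columns:
        raise ValueError(
            f"'{sequence_key}' is the sequence key and cannot also be a context column"
        )

    # Collect the distinct values each context column takes per sequence.
    seen = defaultdict(set)
    for row in rows:
        for column in context_columns:
            seen[(row[sequence_key], column)].add(row[column])

    # A valid context column has exactly one value per sequence.
    varying = {column for (_, column), values in seen.items() if len(values) > 1}
    return sorted(varying)


# Made-up data: two sequences identified by Name.
rows = [
    {"Name": "A", "Sector": "Tech", "Price": 10.0},
    {"Name": "A", "Sector": "Tech", "Price": 10.5},
    {"Name": "B", "Sector": "Energy", "Price": 7.2},
    {"Name": "B", "Sector": "Energy", "Price": 7.1},
]

print(check_context_columns(rows, "Name", ["Sector"]))           # → []
print(check_context_columns(rows, "Name", ["Sector", "Price"]))  # → ['Price']
```

Passing `["Name"]` as the context columns raises a `ValueError`, which mirrors the situation in this issue. Running a check like this before fitting may surface the problem with a clearer message than the one reported here.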

Resources:

Pavamana15 commented 2 months ago

Thank you so much @npatki. The error is resolved now, but it is taking a lot of time to run.

Pavamana15 commented 2 months ago

@npatki I was able to generate synthetic data using PARSynthesizer with fewer epochs (30). However, the synthetic rows are not in sequential order. The original data had rows ordered in time, and I expected the synthetic rows to be ordered in time as well. Why am I getting this result, or what mistake am I making?

Pavamana15 commented 2 months ago

@npatki Can you also explain how to evaluate the quality of synthetic data?

npatki commented 2 months ago

Hi @Pavamana15 here on GitHub, we typically file a separate issue for each topic you'd like to discuss. This helps keep the GitHub organized for other users who may have similar issues, and for tracking these in the future. With this in mind, will you please file new issues for the two new topics of performance testing and quality? Appreciate your help in keeping our GitHub organized.

I will close out this initial issue because it seems like the original problem (error that you were seeing) has been resolved. Thanks.