sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

the PARSynthesizer data generation #2086

Open Myprojectjoy opened 1 week ago

Myprojectjoy commented 1 week ago

Problem description

For my data I set metadata.set_sequence_index(column_name='Date'), but what should I declare for metadata.set_sequence_key(column_name='???')?

What I already tried

SynthesizerInputError: The PARSythesizer is designed for multi-sequence data, identifiable through a sequence key. Your metadata does not include a sequence key.

[image: original data]

[image] After dropping unnecessary columns, this is the final data.

I want to generate new synthetic data for the column "Grid Active Power [kW]".

npatki commented 1 week ago

Hi @Myprojectjoy, the PARSynthesizer is only meant for multi-sequence data, where you have more than one sequence. If you cannot easily locate a sequence key, perhaps it's an indication that your data is just one long sequence (rather than multiple sequences). In this case, unfortunately, I don't believe your data is suitable for PARSynthesizer.

Could you describe further what this data is about, and what you’re hoping to do with synthetic data?

For more resources and an explanation of what multi-sequence data is, I'd recommend going through this tutorial.

Myprojectjoy commented 1 week ago

This is wind turbine data: it has the date and time along with the generated power values. In some cases the generated power is 0 or negative, which happens when the turbine is not operational. I want to generate such data and then use predictive modeling for prediction.

For my data I set metadata.set_sequence_index(column_name='Date'), but what should I declare for metadata.set_sequence_key(column_name='???')?

npatki commented 1 week ago

Hi @Myprojectjoy, thanks for your explanation. If you just have a single sequence of data, then there would be nothing to declare as a sequence key. I think in this case the PARSynthesizer is unfortunately not suitable for your data.

I'd strongly recommend going through this tutorial to better understand the concept of multi-sequence data and see if it applies to your data.
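
To make the distinction in this thread concrete, here is a minimal sketch in plain Python (no SDV calls) of what a sequence key does: it splits the rows into several independent sequences. The "Turbine ID" column, the sample values, and the helper `split_into_sequences` are all hypothetical and invented for illustration. The original data, as described above, only has a date column and a power column.

```python
from collections import defaultdict


def split_into_sequences(rows, key_column):
    """Group rows by the value in key_column; each group is one sequence."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row[key_column]].append(row)
    return dict(sequences)


# Hypothetical multi-sequence data: two turbines, each with its own time series.
# "Turbine ID" is an assumed column and is NOT part of the data in this issue.
multi_sequence_rows = [
    {"Turbine ID": "T1", "Date": "2024-01-01 00:00", "Grid Active Power [kW]": 120.5},
    {"Turbine ID": "T1", "Date": "2024-01-01 00:10", "Grid Active Power [kW]": 0.0},
    {"Turbine ID": "T2", "Date": "2024-01-01 00:00", "Grid Active Power [kW]": 98.2},
    {"Turbine ID": "T2", "Date": "2024-01-01 00:10", "Grid Active Power [kW]": -1.3},
]

# Single-sequence data shaped like the data described in this issue:
# one turbine, one long series, only a date column and a value column.
single_sequence_rows = [
    {"Date": "2024-01-01 00:00", "Grid Active Power [kW]": 120.5},
    {"Date": "2024-01-01 00:10", "Grid Active Power [kW]": 0.0},
    {"Date": "2024-01-01 00:20", "Grid Active Power [kW]": -1.3},
]

# With a key column, the rows split into one sequence per turbine.
sequences = split_into_sequences(multi_sequence_rows, "Turbine ID")
sequence_lengths = {key: len(rows) for key, rows in sequences.items()}
print(sequence_lengths)

# In the single-sequence case, no column is left over once the sequence index
# and the measured value are set aside, so nothing can act as a sequence key.
leftover_columns = [
    column
    for column in single_sequence_rows[0]
    if column not in ("Date", "Grid Active Power [kW]")
]
print(leftover_columns)
```

If the data did contain a column like the assumed "Turbine ID", that column is what would be passed to metadata.set_sequence_key(column_name=...). With only 'Date' and "Grid Active Power [kW]", there is no such column, which is why the error above is raised.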