gretelai / gretel-synthetics

Synthetic data generators for structured and unstructured text, featuring differentially private learning.
https://gretel.ai/platform/synthetics

Poor training results #129

Closed YYM0093 closed 1 year ago

YYM0093 commented 2 years ago

First of all, thank you for your research results! They are very helpful for my current research. I ran the code in "timeseries_dgan.ipynb", and its results are exactly what I need. So I tried to apply the same processing to my own dataset, but the distribution of the generated data differs substantially from that of the original data. Could you tell me what might be causing this and in what direction I should improve? As shown in the figures, my dataset has 37 attributes in total, all of which are discrete. After visualization, I found that only 1-3 of the attributes met my expectations.

[Figures: distribution plots comparing the generated data with the original data]

kboyd commented 2 years ago

Thanks for trying out the model!

37 attributes is definitely more than we've used in our testing. A few parameter suggestions for the DGANConfig that may help (see the sketch after this list):

  • Be sure use_attribute_discriminator=True; this uses a dedicated discriminator for the attributes and really helps match distributions.
  • Increase attribute_loss_coef to focus the model more on the attributes; maybe try 10.0 and 100.0.
  • Increase attribute_num_layers and attribute_num_units; the 37 attribute distributions may be too complex for the default network size in the attribute part of the generator.
  • Explore epochs, batch_size, and learning_rate; we often find the best performance after doing some parameter sweeps with these.
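For concreteness, here is a minimal sketch of a config along those lines, assuming the DGANConfig fields in gretel_synthetics.timeseries_dgan.config (generator_learning_rate and discriminator_learning_rate are the concrete knobs behind the generic "learning_rate" above). The specific values are illustrative starting points for a sweep, not tuned recommendations:

```python
# Illustrative DGANConfig sketch; values are starting points for a sweep, not tuned settings.
from gretel_synthetics.timeseries_dgan.config import DGANConfig

config = DGANConfig(
    max_sequence_len=60,               # time points per training example
    sample_len=10,                     # points generated per RNN cell; must divide max_sequence_len
    use_attribute_discriminator=True,  # dedicated discriminator for attribute distributions
    attribute_loss_coef=10.0,          # also try 100.0
    attribute_num_layers=5,            # larger attribute network than the defaults
    attribute_num_units=200,
    epochs=10000,                      # small datasets often need many epochs
    batch_size=100,
    generator_learning_rate=1e-4,      # sweep learning rates along with batch_size and epochs
    discriminator_learning_rate=1e-4,
)
```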

How many training examples do you have? Especially on smaller (<1000 examples) data sets, it may need 10,000+ epochs to effectively train the weights in the model.

If you're able to share your training data or a notebook, we might have some more specific suggestions.

YYM0093 commented 2 years ago

Thank you very much for your reply! I have 800 samples, and I want to generate 1,000. I will try your suggestions and report back as soon as I have good results. Thanks again for your reply.

YYM0093 commented 2 years ago

I just uploaded my notebook, which demonstrates a simple example; I hope it is clear. Thank you again for your help! https://github.com/YYM0093/Gretel.ai-for-my-work

kboyd commented 1 year ago

Thanks, the example notebook is very helpful to see what's happening here!

First, my suggestions about use_attribute_discriminator, attribute_loss_coef, attribute_num_layers, and attribute_num_units are not going to be useful. Sorry about that! I somewhat misunderstood what you were modeling. The DoppelGANger paper uses a very strict definition of attributes and features. Since all of your variables vary over time, none of them are attributes as the term is used in the model. For more info, see the documentation for attribute_columns and feature_columns.
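To illustrate the distinction, here is a rough sketch using the long-format train_dataframe API; the toy DataFrame and its column names ("sensor_id", "region", and so on) are hypothetical:

```python
# Sketch: attributes are fixed per example; features vary at every time point.
import pandas as pd

from gretel_synthetics.timeseries_dgan.config import DGANConfig, DfStyle
from gretel_synthetics.timeseries_dgan.dgan import DGAN

# Hypothetical long-format data: one row per (example, time point) pair.
df = pd.DataFrame({
    "sensor_id": [0, 0, 1, 1],
    "timestamp": [0, 1, 0, 1],
    "region": ["a", "a", "b", "b"],      # constant within each example -> attribute
    "num_announcements": [5, 3, 7, 2],   # changes over time -> feature
})

model = DGAN(DGANConfig(max_sequence_len=2, sample_len=1))
model.train_dataframe(
    df,
    df_style=DfStyle.LONG,
    example_id_column="sensor_id",          # groups rows into examples
    time_column="timestamp",
    attribute_columns=["region"],           # fixed per example
    feature_columns=["num_announcements"],  # varies over time
)
```

In your data, every variable varies over time, so they all belong in feature_columns, and the attribute-specific parameters have no effect.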

The main challenge here is that the time series model in your notebook has only 14 examples to train from. While there are 869 rows in the input table with Label=1, the time series model groups them into examples of 60 time points each. Fourteen examples of the time-series dynamics is a really small amount of data for the neural network to train on effectively.
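The arithmetic, spelled out (assuming non-overlapping windows, consistent with 14 examples from 869 rows):

```python
# 869 rows with Label=1, chunked into non-overlapping 60-time-point examples
rows, seq_len = 869, 60
num_examples = rows // seq_len  # 14; the 29 leftover rows don't form a complete example
```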

But before just trying to throw more data at the problem, we should first make sure the time series setup is going to be useful for your downstream task. What do you want to do with this synthetic data after you generate it?

The current model will generate 60-minute-long samples, but it won't directly recreate the gradual drop over 14 hours in a variable like "Number of announcements". So the current setup may or may not be capable of meeting your goals for the synthetic data. Happy to discuss further on this issue whether this setup works or whether a different one would be more effective.
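For example (purely illustrative, not a recommendation), capturing the 14-hour trend directly would mean longer training examples, which with minute-level data leaves even fewer examples to train on:

```python
# Hypothetical alternative: model 14-hour windows instead of 60-minute ones.
config = DGANConfig(
    max_sequence_len=14 * 60,  # 840 time points per example
    sample_len=60,             # must evenly divide max_sequence_len
)
```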

Also, we just released this model as part of Gretel's API, so besides using this gretel-synthetics repo, another option is to run jobs with our SDK on Gretel Cloud. For more info, check out the announcement blog, the model docs at docs.gretel.ai, and the blueprint with sample code using our SDK.

YYM0093 commented 1 year ago

Thank you very much for your answer; it resolves my doubts. Gretel is really a cool invention, and I clearly don't yet have a detailed understanding of it. I will adjust my code according to your suggestions, and if my research goes well, I will cite your work in my paper. Thank you again for your answer.

kboyd commented 1 year ago

Thanks for your interest in Gretel. I'll close this GitHub issue for now, but feel free to reopen it if you have additional questions.