gretelai / gretel-synthetics

Synthetic data generators for structured and unstructured text, featuring differentially private learning.
https://gretel.ai/platform/synthetics

Sample_len Value #142

Closed FrancisNji closed 1 year ago

FrancisNji commented 1 year ago

Are you reporting a bug or FR? No.

What version of synthetics are you using?

**What is the variable "Sample_len" and what does its value stand for? How is this value chosen?**

**I am using a GPU.**

**What environment are you working in?**

**What version of python are you using?** 3.9

**Describe the shape / types of the data you are training on** 15,884 x 11, floating-point numbers. The data is daily, and each entry represents one day.

Also, is there a published paper describing this work/approach? If so, please provide a link to the paper.

***I would be very grateful for a prompt response.***

kboyd commented 1 year ago

Thanks for the question @Frankie0609! I'm guessing this question is for the timeseries_dgan model. Let me know if that's not the case.

In our DGAN model, sample_len controls some internals of how we model a time series. The max_sequence_len parameter is how many time points are in each of your example time series. sample_len needs to divide max_sequence_len evenly, and is used to implicitly split the sequence into smaller chunks for the model to work with. Specifically, DGAN uses an RNN architecture and sample_len is how many time points are generated from each cell of the RNN.
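To make the divisibility constraint concrete, here is a minimal sketch in plain Python. The helper names `valid_sample_lens` and `rnn_steps` are hypothetical and not part of gretel-synthetics; they only illustrate the arithmetic described above, assuming each RNN cell emits exactly sample_len time points.

```python
from typing import List


def valid_sample_lens(max_sequence_len: int) -> List[int]:
    """Return every sample_len that divides max_sequence_len evenly."""
    return [s for s in range(1, max_sequence_len + 1) if max_sequence_len % s == 0]


def rnn_steps(max_sequence_len: int, sample_len: int) -> int:
    """Number of RNN cells needed when each cell emits sample_len time points."""
    if max_sequence_len % sample_len != 0:
        raise ValueError("sample_len must divide max_sequence_len evenly")
    return max_sequence_len // sample_len


print(valid_sample_lens(28))  # [1, 2, 4, 7, 14, 28]
print(rnn_steps(28, 7))       # 4
```

With max_sequence_len=28, for example, sample_len=7 means the sequence is produced in 4 chunks of 7 time points each, while sample_len=1 means 28 single-point steps.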

We recommend using sample_len=1 for shorter time sequences, say up to ~20 time points (max_sequence_len=20). For longer sequences, experimenting with different values of sample_len lets you explore the tradeoff between a larger model that probably requires more data to train (small sample_len) and a smaller model with faster per-epoch training (larger sample_len). It can also be very useful if you know there's periodicity in your data, e.g., use sample_len=7 for daily data with weekly patterns, though this is not required.
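As a rough illustration of the weekly case, the sketch below cuts one long daily table (the 15,884 x 11 shape from the question) into fixed-length example sequences. It assumes the model takes a 3-D array of shape (examples, time points, features). The helper `to_sequences`, the choice of max_sequence_len=28, and the random placeholder data are all assumptions made for this example, not values from the thread or the library.

```python
import numpy as np


def to_sequences(data: np.ndarray, max_sequence_len: int) -> np.ndarray:
    """Cut a (n_days, n_features) array into non-overlapping examples.

    Trailing rows that do not fill a complete sequence are dropped.
    """
    n_days, n_features = data.shape
    n_examples = n_days // max_sequence_len
    usable = n_examples * max_sequence_len
    return data[:usable].reshape(n_examples, max_sequence_len, n_features)


# Placeholder data with the shape described in the question.
daily = np.random.default_rng(0).normal(size=(15884, 11))

sample_len = 7         # one week of daily values per RNN cell
max_sequence_len = 28  # four weeks per example (hypothetical choice)
assert max_sequence_len % sample_len == 0  # sample_len must divide evenly

sequences = to_sequences(daily, max_sequence_len)
print(sequences.shape)  # (567, 28, 11)
```

The chosen sample_len and max_sequence_len would then be supplied as the model's configuration values. Note that the chunks here start at the first row, so they line up with calendar weeks only if the data itself starts on a week boundary.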

There are a few places to learn more about this model. For this particular implementation, see our blog posts https://gretel.ai/blog/create-synthetic-time-series-with-doppelganger-and-pytorch and https://gretel.ai/blog/generate-time-series-data-with-gretels-new-dgan-model. Our PyTorch implementation is based on the DoppelGANger model published in https://arxiv.org/abs/1909.13403. That paper includes some discussion of sample_len as a configurable parameter of the model.

Hope that information helps! Let me know if you have any other questions.

FrancisNji commented 1 year ago

Many thanks for this clarification.