9harshit commented 3 years ago

No matter the length of my dataset (1000, 4000, 10000, or 100000 rows), the following error is thrown:

RuntimeError: Model training failed. Your training data may have too few records in it. Please try increasing your training rows and try again

# Load dataset and build synthetic model
from gretel_helpers.synthetics import SyntheticDataBundle

# Specify dataset
dataset_path = 'https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/healthcare-analytics-vidhya/train_data.csv'
nrows = 10000
config_template = {
    "checkpoint_dir": "/content/sample_data/checkpoints3",
    "dp": True,  # enable differential privacy in training
    "epochs": 25,  # recommend 15-30 epochs to train production models
    "gen_lines": nrows,  # number of lines to generate in first batch
    # "vocab_size": 20000
}

# Gretel helpers to optimize the synthetic model
training_df = pd.read_csv(dataset_path)
bundle = SyntheticDataBundle(
    training_df=training_df,
    auto_validate=False,  # build record validators that learn per-column, these are used to ensure generated records have the same composition as the original
    synthetic_config=config_template,  # the config for Synthetics
)

zredlined commented 3 years ago

Hey @9harshit - It looks like you're using the same dataset as one of our reference examples. Have you tried running the template blueprint? It runs fine for me in Colab. I've attached a notebook here: https://colab.research.google.com/drive/1_vACH7_GMt5SA0ENc1UMdeZuV44h1pdG?usp=sharing

That said, here's the culprit in your config

It looks like you're running differential privacy, which is almost definitely what's causing your error. Typically, differential privacy can require very large and homogeneous input sets to work. To run DP on this dataset, try using a low dp_noise_multiplier (e.g. 0.01) and setting dp_l2_norm_clip (gradient clipping) to 1.5. This will result in a higher epsilon value, but likely still very good practical protection against the model memorizing private data. You can read more about the different DP settings here: https://gretel.ai/blog/practical-privacy-with-synthetic-data
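As a rough sketch of the suggestion above, the two DP settings can be layered on top of the config from the original post. The keys dp_noise_multiplier and dp_l2_norm_clip and their values come from this comment. The dp_overrides and dp_config names are only illustrative, and I'm assuming these keys are passed through synthetic_config the same way as the others:

```python
# Config as posted in the original question (vocab_size left out, as it was commented out).
nrows = 10000
config_template = {
    "checkpoint_dir": "/content/sample_data/checkpoints3",
    "dp": True,          # differential privacy stays enabled
    "epochs": 25,
    "gen_lines": nrows,
}

# DP settings suggested in this comment; "dp_overrides" is a hypothetical helper name.
dp_overrides = {
    "dp_noise_multiplier": 0.01,  # low noise, so a higher epsilon
    "dp_l2_norm_clip": 1.5,       # gradient clipping threshold
}

# Merge without mutating the original dict, so both variants can be compared.
dp_config = {**config_template, **dp_overrides}
print(dp_config)
```

The merged dp_config would then be passed as synthetic_config in place of config_template.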

Other comments:

Is there a reason you're only running 25 epochs? I'd suggest using 100. Synthetics uses a validation set and early stopping by default, which prevents overfitting.

Let us know if you have any other questions, or feel free to reach out on our Slack channel (https://gretel.ai/slackinvite).

9harshit commented 3 years ago

Thanks for the suggestions. I will try them. Meanwhile, I am trying to generate synthetic user location data. My dataset contains the following columns: User_id, Timestamp, Latitude, Longitude. Is it possible to get good synthetic data for this, including the timestamp? And is it possible to get multiple, related rows for the same user (i.e. user_id 10 has 100 rows, user_id 11 has 100 rows, and so on)?

zredlined commented 3 years ago

@9harshit yes, check out our blog on synthetic time series as an example: https://github.com/gretelai/gretel-blueprints/blob/main/gretel/create_synthetic_data_from_time_series/blueprint.ipynb

Your config would look something like this:

synthetic_df = TimeseriesModel(
    training_df=train_df,
    time_column="Timestamp",
    other_seed_columns=["User_id"],
    synthetic_config=config_template
).train().generate().df
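To check whether the output has the per-user shape asked about above (e.g. 100 rows for each User_id), a small count over the generated rows may help. This is a minimal sketch in plain Python. The rows_per_user helper and the sample records are made up for illustration and are not part of the Gretel helpers; with a real DataFrame, the records could come from something like synthetic_df.to_dict("records"):

```python
from collections import Counter

def rows_per_user(records, user_key="User_id"):
    """Count how many rows each user has (hypothetical helper, not a Gretel API)."""
    return Counter(record[user_key] for record in records)

# Illustrative records only; real data would come from the generated DataFrame.
records = [
    {"User_id": 10, "Timestamp": "2021-01-01T00:00:00", "Latitude": 12.97, "Longitude": 77.59},
    {"User_id": 10, "Timestamp": "2021-01-01T00:05:00", "Latitude": 12.98, "Longitude": 77.60},
    {"User_id": 11, "Timestamp": "2021-01-01T00:00:00", "Latitude": 28.61, "Longitude": 77.21},
]

counts = rows_per_user(records)
for user_id, n_rows in sorted(counts.items()):
    print(user_id, n_rows)
```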

9harshit commented 3 years ago

Awesome. Thank you!

gretelai / gretel-blueprints

Too few records #37
