Closed: @9harshit closed this issue 3 years ago
Hey @9harshit - It looks like you're using the same dataset as one of our reference examples. Have you tried running the template blueprint? It runs fine for me in Colab; I've attached a notebook here: https://colab.research.google.com/drive/1_vACH7_GMt5SA0ENc1UMdeZuV44h1pdG?usp=sharing
That said, here's the culprit in your config: the differential privacy settings. Try lowering dp_noise_multiplier (e.g. to 0.01) and set dp_l2_norm_clip gradient clipping to 1.5. This will result in a higher epsilon value, but likely very good practical protections to prevent the model from memorizing private data. You can read more about different DP settings here: https://gretel.ai/blog/practical-privacy-with-synthetic-data

Other comments:
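For reference, a minimal sketch of how those two settings could sit alongside the existing keys in your config_template (the values are the ones suggested above, illustrative rather than tuned for your data):

config_template = {
    # ... existing settings such as "dp": True, "epochs", "gen_lines" ...
    "dp_noise_multiplier": 0.01,  # lower noise multiplier, per the suggestion above
    "dp_l2_norm_clip": 1.5,       # gradient clipping value suggested above
}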
Let us know if you have any other questions, or feel free to reach out on our Slack channel (https://gretel.ai/slackinvite).
Thanks for the suggestions, I will try them. Meanwhile, I am trying to generate synthetic user location data. My dataset contains the following columns: User_id, Timestamp, Latitude, Longitude. Is it possible to get good synthetic data for this along with the timestamp? And is it possible to get multiple, related rows for the same user (i.e., user_id 10 has 100 rows, user_id 11 has 100 rows, and so on)?
@9harshit Yes, check out our synthetic time series blueprint as an example: https://github.com/gretelai/gretel-blueprints/blob/main/gretel/create_synthetic_data_from_time_series/blueprint.ipynb
Your config would look something like this (the import path below is an assumption; check the blueprint notebook linked above for the exact module in your gretel_helpers version):

from gretel_helpers.series_models import TimeseriesModel  # assumed import path

synthetic_df = TimeseriesModel(
    training_df=train_df,
    time_column="Timestamp",
    other_seed_columns=["User_id"],
    synthetic_config=config_template
).train().generate().df
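For context, a minimal sketch of preparing train_df from the columns you described (the file name is hypothetical; only the column names come from your message):

import pandas as pd

# Hypothetical CSV with columns User_id, Timestamp, Latitude, Longitude
train_df = pd.read_csv("user_locations.csv", parse_dates=["Timestamp"])
train_df = train_df.sort_values(["User_id", "Timestamp"]).reset_index(drop=True)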
Awesome, thank you!
No matter the size of my dataset (1,000, 4,000, 10,000, or 100,000 rows), the following error is thrown:
RuntimeError: Model training failed. Your training data may have too few records in it. Please try increasing your training rows and try again
Load dataset and build synthetic model
import pandas as pd  # needed for pd.read_csv below
from gretel_helpers.synthetics import SyntheticDataBundle
Specify dataset
dataset_path = 'https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/healthcare-analytics-vidhya/train_data.csv'
nrows = 10000

config_template = {
    "checkpoint_dir": "/content/sample_data/checkpoints3",
    "dp": True,           # enable differential privacy in training
    "epochs": 25,         # recommend 15-30 epochs to train production models
    "gen_lines": nrows,   # number of lines to generate in first batch
    "vocab_size": 20000
}
Gretel helpers to optimize the synthetic model
training_df = pd.read_csv(dataset_path)

bundle = SyntheticDataBundle(
    training_df=training_df,
    auto_validate=False,  # build record validators that learn per-column; these are used to ensure generated records have the same composition as the original
    synthetic_config=config_template,  # the config for Synthetics
)
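As a quick sanity check before building the model (plain pandas, independent of the Gretel helpers), it may help to confirm how many usable rows are actually being read, since the error message points at too few training records; the exact minimum the model requires isn't stated in the error:

print(f"Rows read: {len(training_df)}")
print(f"Rows without missing values: {len(training_df.dropna())}")
print(training_df.dtypes)  # confirm the columns parsed as expected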