Closed fangyiyu closed 1 year ago
Hi @fangyiyu, we very recently updated our repo, which now uses close to the latest version of everything, but the research project synthetic-text-generation-with-DP is indeed still based on the previous versions. That said, I don't think the `datasets` package version is critical for `fine-tune-dp.py`, so it should be okay to upgrade it.

In my opinion, the error you are getting from the `prv_accountant` package is entirely independent of `datasets` and most likely has to do with your privacy parameters. If you could let us know your target_epsilon (or noise multiplier), your dataset size, the effective batch size you'd like to use, and the number of epochs, we can see why the `prv_accountant` package is raising an error.
Hi @huseyinatahaninan, thank you for the reply. I'm following the script for fine-tuning with DP on this page, though I only have one GPU instead of 8. Below is my fine-tuning script, where you can see that target_epsilon is set to 4, the training batch size is 32, and the validation batch size is 64. I'm using a very small dataset for a preliminary experiment: the training set contains 100 instances with an average of 47 tokens per instance, and the validation set contains 10 instances with an average of 114 tokens per instance.
```shell
python3 fine-tune-dp.py \
    --data_dir data \
    --output_dir output \
    --model_name gpt2 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --log_level info \
    --per_device_eval_batch_size 64 \
    --eval_accumulation_steps 1 \
    --seed 42 \
    --target_epsilon 4.0 \
    --per_sample_max_grad_norm 1.0 \
    --weight_decay 0.01 \
    --remove_unused_columns False \
    --num_train_epochs 50 \
    --logging_steps 10 \
    --max_grad_norm 0 \
    --sequence_len 128 \
    --learning_rate 0.0001 \
    --lr_scheduler_type constant \
    --dataloader_num_workers 2 \
    --disable_tqdm True \
    --load_best_model_at_end True
```
Thank you for your time and hope to hear from you soon.
Hi @fangyiyu, please note that the effective batch size is (number of GPUs × per_device_train_batch_size × gradient_accumulation_steps). Since you set `--per_device_train_batch_size 32` and `--gradient_accumulation_steps 16` on one GPU, your effective batch size is 32 × 16 = 512, which is larger than your training dataset of 100 instances. I'd suggest keeping the effective batch size at most 10% of the training data, so perhaps set `--per_device_train_batch_size 8` and `--gradient_accumulation_steps 1`, and let me know if you are still getting an error.

If your 100-instance training dataset is only for a preliminary experiment, that's okay, but note that for a better privacy-utility trade-off you'd want to use a much larger training dataset.
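The arithmetic above can be sketched as follows. This is an illustrative snippet, not code from the repo; the function names are hypothetical, but the relationship between effective batch size and the DP sampling probability q (which the privacy accountant needs to be strictly between 0 and 1) is standard:

```python
def effective_batch_size(num_gpus, per_device_batch_size, grad_accum_steps):
    # Total number of examples contributing to one optimizer step.
    return num_gpus * per_device_batch_size * grad_accum_steps

def sampling_probability(batch_size, dataset_size):
    # Poisson-sampling rate q used by DP accountants; must satisfy 0 < q < 1.
    return batch_size / dataset_size

# Settings from the script above: 1 GPU, batch size 32, 16 accumulation steps.
bs = effective_batch_size(1, 32, 16)           # 512
q = sampling_probability(bs, 100)              # 5.12 -- invalid, q > 1

# Suggested settings: batch size 8, no accumulation, on one GPU.
bs_fixed = effective_batch_size(1, 8, 1)       # 8
q_fixed = sampling_probability(bs_fixed, 100)  # 0.08, within the 10% guideline
```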
Thank you for the reply. I can now successfully fine-tune GPT-2 using the latest dp-transformers library, which implements PEFT.
When I fine-tune GPT-2 with DP using `fine-tune-dp.py` in dp-transformers/research/synthetic-text-generation-with-DP, an error occurs: `TypeError: read_csv() got an unexpected keyword argument 'mangle_dupe_cols'`

The error seems to be caused by the version of `datasets`: if I upgrade `datasets` to the latest version (2.14.6), that error disappears, but another one occurs: `ValueError: math domain error`

I checked the source code and believe this error comes from `math.log(1 / q - 1)` in the `prv_accountant` package, which is a dependency of `dp_transformers`. I tried upgrading `dp_transformers`, but it requires `datasets<=2.6.1,>=2.0.0`, which brings back the TypeError mentioned before. Any suggestion for avoiding this library incompatibility would be appreciated. Thank you!