xiehuanyi opened this issue 1 year ago
Hi @xiehuanyi, our example was based on transformers==4.20.1, and although I am sure it would work with more recent versions, we unfortunately have not had time to test the newest ones. I'll try to do that sometime this week, but in the meantime, if it's okay for you to use an older version, that would be the fastest route right now.
Just FYI, it works up to transformers==4.28.1, but beyond that it may need some changes. I see that the elif self.deepspeed: ... part is causing the issue, so you can comment it out to see if it works with transformers==4.30.1. I have not fully tested with that version, though, so I don't know yet whether it requires further changes.
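If it helps, here is a tiny sketch (my own suggestion, not part of the example) that flags the version mismatch up front; it only relies on packaging, which ships as a transformers dependency:

import transformers
from packaging import version

# 4.28.1 is the last version reported above to work without modification
if version.parse(transformers.__version__) > version.parse("4.28.1"):
    print(f"transformers {transformers.__version__} may need the "
          "`elif self.deepspeed:` workaround mentioned above")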
Great! I will try it. Thanks ~
I just tried it out on a toy dataset and found something strange: without differential privacy it ran fine all three times, but with differential privacy it fails from time to time. My code is shown below.
from dp_transformers.grad_sample.transformers import conv_1d
from transformers import AutoModelForCausalLM, GPT2Model
from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset
import torch


class ToyData(Dataset):
    def __init__(self):
        super().__init__()

    def __getitem__(self, index):
        return (
            torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]),
            torch.tensor([1, 2, 3, 4, 5, 6, 7, 8]),
            torch.tensor([1, 2, 3, 4, 5, 6, 7, 8])
        )

    def __len__(self):
        return 100


def run(use_dp):
    data_loader = DataLoader(ToyData(), batch_size=8)
    model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    from opacus import PrivacyEngine
    model = model.train()
    if use_dp:
        pe = PrivacyEngine()
        model, opt, data_loader = pe.make_private(
            module=model,
            optimizer=opt,
            data_loader=data_loader,
            noise_multiplier=1.3,
            max_grad_norm=1.0)
    for epoch in range(100):
        for batch in data_loader:
            # print([i.shape for i in batch])
            loss = model(input_ids=batch[0], labels=batch[1], position_ids=batch[2]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    print(loss.item())


for use_dp in [True, False]:
    for i in range(3):
        try:
            run(use_dp)
            print(f"use_dp: {use_dp} success")
        except Exception as e:
            print(f"use_dp: {use_dp} error msg: {e}")
and here is the output
/home/huanyi/miniconda3/envs/dlenv/lib/python3.10/site-packages/opacus/privacy_engine.py:141: UserWarning: Secure RNG turned off. This is perfectly fine for experimentation as it allows for much faster training performance, but remember to turn it on and retrain one last time before production with ``secure_mode`` turned on.
  warnings.warn(
/home/huanyi/miniconda3/envs/dlenv/lib/python3.10/site-packages/torch/nn/modules/module.py:1053: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
/home/huanyi/miniconda3/envs/dlenv/lib/python3.10/site-packages/torch/nn/modules/module.py:1018: UserWarning: Using non-full backward hooks on a Module that does not return a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_output. Please use register_full_backward_hook to get the documented behavior.
  warnings.warn("Using non-full backward hooks on a Module that does not return a "
/home/huanyi/miniconda3/envs/dlenv/lib/python3.10/site-packages/torch/autograd/__init__.py:173: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
1.9542039632797241
use_dp: True success
1.9521187543869019
use_dp: True success
use_dp: True error msg: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.FloatTensor instead (while checking arguments for embedding)
1.9460290670394897
use_dp: False success
1.9460164308547974
use_dp: False success
1.9460010528564453
use_dp: False success
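As an aside, and only a guess on my part: Opacus's make_private enables Poisson sampling by default, so an occasional batch can come out empty, and the placeholder tensors built for an empty batch default to float, which would match the "indices ... got torch.FloatTensor" error above. A minimal sketch of a guard, reusing the variable names from the snippet above, is to skip empty batches in the inner loop (passing poisson_sampling=False to make_private is another option, at the cost of changing the sampling assumption behind the accounting):

# Sketch only: same data_loader / model / opt as inside run() above
for epoch in range(100):
    for batch in data_loader:
        if batch[0].numel() == 0:
            # Poisson sampling drew an empty batch; nothing to train on, skip it
            continue
        loss = model(input_ids=batch[0], labels=batch[1], position_ids=batch[2]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()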
I followed the tips given:
Excuse me, which reddit dataset do you use? I didn't find 'reddit' on Hugging Face, but I found something similar here: https://huggingface.co/datasets/solomonk/reddit. However, it seems to have some compatibility problems. Since the connection from mainland China to Hugging Face is unstable, I used git-lfs to clone the dataset like this: git-lfs clone https://huggingface.co/datasets/solomonk/reddit
It is stored under the directory 'dp-transformers' and it is found properly. However, I got an error while running. My command and output are shown below:
command
python examples/nlg-reddit/sample-level-dp/fine-tune-dp.py \
--output_dir scratch \
--model_name tiny-gpt2 \
--sequence_len 128 \
--per_device_train_batch_size 32 \
--gradient_accumulation_steps 2 \
--evaluation_strategy steps \
--eval_steps 45 \
--log_level info \
--per_device_eval_batch_size 64 \
--eval_accumulation_steps 1 \
--seed 42 \
--target_epsilon 8 \
--per_sample_max_grad_norm 1.0 \
--prediction_loss_only \
--weight_decay 0.01 \
--remove_unused_columns False \
--num_train_epochs 3 \
--logging_steps 5 \
--max_grad_norm 0 \
--lr_scheduler_type constant \
--learning_rate 1e-4 \
--disable_tqdm True \
--dataloader_num_workers 2
output:
07/13/2023 12:06:24:WARNING:Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
07/13/2023 12:06:24:INFO:Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=2,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=True,
do_eval=True,
do_predict=False,
do_train=False,
dry_run=False,
eval_accumulation_steps=1,
eval_delay=0,
eval_steps=45,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=2,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0001,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=info,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=scratch/runs/Jul13_12-06-23_df0caa500212d011ee0917a0c7f822b9ff09-task1-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=5,
logging_strategy=steps,
lr_scheduler_type=constant,
max_grad_norm=0.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=scratch,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=64,
per_device_train_batch_size=32,
prediction_loss_only=True,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=[],
resume_from_checkpoint=None,
run_name=scratch,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.01,
xpu_backend=None,
)
07/13/2023 12:06:24:INFO:Privacy parameters PrivacyArguments(per_sample_max_grad_norm=1.0, noise_multiplier=None, target_epsilon=8.0, target_delta=None, disable_dp=False)
[INFO|configuration_utils.py:667] 2023-07-13 12:06:24,025 >> loading configuration file tiny-gpt2/config.json
[INFO|configuration_utils.py:725] 2023-07-13 12:06:24,026 >> Model config GPT2Config {
"_name_or_path": "tiny-gpt2",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 2,
"n_head": 2,
"n_inner": null,
"n_layer": 2,
"n_positions": 1024,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.1,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.30.2",
"use_cache": true,
"vocab_size": 50257
}
[INFO|modeling_utils.py:2575] 2023-07-13 12:06:24,052 >> loading weights file tiny-gpt2/pytorch_model.bin
[INFO|configuration_utils.py:577] 2023-07-13 12:06:24,059 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 50256,
"eos_token_id": 50256,
"transformers_version": "4.30.2"
}
[INFO|modeling_utils.py:3295] 2023-07-13 12:06:24,213 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.
[INFO|modeling_utils.py:3304] 2023-07-13 12:06:24,213 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at tiny-gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|modeling_utils.py:2928] 2023-07-13 12:06:24,214 >> Generation config file not found, using a generation config created from the model config.
07/13/2023 12:06:27:INFO:Some files matched the pattern 'reddit/**' at /code/dp-transformers/reddit but don't have valid data file extensions: [PosixPath('/code/dp-transformers/reddit/RS_2006-01.zst'), PosixPath('/code/dp-transformers/reddit/RC_2006-01.bz2')]
07/13/2023 12:06:27:WARNING:Using custom data configuration reddit-622990d947526d4c
07/13/2023 12:06:27:INFO:Loading Dataset Infos from /opt/conda/lib/python3.7/site-packages/datasets/packaged_modules/json
07/13/2023 12:06:27:INFO:Generating dataset json (/root/.cache/huggingface/datasets/json/reddit-622990d947526d4c/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)
Downloading and preparing dataset json/reddit to /root/.cache/huggingface/datasets/json/reddit-622990d947526d4c/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab...
07/13/2023 12:06:28:INFO:Dataset not on Hf google storage. Downloading and preparing it from source
07/13/2023 12:06:28:INFO:Downloading took 0.0 min
07/13/2023 12:06:28:INFO:Checksum Computation took 0.0 min
07/13/2023 12:06:28:INFO:Unable to verify checksums.
07/13/2023 12:06:28:INFO:Generating train split
Traceback (most recent call last):
File "examples/nlg-reddit/sample-level-dp/fine-tune-dp.py", line 141, in <module>
main(Arguments(train=train_args, privacy=privacy_args, model=model_args))
File "examples/nlg-reddit/sample-level-dp/fine-tune-dp.py", line 86, in main
dataset = datasets.load_dataset('reddit')
File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1747, in load_dataset
use_auth_token=use_auth_token,
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 818, in download_and_prepare
**download_and_prepare_kwargs,
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 905, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1520, in _prepare_split
writer.write_table(table)
File "/opt/conda/lib/python3.7/site-packages/datasets/arrow_writer.py", line 540, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/opt/conda/lib/python3.7/site-packages/datasets/table.py", line 2068, in table_cast
return cast_table_to_schema(table, schema)
File "/opt/conda/lib/python3.7/site-packages/datasets/table.py", line 2029, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
archived: bool
author: string
author_flair_background_color: string
author_flair_css_class: null
author_flair_richtext: list<item: null>
child 0, item: null
author_flair_text: null
author_flair_text_color: string
author_flair_type: string
brand_safe: bool
can_gild: bool
contest_mode: bool
created_utc: int64
distinguished: null
domain: string
edited: bool
gilded: int64
hidden: bool
hide_score: bool
id: string
is_crosspostable: bool
is_reddit_media_domain: bool
is_self: bool
is_video: bool
link_flair_css_class: null
link_flair_richtext: list<item: null>
child 0, item: null
link_flair_text: null
link_flair_text_color: string
link_flair_type: string
locked: bool
media: null
media_embed: struct<>
no_follow: bool
num_comments: int64
num_crossposts: int64
over_18: bool
parent_whitelist_status: string
permalink: string
rte_mode: string
score: int64
secure_media: null
secure_media_embed: struct<>
selftext: string
send_replies: bool
spoiler: bool
stickied: bool
subreddit: string
subreddit_id: string
subreddit_name_prefixed: string
subreddit_type: string
suggested_sort: null
thumbnail: string
thumbnail_height: int64
thumbnail_width: int64
title: string
url: string
whitelist_status: string
post_hint: string
preview: struct<enabled: bool, images: list<item: struct<id: string, resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>, variants: struct<nsfw: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>, obfuscated: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>>>>>
child 0, enabled: bool
child 1, images: list<item: struct<id: string, resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>, variants: struct<nsfw: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>, obfuscated: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>>>>
child 0, item: struct<id: string, resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>, variants: struct<nsfw: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>, obfuscated: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>>>
child 0, id: string
child 1, resolutions: list<item: struct<height: int64, url: string, width: int64>>
child 0, item: struct<height: int64, url: string, width: int64>
child 0, height: int64
child 1, url: string
child 2, width: int64
child 2, source: struct<height: int64, url: string, width: int64>
child 0, height: int64
child 1, url: string
child 2, width: int64
child 3, variants: struct<nsfw: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>, obfuscated: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>>
child 0, nsfw: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>
child 0, resolutions: list<item: struct<height: int64, url: string, width: int64>>
child 0, item: struct<height: int64, url: string, width: int64>
child 0, height: int64
child 1, url: string
child 2, width: int64
child 1, source: struct<height: int64, url: string, width: int64>
child 0, height: int64
child 1, url: string
child 2, width: int64
child 1, obfuscated: struct<resolutions: list<item: struct<height: int64, url: string, width: int64>>, source: struct<height: int64, url: string, width: int64>>
child 0, resolutions: list<item: struct<height: int64, url: string, width: int64>>
child 0, item: struct<height: int64, url: string, width: int64>
child 0, height: int64
child 1, url: string
child 2, width: int64
child 1, source: struct<height: int64, url: string, width: int64>
child 0, height: int64
child 1, url: string
child 2, width: int64
retrieved_on: int64
to
{'gilded': Value(dtype='int64', id=None), 'distinguished': Value(dtype='null', id=None), 'retrieved_on': Value(dtype='int64', id=None), 'author_flair_text': Value(dtype='null', id=None), 'author': Value(dtype='string', id=None), 'edited': Value(dtype='bool', id=None), 'id': Value(dtype='string', id=None), 'parent_id': Value(dtype='string', id=None), 'subreddit': Value(dtype='string', id=None), 'score': Value(dtype='int64', id=None), 'ups': Value(dtype='int64', id=None), 'created_utc': Value(dtype='int64', id=None), 'author_flair_css_class': Value(dtype='null', id=None), 'body': Value(dtype='string', id=None), 'controversiality': Value(dtype='int64', id=None), 'subreddit_id': Value(dtype='string', id=None), 'stickied': Value(dtype='bool', id=None), 'link_id': Value(dtype='string', id=None)}
because column names don't match
It seems I used the wrong version of the dataset.
dataset = datasets.load_dataset('reddit', split="train[:500000]").train_test_split(0.02, seed=args.train.seed)
This is where we load the dataset, and we were using datasets==2.0.0. Does this not work for you?
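For reference, a minimal standalone version of that loading step (assuming datasets==2.0.0, access to the Hugging Face Hub, and a fixed seed of 42 standing in for args.train.seed) looks like this:

import datasets

# Same call as in fine-tune-dp.py, just with a hard-coded seed for illustration
dataset = datasets.load_dataset('reddit', split="train[:500000]")
dataset = dataset.train_test_split(0.02, seed=42)
print(dataset)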
It turns out that my network was not stable, which led to the failure. I cloned the dataset with git-lfs and it works fine for me now. Thanks a lot!
Glad that it helps! Sorry, I did not get a chance to look at the other error you got, but it does not really look like an error related to DP.
Hi, I ran the sample-level example with DP using the command below (run on a local machine in a conda environment):
python -m torch.distributed.run --nproc_per_node 1 fine-tune-dp.py \
--output_dir scratch \
--sequence_len 128 \
--per_device_train_batch_size 64 \
--gradient_accumulation_steps 1 \
--evaluation_strategy steps \
--eval_steps 45 \
--log_level info \
--per_device_eval_batch_size 64 \
--eval_accumulation_steps 1 \
--seed 42 \
--target_epsilon 8 \
--per_sample_max_grad_norm 1.0 \
--prediction_loss_only \
--weight_decay 0.01 \
--remove_unused_columns False \
--num_train_epochs 3 \
--logging_steps 5 \
--lora_dim 4 \
--lora_alpha 32 \
--lora_dropout 0.0 \
--max_grad_norm 0 \
--lr_scheduler_type constant \
--learning_rate 3e-4 \
--disable_tqdm True \
--dataloader_num_workers 2 \
--label_names labels \
--enable_lora
but got the following error when attempting to train:
Traceback (most recent call last):
File "/home/wentao/shiqi/dp-transformers/examples/nlg-reddit/sample-level-dp/fine-tune-dp.py", line 146, in
main(Arguments(train=train_args, privacy=privacy_args, model=model_args, lora=lora_args)) File "/home/wentao/shiqi/dp-transformers/examples/nlg-reddit/sample-level-dp/fine-tune-dp.py", line 134, in main trainer.train() File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train return inner_training_loop( File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/transformers/trainer.py", line 1851, in _inner_training_loop self.control = self.callback_handler.on_step_begin(args, self.state, self.control) File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/transformers/trainer_callback.py", line 386, in on_step_begin return self.call_event("on_step_begin", args, state, control) File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/transformers/trainer_callback.py", line 414, in call_event result = getattr(callback, event)( File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/dp_transformers/dp_utils.py", line 61, in on_step_begin optimizer.signal_skip_step(do_skip=False) AttributeError: 'AcceleratedOptimizer' object has no attribute 'signal_skip_step' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6753) of binary: /home/wentao/anaconda3/envs/dp-transformers/bin/python Traceback (most recent call last): File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/torch/distributed/run.py", line 765, in main() File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper return f(*args, **kwargs) File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main run(args) File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run elastic_launch( File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/home/wentao/anaconda3/envs/dp-transformers/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I changed the dataset to ag_news since reddit is too big. Can you suggest what the issue is? Also, after I activated the environment and installed the dp_transformers library, it reinstalled torch 1.12.1, which is incompatible with peft:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
triton 2.0.0 requires cmake, which is not installed.
triton 2.0.0 requires lit, which is not installed.
peft 0.4.0 requires torch>=1.13.0, but you have torch 1.12.1 which is incompatible.
Successfully installed dp_transformers-1.0.0 functorch-0.2.1 opacus-1.3.0 torch-1.12.1
and here are my environment versions:
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
accelerate 0.21.0 pypi_0 pypi
aiohttp 3.8.5 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
alembic 1.11.2 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
azure-common 1.1.28 pypi_0 pypi
azure-core 1.29.2 pypi_0 pypi
azure-identity 1.14.0 pypi_0 pypi
azure-mgmt-core 1.4.0 pypi_0 pypi
azure-storage-blob 12.13.0 pypi_0 pypi
azureml-mlflow 1.52.0 pypi_0 pypi
blas 1.0 mkl
blinker 1.6.2 pypi_0 pypi
bzip2 1.0.8 h5eee18b_6
ca-certificates 2023.05.30 h06a4308_0
certifi 2023.7.22 pypi_0 pypi
cffi 1.15.1 pypi_0 pypi
charset-normalizer 3.2.0 pypi_0 pypi
click 8.1.6 pypi_0 pypi
cloudpickle 2.2.1 pypi_0 pypi
contourpy 1.1.0 pypi_0 pypi
cryptography 41.0.3 pypi_0 pypi
cuda-cudart 11.8.89 0 nvidia
cuda-cupti 11.8.87 0 nvidia
cuda-libraries 11.8.0 0 nvidia
cuda-nvrtc 11.8.89 0 nvidia
cuda-nvtx 11.8.86 0 nvidia
cuda-runtime 11.8.0 0 nvidia
cycler 0.11.0 pypi_0 pypi
databricks-cli 0.17.7 pypi_0 pypi
datasets 2.14.4 pypi_0 pypi
dill 0.3.7 pypi_0 pypi
docker 6.1.3 pypi_0 pypi
dp-transformers 1.0.0 pypi_0 pypi
entrypoints 0.4 pypi_0 pypi
exceptiongroup 1.1.3 pypi_0 pypi
filelock 3.9.0 py310h06a4308_0
flask 2.3.2 pypi_0 pypi
fonttools 4.42.0 pypi_0 pypi
frozenlist 1.4.0 pypi_0 pypi
fsspec 2023.6.0 pypi_0 pypi
functorch 0.2.1 pypi_0 pypi
gitdb 4.0.10 pypi_0 pypi
gitpython 3.1.32 pypi_0 pypi
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 py310heeb90bb_0
greenlet 2.0.2 pypi_0 pypi
gunicorn 21.2.0 pypi_0 pypi
huggingface-hub 0.19.4 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 6.8.0 pypi_0 pypi
iniconfig 2.0.0 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
isodate 0.6.1 pypi_0 pypi
itsdangerous 2.1.2 pypi_0 pypi
jinja2 3.1.2 py310h06a4308_0
joblib 1.3.2 pypi_0 pypi
jsonpickle 3.0.2 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libcublas 11.11.3.6 0 nvidia
libcufft 10.9.0.58 0 nvidia
libcufile 1.7.1.12 0 nvidia
libcurand 10.3.3.129 0 nvidia
libcusolver 11.4.1.48 0 nvidia
libcusparse 11.7.5.86 0 nvidia
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libnpp 11.8.0.86 0 nvidia
libnvjpeg 11.9.0.86 0 nvidia
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
mako 1.2.4 pypi_0 pypi
markdown 3.4.4 pypi_0 pypi
markupsafe 2.1.1 py310h7f8727e_0
matplotlib 3.7.2 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344
mlflow 2.6.0 pypi_0 pypi
mlflow-skinny 2.6.0 pypi_0 pypi
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 py310h06a4308_0
msal 1.23.0 pypi_0 pypi
msal-extensions 1.0.0 pypi_0 pypi
msrest 0.7.1 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.15 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.1 py310h06a4308_0
numpy 1.25.2 pypi_0 pypi
oauthlib 3.2.2 pypi_0 pypi
opacus 1.3.0 pypi_0 pypi
openssl 3.0.10 h7f8727e_2
opt-einsum 3.3.0 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pandas 2.0.3 pypi_0 pypi
peft 0.4.0 pypi_0 pypi
pillow 10.0.0 pypi_0 pypi
pip 23.2.1 py310h06a4308_0
pluggy 1.2.0 pypi_0 pypi
portalocker 2.7.0 pypi_0 pypi
protobuf 4.24.0 pypi_0 pypi
prv-accountant 0.1.1.post1 pypi_0 pypi
psutil 5.9.5 pypi_0 pypi
pyarrow 12.0.1 pypi_0 pypi
pycparser 2.21 pypi_0 pypi
pyjwt 2.8.0 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pytest 7.4.0 pypi_0 pypi
python 3.10.12 h955ad1f_0
python-dateutil 2.8.2 pypi_0 pypi
pytorch-cuda 11.8 h7e8668a_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2023.3 pypi_0 pypi
pyyaml 6.0.1 pypi_0 pypi
querystring-parser 1.2.4 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2023.8.8 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
safetensors 0.3.2 pypi_0 pypi
scikit-learn 1.3.0 pypi_0 pypi
scipy 1.11.1 pypi_0 pypi
setuptools 68.0.0 py310h06a4308_0
six 1.16.0 pypi_0 pypi
smmap 5.0.0 pypi_0 pypi
sqlalchemy 2.0.19 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0
sqlparse 0.4.4 pypi_0 pypi
sympy 1.11.1 py310h06a4308_0
tabulate 0.9.0 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
threadpoolctl 3.2.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.15.0 pypi_0 pypi
tomli 2.0.1 pypi_0 pypi
torch 1.12.1 pypi_0 pypi
torchtriton 2.0.0 py310 pytorch
tqdm 4.66.1 pypi_0 pypi
transformers 4.36.1 pypi_0 pypi
typing_extensions 4.7.1 py310h06a4308_0
tzdata 2023.3 pypi_0 pypi
urllib3 1.26.16 pypi_0 pypi
websocket-client 1.6.1 pypi_0 pypi
werkzeug 2.3.7 pypi_0 pypi
wheel 0.38.4 py310h06a4308_0
xxhash 3.3.0 pypi_0 pypi
xz 5.4.2 h5eee18b_0
yarl 1.9.2 pypi_0 pypi
zipp 3.16.2 pypi_0 pypi
zlib 1.2.13 h5eee18b_1
Thank you so much!
Hi @ooolivia2333, apologies for the late response :/ Which version of our repo are you currently using? I see that the error is from on_step_begin, but I removed that a while ago (https://github.com/microsoft/dp-transformers/commit/0358f699be12e8503251e131c6fbf25590cf35eb), and the current dp_utils.py does not have on_step_begin. I think if you use the latest version of our repo, you should not encounter this issue. Please let us know if you have any further issues.
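A quick way to confirm which release is actually installed (just a sketch; importlib.metadata has been in the Python standard library since 3.8):

from importlib.metadata import version

# Prints the installed dp-transformers release, e.g. 1.0.0 in the environment listed above
print(version("dp-transformers"))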
Thanks for your reply! I am attempting to reinstall dp_transformers together with peft, but I encountered the following error:
The conflict is caused by:
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.11.1 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.11.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.10.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.9.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.8.2 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.8.1 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.8.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.7.1 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.7.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.6.2 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.6.1 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.6.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.5.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.4.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.3.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.2.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.1.0 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.0.2 depends on torch>=1.13.0
dp-transformers 1.0.0 depends on torch<=1.12.1 and >=1.9.1
peft 0.0.1 depends on torch>=1.13.0
Can you suggest what versions I should be using?
I think you can use the latest version of our repo (which is 1.0.1) by cloning the repository and installing it with pip install . -- in the latest version this should not lead to such issues, because we have torch>=1.13.1, see https://github.com/microsoft/dp-transformers/blob/main/setup.py
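After installing from the repo, a quick sanity check along these lines (a sketch; both packages expose __version__) can confirm the torch/peft combination is consistent:

import torch
import peft

# peft requires torch >= 1.13.0; a source install of dp-transformers no longer pins torch below that
print("torch:", torch.__version__)
print("peft:", peft.__version__)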
I ran the example given
but got these
here is my environment
Could anyone help me with this?