Closed FryLcm closed 7 months ago
It runs normally when differential privacy is not turned on, but this error occurs as soon as DP is turned on.
Hi, I'm seeing the same problem as you. Can you please show me how to turn off DP? I'm using the text-to-image LoRA script for Stable Diffusion.
For Opacus, the full model needs to be on the same device (for one sample). In other words, we do not support slicing the model across different machines, since we need to clip per-sample gradients. We only allow slicing the batch across different devices. Could you check whether this is the case for your code?
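A quick way to check this, as a minimal sketch: collect the devices of all parameters and buffers and compare them with the device of one batch. The helper name `devices_in_use` and the toy `nn.Linear` model are made up for illustration; they are not part of Opacus.

```python
import torch
from torch import nn


def devices_in_use(model: nn.Module) -> set:
    """Return the set of devices holding the model's parameters and buffers."""
    tensors = list(model.parameters()) + list(model.buffers())
    return {t.device for t in tensors}


# Toy stand-ins for the real model and batch (hypothetical).
model = nn.Linear(4, 2)
batch = torch.randn(8, 4)

model_devices = devices_in_use(model)
# A model that lives on a single device yields a one-element set.
single_device = len(model_devices) == 1
# The batch has to sit on that same device.
batch_matches = batch.device in model_devices
print(model_devices, single_device, batch_matches)
```

If the set has more than one entry, or the batch device is not in it, the model or the data still needs a `.to(device)` call before training.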
Hi, I am experiencing the same issue, but with a twist: with one random seed the code works without a hitch, but with another it raises this error. Why does seeding affect whether I see the error or not?
I have the same issue running the LoRA script from diffusers. Did you find a solution? I'm using Linux Ubuntu 20 with an RTX 3090, and I get this error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)"
Could anyone share the code (using our template)? There is very little we can do without seeing your code. Thanks!
I can't provide code that reproduces the error (I'm not allowed to share it, and it's convoluted), but here's the full traceback:
Traceback (most recent call last):
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 449, in <module>
    main()
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 26, in main
    learner.run()
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 178, in run
    self.run_lira(
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 322, in run_lira
    accuracy, eps = self.train_test(
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 201, in train_test
    self.eps, self.delta = self.fine_tune_batch(model=model, train_loader=train_loader)
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 257, in fine_tune_batch
    optimizer.step()
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 513, in step
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 494, in pre_step
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 412, in clip_and_accumulate
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 507, in contract
    return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 573, in _core_contract
    new_view = _tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/sharing.py", line 131, in cached_tensordot
    return tensordot(x, y, axes, backend=backend)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 374, in _tensordot
    return fn(x, y, axes=axes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/backends/torch.py", line 54, in tensordot
    return torch.tensordot(x, y, dims=axes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/torch/functional.py", line 1100, in tensordot
    return _VF.tensordot(a, b, dims_a, dims_b) # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
My friend suggested it could be an issue with the PyTorch version (I am using 2.0.0).
I found a solution in my case, and it's very simple:
for epoch in range(first_epoch, args.num_train_epochs):
    unet.to("cuda")
    unet.train()
I added a .to("cuda") call on the unet model before .train(), and that fixed it; it works now.
This is in train_text_to_image_lora.py.
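For anyone adapting this to another script, here is a device-agnostic sketch of the same idea. The `train_one_epoch` helper, the toy model, and the in-memory batches are hypothetical; the only point is that the module and every batch are moved to one device before the forward pass.

```python
import torch
from torch import nn


def train_one_epoch(model, batches, optimizer, device):
    """Run one epoch with the model and every batch on `device`."""
    model.to(device)  # same role as unet.to("cuda") in the comment above
    model.train()
    last_loss = None
    for inputs, targets in batches:
        # Move each batch to the device that holds the model.
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        last_loss = loss.item()
    return last_loss


# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
final_loss = train_one_epoch(model, batches, optimizer, device)
```

Picking the device once and passing it around avoids hard-coding "cuda" in several places.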
hey @javismiles could you share your code, or at least the logic where you add opacus to train_text_to_image_lora.py (https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)? Thanks!
Hi, I also got the same error. I have double-checked that the data, the targets, and the model are all on the same GPU.
I figured out the issue a couple of days ago. I narrowed the error down to the function clip_and_accumulate in opacus/optimizers/optimizer.py. If you go to line 399 (https://github.com/pytorch/opacus/blob/main/opacus/optimizers/optimizer.py#L399), you'll find that in the case of an empty batch, per_sample_clip_factor is initialised as torch.zeros((0,)), an empty tensor that is NOT on the GPU. You'll need to change that line so that this zero tensor is on the same device as the empty batch (which, even though it's empty, is still on the GPU).
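This may also explain why the error depends on the random seed. Assuming the data loader uses Poisson sampling (the Opacus default, as far as I know), each example enters a batch independently with some probability, so a batch is occasionally empty, and whether and when that happens depends on the seed. The sketch below simulates this in plain Python; `poisson_batch_sizes` and the numbers are made up for illustration.

```python
import random


def poisson_batch_sizes(num_examples, sample_rate, num_steps, seed):
    """Simulate batch sizes when each example is included independently."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_steps):
        # Each example joins the batch with probability `sample_rate`.
        size = sum(rng.random() < sample_rate for _ in range(num_examples))
        sizes.append(size)
    return sizes


num_examples, sample_rate, num_steps = 20, 0.05, 50
# Probability that a single batch is empty: (1 - q)^N.
p_empty = (1 - sample_rate) ** num_examples

for seed in (0, 1, 2):
    sizes = poisson_batch_sizes(num_examples, sample_rate, num_steps, seed)
    print(f"seed={seed}: {sizes.count(0)} empty batches out of {num_steps}")
print(f"expected share of empty batches: {p_empty:.3f}")
```

Small datasets and small sample rates make empty batches more likely, which matches the error showing up on some runs and not on others.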
I had the exact same issue. @gauriprdhn thank you so much for pointing it out.
One possible solution is modifying per_sample_clip_factor = torch.zeros((0,))
into
per_sample_clip_factor = torch.zeros((0,), device=self.grad_samples[0].device)
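To see the difference in isolation, here is a small sketch. The function `empty_clip_factor` and the stand-in `grad_samples` list are hypothetical; only the two `torch.zeros` calls mirror the lines quoted above, and the `torch.tensordot` call is the one that fails in the traceback.

```python
import torch


def empty_clip_factor(grad_samples):
    """Build the zero-length clip factor on the device of the grad samples."""
    return torch.zeros((0,), device=grad_samples[0].device)


# Fall back to the CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in for the per-sample gradients of an empty batch: shape (0, 3, 3).
grad_samples = [torch.zeros((0, 3, 3), device=device)]

# Without a device argument the tensor lands on the default device (usually the CPU).
default_factor = torch.zeros((0,))
fixed_factor = empty_clip_factor(grad_samples)

# Contract over the (empty) batch dimension; both operands share a device here.
grad = torch.tensordot(fixed_factor, grad_samples[0], dims=([0], [0]))
print(default_factor.device, fixed_factor.device, grad.shape)
```

On a GPU machine, passing `default_factor` to the same `tensordot` call would mix a CPU tensor with a CUDA tensor, which is the situation the error message describes.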
Thanks all for valuable feedback and comments. Will launch a fix soon (special thanks to @gauriprdhn ! Please lmk if you want to submit a PR by yourself).
Closed the issue, since we launched a fix in PR #631.
If you use Docker and the last frame of the error is in linear.py, one workaround is to edit /opt/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py: add self.device = device at line 104, and change line 116 from return F.linear(input, self.weight, self.bias) to return F.linear(input.to(self.device), self.weight.to(self.device), self.bias.to(self.device))
  File "C:\Users\zzx\Desktop\PFL-Non-IID-231119\system\flcore\clients\clientavg.py", line 45, in train
    self.optimizer.step()
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 513, in step
    if self.pre_step():
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 494, in pre_step
    self.clip_and_accumulate()
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 412, in clip_and_accumulate
    grad = contract("i,i...", per_sample_clip_factor, grad_sample)
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 507, in contract
    return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 573, in _core_contract
    new_view = _tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend)
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\sharing.py", line 131, in cached_tensordot
    return tensordot(x, y, axes, backend=backend)
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 374, in _tensordot
    return fn(x, y, axes=axes)
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\backends\torch.py", line 54, in tensordot
    return torch.tensordot(x, y, dims=axes)
  File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\torch\functional.py", line 1193, in tensordot
    return _VF.tensordot(a, b, dims_a, dims_b) # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
Process finished with exit code 1