pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm) #612

Closed · FryLcm closed this issue 7 months ago

FryLcm commented 10 months ago

File "C:\Users\zzx\Desktop\PFL-Non-IID-231119\system\flcore\clients\clientavg.py", line 45, in train self.optimizer.step() File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 513, in step if self.pre_step(): File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 494, in pre_step self.clip_and_accumulate() File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opacus\optimizers\optimizer.py", line 412, in clip_and_accumulate grad = contract("i,i...", per_sample_clip_factor, grad_sample) File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 507, in contract return _core_contract(operands, contraction_list, backend=backend, *einsum_kwargs) File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 573, in _core_contract new_view = _tensordot(tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend) File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\sharing.py", line 131, in cached_tensordot return tensordot(x, y, axes, backend=backend) File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\contract.py", line 374, in _tensordot return fn(x, y, axes=axes) File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\opt_einsum\backends\torch.py", line 54, in tensordot return torch.tensordot(x, y, dims=axes) File "C:\Users\zzx\anaconda3\envs\opacus\lib\site-packages\torch\functional.py", line 1193, in tensordot return _VF.tensordot(a, b, dims_a, dims_b) # type: ignore[attr-defined] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Process finished with exit code 1

FryLcm commented 10 months ago

It runs normally when differential privacy is not turned on, but this error occurs as soon as DP is turned on.

calibretaliation commented 10 months ago

Hi, I'm seeing the same problem as you. Can you please show me how to turn off DP? I'm using the text-to-image LoRA script for Stable Diffusion.

HuanyuZhang commented 10 months ago

For Opacus, the full model needs to be on a single device (for any given sample). In other words, we do not support slicing the model across devices, since we need to clip per-sample gradients; only the batch may be split across devices. Could you check whether this is the case in your code?
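For reference, a minimal sketch of the supported setup (MyModel, train_loader, and the hyperparameters below are placeholders, not code from this issue):

    import torch
    from opacus import PrivacyEngine

    device = torch.device("cuda:0")
    model = MyModel().to(device)  # placeholder model; the whole model lives on one device
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,  # placeholder DataLoader
        noise_multiplier=1.0,
        max_grad_norm=1.0,
    )

    for data, target in train_loader:
        # only the batch is moved to the device; the model itself is never split
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(data), target)
        loss.backward()
        optimizer.step()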

gauriprdhn commented 10 months ago

Hi, I am experiencing the same issue, but with a twist: for one random seed the code works without a hitch, while for another it yields this error. Why does seeding affect whether I see the error or not?

javismiles commented 10 months ago

I have the same issue running the LoRA script from diffusers. Did you find a solution? I'm using Ubuntu 20 on Linux with an RTX 3090, and I get this error: "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)"

HuanyuZhang commented 10 months ago

Could anyone share the code (using our template)? There is very little we can do without seeing your code. Thanks!

gauriprdhn commented 10 months ago

I can't provide code that reproduces the error (sharing it is prohibited, and it is convoluted), but here is the full error traceback:

Traceback (most recent call last):
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 449, in <module>
    main()
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 26, in main
    learner.run()
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 178, in run
    self.run_lira(
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 322, in run_lira
    accuracy, eps = self.train_test(
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 201, in train_test
    self.eps, self.delta = self.fine_tune_batch(model=model, train_loader=train_loader)
  File "/projappl/project_2003275/gpradhan/dp_hp_tuning/src/train_lira.py", line 257, in fine_tune_batch
    optimizer.step()
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 513, in step
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 494, in pre_step
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opacus-1.4.1-py3.9.egg/opacus/optimizers/optimizer.py", line 412, in clip_and_accumulate
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 507, in contract
    return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 573, in _core_contract
    new_view = _tensordot(*tmp_operands, axes=(tuple(left_pos), tuple(right_pos)), backend=backend)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/sharing.py", line 131, in cached_tensordot
    return tensordot(x, y, axes, backend=backend)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/contract.py", line 374, in _tensordot
    return fn(x, y, axes=axes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/opt_einsum/backends/torch.py", line 54, in tensordot
    return torch.tensordot(x, y, dims=axes)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.9/site-packages/torch/functional.py", line 1100, in tensordot
    return _VF.tensordot(a, b, dims_a, dims_b)  # type: ignore[attr-defined]
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

My friend suggested it could be an issue with the PyTorch version (I am using 2.0.0).

javismiles commented 10 months ago

I found a very simple solution in my case:

for epoch in range(first_epoch, args.num_train_epochs):
    unet.to("cuda")
    unet.train()

I added a to.("cuda") to the unet model before the .train()

and that fixed it, it works now

this is in the train_text_to_image_lora.py

HuanyuZhang commented 10 months ago

Hey @javismiles, could you share your code, or at least the part where you add Opacus to train_text_to_image_lora.py (https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)? Thanks!

vaibhav0195 commented 9 months ago

Hi, I also got the same error. I have double-checked that the data, the targets, and the model are all on the same GPU.

gauriprdhn commented 9 months ago

I figured out the issue a couple of days ago. I narrowed the error down to the clip_and_accumulate function in opacus/optimizers/optimizer.py. If you look at line 399 (https://github.com/pytorch/opacus/blob/main/opacus/optimizers/optimizer.py#L399), you'll see that for an empty batch, per_sample_clip_factor is initialised as torch.zeros((0,)), an empty tensor that is NOT on the GPU. That line needs to be changed so the empty tensor is created on the same device as the batch (which, even though it is empty, is still on the GPU).
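A quick illustration of the mismatch (hypothetical shapes, for demonstration only, not code from Opacus):

    import torch

    # an empty per-sample gradient tensor on the GPU, as produced for an empty batch
    grad_sample = torch.randn(0, 5, device="cuda:0")
    # torch.zeros((0,)) defaults to CPU, exactly like the line linked above
    per_sample_clip_factor = torch.zeros((0,))

    # the contraction inside clip_and_accumulate goes through tensordot
    # (see the traceback above), which raises the same
    # "Expected all tensors to be on the same device" RuntimeError
    torch.tensordot(per_sample_clip_factor, grad_sample, dims=([0], [0]))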

Tian99Yu commented 9 months ago

I had the exact same issue. @gauriprdhn thank you so much for pointing it out.

One possible solution is to change per_sample_clip_factor = torch.zeros((0,)) to per_sample_clip_factor = torch.zeros((0,), device=self.grad_samples[0].device).

HuanyuZhang commented 9 months ago

Thanks all for the valuable feedback and comments. We will launch a fix soon (special thanks to @gauriprdhn! Please let me know if you would like to submit the PR yourself).

HuanyuZhang commented 7 months ago

Closed the issue, since we launched a fix in PR #631.

L7c8ana commented 4 months ago

If you use Docker and the last frame of the error points to linear.py, you can patch /opt/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py: at line 104 add self.device = device, and change line 116 from return F.linear(input, self.weight, self.bias) to return F.linear(input.to(self.device), self.weight.to(self.device), self.bias.to(self.device)).
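A self-contained variant of the same idea, as a runtime monkey-patch instead of editing the installed file (it uses the weight's device rather than storing device on the module; note that this only papers over the device mismatch, the proper fix is the Opacus change from PR #631):

    import torch.nn as nn
    import torch.nn.functional as F

    def _device_safe_linear_forward(self, input):
        # move the input (and bias) to wherever the weight lives before the matmul
        dev = self.weight.device
        bias = self.bias.to(dev) if self.bias is not None else None
        return F.linear(input.to(dev), self.weight, bias)

    # applies globally to every nn.Linear, including modules wrapped by Opacus
    nn.Linear.forward = _device_safe_linear_forward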