pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

Error for capture_activations_hook in grad_sample_module.py #608

Closed conjurer-Fan-Wu closed 7 months ago

conjurer-Fan-Wu commented 8 months ago

🐛 Bug

The error seems to be in the Opacus library:

File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook
    p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'

Please reproduce using our template Colab and post here the link

I use Google Drive to store the files. federated_main.py is the main file, which I run with Spyder. All .py files are in the src_v3 folder.

https://drive.google.com/drive/folders/1inWFXO0fPoKygi8rJSzUcJLr-jFVoLxb?usp=sharing

To Reproduce

:warning: We cannot help you without you sharing reproducible code. Do not ignore this part :)

Steps to reproduce the behavior:

  1. Run federated_main directly

Traceback (most recent call last):
  File /usr/local/lib/python3.10/dist-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:237
    model0, optimizer0, train_loader = privacy_engine.make_private(
TypeError: PrivacyEngine.make_private() missing 1 required keyword-only argument: 'data_loader'

runfile('/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py', wdir='/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3')
Reloaded modules: options, update, models, sampling, utils

Experimental details:
    Model : cnn
    Optimizer : sgd
    Learning : 0.01
    Global Rounds : 2

Federated parameters:
IID
Fraction of users  : 0.9
Local Batch size   : 64
Local Epochs       : 5

global model: CNNMnist(
  (conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3))
  (conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2))
  (fc1): Linear(in_features=512, out_features=32, bias=True)
  (fc2): Linear(in_features=32, out_features=10, bias=True)
)

| Global Training Round : 1 |

/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/update.py:25: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(image), torch.tensor(label)

Traceback (most recent call last):
  File /usr/local/lib/python3.10/dist-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:245
    w, loss, epsilon_idx = local_model.update_weights(args=args,
  File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:79 in update_weights
    log_probs = model(images)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1568 in _call_impl
    result = forward_call(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:148 in forward
    return self._module(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527 in _call_impl
    return forward_call(*args, **kwargs)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/models.py:49 in forward
    x = F.relu(self.conv1(x))  # -> [B, 16, 14, 14]
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1581 in _call_impl
    hook_result = hook(self, args, result)
  File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook
    p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'

Expected behavior

The program should at least run normally.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:


wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
By the way, I use Ubuntu 22.04 with Python 3.10.12 and Opacus 1.4.0.

[pip3] flake8==6.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] numpydoc==1.5.0
[pip3] torch==2.1.0
[pip3] torchinfo==1.8.0
[pip3] torchvision==0.15.2
[pip3] triton==2.1.0
[conda] Could not collect

HuanyuZhang commented 8 months ago

It seems your model has not been successfully wrapped by make_private, so _forward_counter was never defined (https://github.com/pytorch/opacus/blob/95df0904ae5d2b3aaa26b708e5067e9271624036/opacus/grad_sample/gsm_base.py#L67). Furthermore, the earlier error message, "TypeError: PrivacyEngine.make_private() missing 1 required keyword-only argument: 'data_loader'", is likely the reason the wrapping failed. Could you please fix that part first? Thanks!
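
For reference, a minimal sketch of a complete make_private call, using the variable names from the traceback above (the noise_multiplier and max_grad_norm values are placeholders, not taken from the actual code):

```python
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine()
# All make_private arguments are keyword-only, so data_loader must be
# passed by name; omitting it raises the TypeError seen above.
model0, optimizer0, train_loader = privacy_engine.make_private(
    module=model0,
    optimizer=optimizer0,
    data_loader=train_loader,
    noise_multiplier=1.1,  # placeholder value
    max_grad_norm=1.0,     # placeholder value
)
```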

conjurer-Fan-Wu commented 7 months ago

I have tested the code again and fixed the data_loader problem, but the error above still occurs.

##################################

runfile('/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py', wdir='/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3')

Experimental details:
    Model : cnn
    Optimizer : sgd
    Learning : 0.01
    Global Rounds : 2

Federated parameters:
IID
Fraction of users  : 0.9
Local Batch size   : 64
Local Epochs       : 5

global model: CNNMnist(
  (conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3))
  (conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2))
  (fc1): Linear(in_features=512, out_features=32, bias=True)
  (fc2): Linear(in_features=32, out_features=10, bias=True)
)

| Global Training Round : 2 |

/home/fanwu/.local/lib/python3.10/site-packages/opacus/privacy_engine.py:142: UserWarning: Secure RNG turned off. This is perfectly fine for experimentation as it allows for much faster training performance, but remember to turn it on and retrain one last time before production with secure_mode turned on.
  warnings.warn(
/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/update.py:25: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(image), torch.tensor(label)

Traceback (most recent call last):
  File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:244
    w, loss, epsilon_idx = local_model.update_weights(args=args,
  File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:79 in update_weights
    log_probs = model(images)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1568 in _call_impl
    result = forward_call(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:148 in forward
    return self._module(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527 in _call_impl
    return forward_call(*args, **kwargs)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/models.py:49 in forward
    x = F.relu(self.conv1(x))  # -> [B, 16, 14, 14]
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1581 in _call_impl
    hook_result = hook(self, args, result)
  File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook
    p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'

HuanyuZhang commented 7 months ago

This thread (https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049/2) should solve this issue. Please let me know whether it works :)

conjurer-Fan-Wu commented 7 months ago

No, that does not work.

Traceback (most recent call last):
  File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:246
    w, loss, epsilon_idx = local_model.update_weights(args=args,
  File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:76 in update_weights
    model = GradSampleModule(model)
  File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:141 in __init__
    self.add_hooks(
  File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:191 in add_hooks
    raise ValueError("Trying to add hooks twice to the same model")
ValueError: Trying to add hooks twice to the same model

HuanyuZhang commented 7 months ago

Could you link the latest code? (I did not find it in your drive.) From the error, it seems you are trying to privatize a model that has already been privatized. Possibly you forgot to un-privatize the model at the end of client training (self.model = model.to_standard_module()).

conjurer-Fan-Wu commented 7 months ago

Sorry for the late file update. The files are now in Google Drive: https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1

I changed update.py, adding model = GradSampleModule(model) in the update_weights function before training begins, and the error above happens; at the very least that change should have made some difference. To tell the truth, my code is based on FedAvg (https://github.com/AshwinRJ/Federated-Learning-PyTorch), and I feel its structure is very different from the Opacus examples. I have tried for several days and all of my changes failed.

HuanyuZhang commented 7 months ago

Any reason not to call model.to_standard_module(), as suggested by https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049/2? Note that this call reverts the privatized model to a non-private model, avoiding privatizing the same model twice.

As I mentioned, the reason you see this hook error is that you are privatizing the same model twice, thus adding the same hooks twice.

My suggestion is as follows:

  1. Remove privacy_engine.make_private from federated_main.py and move it to update.py.
  2. Remove GradSampleModule from update.py.
  3. In update.py, instead of "return model.state_dict()", use "return model.to_standard_module().state_dict()".

Generally speaking, what you need to do is:

  1. On the server side, keep only non-private models. This gives you the freedom to change model parameters by aggregation.
  2. On the client side, the client first receives the non-private model, then calls the privacy engine to privatize it and runs DP-SGD. Finally, the client returns the model parameters of the non-private model (see the sketch below).
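
A minimal sketch of this client-side flow, assuming FedAvg-style names (update_weights, args.delta, and args match the tracebacks in this thread; args.local_ep, args.lr, and the noise/clipping values are assumptions, not taken from the actual code):

```python
import torch
from opacus import PrivacyEngine

def update_weights(model, train_loader, args):
    # The client receives the plain (non-private) global model from the server.
    criterion = torch.nn.NLLLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr, momentum=0)

    # Privatize on the client, once per round, never on the server.
    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=1.1,  # assumed value
        max_grad_norm=1.0,     # assumed value
    )

    for _ in range(args.local_ep):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    epsilon = privacy_engine.accountant.get_epsilon(delta=args.delta)
    # Revert before returning, so the server only ever sees plain weights.
    return model.to_standard_module().state_dict(), loss.item(), epsilon
```
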
conjurer-Fan-Wu commented 7 months ago

Thanks for your kind response. I think I understand the architecture a little better now. I modified the code according to your help (https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1). However, a new problem happens. I cannot find any difference from the example on GitHub: the model and optimizer construction is the same as in the example, but the error still occurs.

Traceback (most recent call last):
  File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:235
    w, loss, epsilon_idx = local_model.update_weights(args=args,
  File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:67 in update_weights
    model, optimizer, train_loader = privacy_engine.make_private(
  File ~/.local/lib/python3.10/site-packages/opacus/privacy_engine.py:393 in make_private
    raise ValueError(
ValueError: Module parameters are different than optimizer Parameters

HuanyuZhang commented 7 months ago

Maybe you can define a new optimizer in update.py instead of re-using the existing one. One example is optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0, weight_decay=0) in FederatedLearningClient.py from https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049
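
Continuing the earlier client-side sketch, this is the ordering that avoids the ValueError (global_weights is a hypothetical name for the aggregated state dict received from the server):

```python
# Load the server's weights into the local model first, then build a fresh
# optimizer over exactly those parameters: make_private checks that the
# optimizer's parameters are the same objects as the module's parameters.
model.load_state_dict(global_weights)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0, weight_decay=0)
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # assumed value
    max_grad_norm=1.0,     # assumed value
)
```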

conjurer-Fan-Wu commented 7 months ago

Thanks for your response. I have changed the code as you said. However, there is still a problem coming from the Opacus library:

Traceback (most recent call last):
  File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)
  File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:235
    w, loss, epsilon_idx = local_model.update_weights(args=args,
  File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:80 in update_weights
    epsilon = privacy_engine.accountant.get_epsilon(delta=args.delta)
  File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:97 in get_epsilon
    dprv = self._get_dprv(eps_error=eps_error, delta_error=delta_error)
  File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:114 in _get_dprv
    domain = self._get_domain(
  File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:150 in _get_domain
    return Domain.create_aligned(-L, L, mesh_size)
  File ~/.local/lib/python3.10/site-packages/opacus/accountants/analysis/prv/domain.py:31 in create_aligned
    size = int(np.round((t_max - t_min) / dt)) + 1
ValueError: cannot convert float NaN to integer

HuanyuZhang commented 7 months ago

What delta value are you using? It is possible the delta is too small; for the PRV accountant, we only support delta > 1e-6.

Another potential fix is to move privacy_engine.accountant.get_epsilon to the end of the loop. This avoids the case where, in the first iteration, the accountant fetches epsilon before the model has been updated.
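
A fragment of that second fix, continuing the client-side sketch above (the loop structure and the delta value are assumptions for illustration):

```python
for epoch in range(local_epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Query the accountant only after at least one optimizer step has been
# recorded, and keep delta above 1e-6 when using the PRV accountant.
epsilon = privacy_engine.accountant.get_epsilon(delta=1e-5)
```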

conjurer-Fan-Wu commented 7 months ago

Thanks for your patient help. I modified the code according to your suggestion and moved privacy_engine.accountant.get_epsilon to the end of the loop. (https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1)

All the parameter values are the same as in the Opacus MNIST example, but when the program runs, the loss in each epoch quickly becomes negative, without any convergence. I have checked the whole process again, but I do not know why this happens. I tried changing lr to 0.05 or 0.01, but neither helped.

HuanyuZhang commented 7 months ago

There are many possibilities for a loss to be negative. For example, the input of NLLLoss should be log-probabilities, i.e. the output of log_softmax, which is always <= 0 (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html); feeding it raw probabilities in (0, 1) makes the loss negative. That said, from reading your model setup, this might not be your case.
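
For instance (a fragment, with model, images, and labels as in the earlier sketch):

```python
import torch.nn.functional as F

logits = model(images)                    # raw scores from the last Linear layer
log_probs = F.log_softmax(logits, dim=1)  # log-probabilities, all <= 0
loss = F.nll_loss(log_probs, labels)      # -log_probs[target], always >= 0

# Feeding probabilities instead of log-probabilities makes the loss
# -probs[target], which is negative and will not converge sensibly:
probs = F.softmax(logits, dim=1)
bad_loss = F.nll_loss(probs, labels)      # negative
```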

Since the original error was not a bug in Opacus, and we are moving away from Opacus-specific topics, I will close the issue.