Closed conjurer-Fan-Wu closed 7 months ago
It seems your model has not been successfully instantiated by "make_private". Thus, the "_forward_counter" has not been defined (https://github.com/pytorch/opacus/blob/95df0904ae5d2b3aaa26b708e5067e9271624036/opacus/grad_sample/gsm_base.py#L67). Furthermore, the error message shows "TypeError: PrivacyEngine.make_private() missing 1 required keyword-only argument: 'data_loader'", which might be the reason for the failed instantiation. Could you please fix that part first? thanks!
I have tested the code again, and eliminated the dataloader problem. But the above problem still exists.
##################################
runfile('/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py', wdir='/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3')
Experimental details: Model : cnn Optimizer : sgd Learning : 0.01 Global Rounds : 2
Federated parameters:
IID
Fraction of users : 0.9
Local Batch size : 64
Local Epochs : 5
global model: CNNMnist( (conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3)) (conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2)) (fc1): Linear(in_features=512, out_features=32, bias=True) (fc2): Linear(in_features=32, out_features=10, bias=True) ) global model: CNNMnist( (conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3)) (conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2)) (fc1): Linear(in_features=512, out_features=32, bias=True) (fc2): Linear(in_features=32, out_features=10, bias=True) ) 0%| | 0/2 [00:00<?, ?it/s] | Global Training Round : 2 |
/home/fanwu/.local/lib/python3.10/site-packages/opacus/privacy_engine.py:142: UserWarning: Secure RNG turned off. This is perfectly fine for experimentation as it allows for much faster training performance, but remember to turn it on and retrain one last time before production with secure_mode turned on.
warnings.warn(
/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/update.py:25: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
return torch.tensor(image), torch.tensor(label)
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec exec(code, globals, locals)
File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:244 w, loss, epsilon_idx = local_model.update_weights(args=args,
File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:79 in update_weights log_probs = model(images)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1568 in _call_impl result = forward_call(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:148 in forward return self._module(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527 in _call_impl return forward_call(*args, **kwargs)
File ~/work/pyproject/basictest/FL_testmine/src_v3/models.py:49 in forward x = F.relu(self.conv1(x)) # -> [B, 16, 14, 14]
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1581 in _call_impl hook_result = hook(self, args, result)
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'
This thread (https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049/2) should solve this issue. Please lmk whether it works :)
No, that does not work.
Traceback (most recent call last):
File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec exec(code, globals, locals)
File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:246 w, loss, epsilon_idx = local_model.update_weights(args=args,
File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:76 in update_weights model = GradSampleModule(model)
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:141 in __init__ self.add_hooks(
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:191 in add_hooks raise ValueError("Trying to add hooks twice to the same model")
ValueError: Trying to add hooks twice to the same model
Could you link the latest code? I did not find it in your drive. From the error, it seems you are trying to privatize a model that has already been privatized. Possibly you forgot to unprivatize the model at the end of client training (self.model = model.to_standard_module()).
Sorry for the late file update. Now the files are in the google drive: https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1
I changed update.py by adding model = GradSampleModule(model) in the update_weights function, before training begins, and this error occurs. I expected the change to make at least some difference. To tell the truth, my code is based on FedAvg (https://github.com/AshwinRJ/Federated-Learning-PyTorch), and its structure feels very different from the Opacus examples. I have tried for several days, and every change I made has failed.
Is there any reason not to call "model.to_standard_module()", as suggested in (https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049/2)? Note that this call reverts the privatized model to a non-private model, which avoids privatizing the same model twice.
As I mentioned, the reason you see this hook error is that you are privatizing the same model twice, and thus adding the same hooks twice.
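The wrap/unwrap lifecycle described here can be sketched in plain Python. This is a toy stand-in, not Opacus code: `ToyModel`, `ToyGradSampleWrapper`, and `client_round` are hypothetical names, and the guard only mimics the "Trying to add hooks twice to the same model" error from the traceback. Only the method name `to_standard_module` comes from this thread.

```python
class ToyModel:
    """Hypothetical stand-in for a model shared across federated rounds."""

    def __init__(self):
        self.hooked = False  # True while a wrapper has hooks attached


class ToyGradSampleWrapper:
    """Hypothetical wrapper that mimics the double-hook guard."""

    def __init__(self, model):
        if model.hooked:
            raise ValueError("Trying to add hooks twice to the same model")
        model.hooked = True
        self._module = model

    def to_standard_module(self):
        # Detach the hooks and hand back the plain model.
        self._module.hooked = False
        return self._module


def client_round(model, unwrap):
    wrapped = ToyGradSampleWrapper(model)  # privatize at the start of local training
    # ... local training on the wrapped model would happen here ...
    return wrapped.to_standard_module() if unwrap else model


shared = ToyModel()
shared = client_round(shared, unwrap=True)
shared = client_round(shared, unwrap=True)  # fine: hooks were removed in between

leaky = ToyModel()
leaky = client_round(leaky, unwrap=False)
try:
    client_round(leaky, unwrap=False)  # second wrap without unwrapping first
    second_wrap_error = None
except ValueError as exc:
    second_wrap_error = str(exc)
print(second_wrap_error)
```

The point of the sketch is the ordering: wrap at the start of each client's training, unwrap at the end, so the next round starts from a plain model.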
My suggestion is as follows:
Generally speaking, what you need to do is
Thanks for your kind response. I think I now understand the architecture a little better. I modified the code following your advice (https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1). However, a new problem has appeared. I cannot find any difference from the example on GitHub: the module and optimizer are constructed the same way as in the example, but the error below still occurs.
Traceback (most recent call last):
File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec exec(code, globals, locals)
File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:235 w, loss, epsilon_idx = local_model.update_weights(args=args,
File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:67 in update_weights model, optimizer, train_loader = privacy_engine.make_private(
File ~/.local/lib/python3.10/site-packages/opacus/privacy_engine.py:393 in make_private raise ValueError(
ValueError: Module parameters are different than optimizer Parameters
Maybe you can define a new optimizer in "update.py", instead of re-using the existing one. One example is "optimizer = torch.optim.SGD(model.parameters(),lr=0.01,momentum=0,weight_decay=0)" in "FederatedLearningClient.py" in https://discuss.pytorch.org/t/error-when-trying-federated-learning-with-opacus/153049
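One way such a mismatch can arise, sketched in plain Python: an optimizer keeps references to the parameter objects it was built from, so after a model is deep-copied for a client, an optimizer built from the original no longer points at the copy's parameters. `ToyParam`, `ToyOptimizer`, and `params_match` are hypothetical stand-ins, and the identity-based check is an assumption about how such a comparison could work, not the library's actual code.

```python
import copy


class ToyParam:
    """Hypothetical stand-in for a model parameter."""

    def __init__(self, value):
        self.value = value


class ToyOptimizer:
    """Hypothetical optimizer: keeps references to the parameters it was given."""

    def __init__(self, params):
        self.params = list(params)


def params_match(model_params, optimizer):
    # Assumed check: every optimizer parameter must be one of the model's own objects.
    model_ids = {id(p) for p in model_params}
    return all(id(p) in model_ids for p in optimizer.params)


global_params = [ToyParam(0.1), ToyParam(0.2)]
stale_optimizer = ToyOptimizer(global_params)  # built once from the global model

local_params = copy.deepcopy(global_params)    # each client trains on a copy
fresh_optimizer = ToyOptimizer(local_params)   # built from the copy itself

print(params_match(local_params, stale_optimizer))  # False
print(params_match(local_params, fresh_optimizer))  # True
```

Building the optimizer from the same model object that is about to be privatized, inside the client update, keeps the two in sync.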
Thanks for your response. I have changed the code as you said. However, there is still an error coming from the Opacus library:
Traceback (most recent call last):
File ~/.local/lib/python3.10/site-packages/spyder_kernels/py3compat.py:356 in compat_exec exec(code, globals, locals)
File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:235 w, loss, epsilon_idx = local_model.update_weights(args=args,
File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:80 in update_weights epsilon = privacy_engine.accountant.get_epsilon(delta=args.delta)
File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:97 in get_epsilon dprv = self._get_dprv(eps_error=eps_error, delta_error=delta_error)
File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:114 in _get_dprv domain = self._get_domain(
File ~/.local/lib/python3.10/site-packages/opacus/accountants/prv.py:150 in _get_domain return Domain.create_aligned(-L, L, mesh_size)
File ~/.local/lib/python3.10/site-packages/opacus/accountants/analysis/prv/domain.py:31 in create_aligned size = int(np.round((t_max - t_min) / dt)) + 1
ValueError: cannot convert float NaN to integer
What is the delta value you are using? It is possible the delta value is too small. For PRV, we only support the case when delta > 1e-6.
Another potential fix is to move "privacy_engine.accountant.get_epsilon" to the end of the loop. This avoids the case where, in the first iteration, the accountant is asked for epsilon before the model has been updated.
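The last frame of the traceback fails while converting a rounded value to an integer. The plain-Python sketch below reproduces that failure mode with a hypothetical helper, `mesh_points`, which only imitates the shape of the expression in the traceback. How the bound becomes NaN in the real accountant is not shown in this thread, so the NaN here is injected by hand as an assumption.

```python
import math


def mesh_points(t_min, t_max, dt):
    """Number of grid points on [t_min, t_max] with spacing dt (hypothetical helper)."""
    return int(round((t_max - t_min) / dt)) + 1


print(mesh_points(-1.0, 1.0, 0.5))  # 5

bound = float("nan")  # assumption: stands in for a bound that could not be computed
print(math.isnan(bound))  # True

try:
    mesh_points(-bound, bound, 0.5)
    nan_error = None
except ValueError as exc:
    nan_error = str(exc)
print(nan_error)
```

Any NaN that reaches the grid-size computation surfaces as this ValueError, so the useful question is which input to the epsilon computation was invalid or missing at the time of the call.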
Thanks for your patient help. I modified the code according to your suggestion by moving "privacy_engine.accountant.get_epsilon" to the end of the loop. (https://drive.google.com/drive/folders/1hxmZZzZtKZ78ohYmHx41OC_DugFm0Zv1)
All parameter values are the same as in the Opacus MNIST example. But when the program runs, the loss in each epoch quickly becomes negative and never converges. I checked the whole process again, but I do not know why this happens. I tried changing lr to 0.05 or 0.01, but it made no difference.
There are many possible reasons for a loss to be negative. For example, NLLLoss expects log-probabilities as input (each value at most 0, e.g. the output of log_softmax) (https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html). However, from reading your model setup, that might not be the case.
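A minimal plain-Python sketch of how a negative-log-likelihood loss can go negative: the loss is the negated input at the target index, so it is non-negative only when the inputs are log-probabilities, for instance from a log-softmax. The helper names and the score values are made up for illustration; this is not the code from the linked drive.

```python
import math


def log_softmax(scores):
    """Convert raw scores to log-probabilities (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll(inputs, target):
    """Negative log-likelihood for one sample: minus the input at the target index."""
    return -inputs[target]


raw_scores = [4.0, 1.0, -2.0]  # unnormalized model outputs (made-up values)
target = 0

loss_raw = nll(raw_scores, target)              # negative, and unbounded below
loss_ok = nll(log_softmax(raw_scores), target)  # non-negative

print(loss_raw)        # -4.0
print(loss_ok >= 0.0)  # True
```

With raw scores the optimizer can push the loss down without limit by inflating the target score, which matches a loss that turns negative and never converges.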
Since the original error was not a "bug", and the discussion is moving away from Opacus, I am closing the issue.
🐛 Bug
The error seems to be in the opacus library:
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'
Please reproduce using our template Colab and post here the link
I use Google Drive to store the files. federated_main.py is the main file, which I run in Spyder. All .py files are in the src_v3 folder.
https://drive.google.com/drive/folders/1inWFXO0fPoKygi8rJSzUcJLr-jFVoLxb?usp=sharing
To Reproduce
Traceback (most recent call last):
File /usr/local/lib/python3.10/dist-packages/spyder_kernels/py3compat.py:356 in compat_exec exec(code, globals, locals)
File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:237 model0, optimizer0, train_loader = privacy_engine.make_private(
TypeError: PrivacyEngine.make_private() missing 1 required keyword-only argument: 'data_loader'
runfile('/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py', wdir='/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3') Reloaded modules: options, update, models, sampling, utils
Experimental details: Model : cnn Optimizer : sgd Learning : 0.01 Global Rounds : 2
global model: CNNMnist( (conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3)) (conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2)) (fc1): Linear(in_features=512, out_features=32, bias=True) (fc2): Linear(in_features=32, out_features=10, bias=True) ) global model: CNNMnist( (conv1): Conv2d(1, 16, kernel_size=(8, 8), stride=(2, 2), padding=(3, 3)) (conv2): Conv2d(16, 32, kernel_size=(4, 4), stride=(2, 2)) (fc1): Linear(in_features=512, out_features=32, bias=True) (fc2): Linear(in_features=32, out_features=10, bias=True) ) 0%| | 0/2 [00:00<?, ?it/s] | Global Training Round : 1 |
/home/fanwu/work/pyproject/basictest/FL_testmine/src_v3/update.py:25: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). return torch.tensor(image), torch.tensor(label) 0%| | 0/2 [00:01<?, ?it/s] Traceback (most recent call last):
File /usr/local/lib/python3.10/dist-packages/spyder_kernels/py3compat.py:356 in compat_exec exec(code, globals, locals)
File ~/work/pyproject/basictest/FL_testmine/src_v3/federated_main.py:245 w, loss, epsilon_idx = local_model.update_weights(args=args,
File ~/work/pyproject/basictest/FL_testmine/src_v3/update.py:79 in update_weights log_probs = model(images)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1568 in _call_impl result = forward_call(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:148 in forward return self._module(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527 in _call_impl return forward_call(*args, **kwargs)
File ~/work/pyproject/basictest/FL_testmine/src_v3/models.py:49 in forward x = F.relu(self.conv1(x)) # -> [B, 16, 14, 14]
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518 in _wrapped_call_impl return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1581 in _call_impl hook_result = hook(self, args, result)
File ~/.local/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py:288 in capture_activations_hook p._forward_counter += 1
AttributeError: 'Parameter' object has no attribute '_forward_counter'
Expected behavior
The program should at least run without errors.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with: