securefederatedai / openfl

An open framework for Federated Learning.
https://openfl.readthedocs.io/en/latest/index.html
Apache License 2.0

Assert subkey == 'step' #490

Open CasellaJr opened 2 years ago

CasellaJr commented 2 years ago

Hello. I get the error below when using SGD as the optimizer. With Adam it works correctly.

optimizer = optim.Adam(params_to_update, lr=1e-4)
#optimizer = optim.AdamW(params_to_update, lr=0.001, weight_decay=0.02)
#optimizer = optim.SGD(params_to_update, lr=0.01)

Basically, in this setting it works, but if I comment out Adam and uncomment SGD, then I get this error:

[11:07:13] INFO     Using Interactive Python API                                                                                             collaborator.py:237
           ERROR    Collaborator failed with error: :                                                                                                envoy.py:93
                    Traceback (most recent call last):
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/envoy/envoy.py", line 91, in run
                        self._run_collaborator()
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/envoy/envoy.py", line 164, in
                    _run_collaborator
                        col.run()
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 145,
                    in run
                        self.do_task(task, round_number)
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 255,
                    in do_task
                        global_output_tensor_dict, local_output_tensor_dict = func(
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/federated/task/task_runner.py", line 108, in
                    collaborator_adapted_task
                        self.rebuild_model(input_tensor_dict, validation=validation_flag, device=device)
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/federated/task/task_runner.py", line 229, in
                    rebuild_model
                        self.set_tensor_dict(input_tensor_dict, with_opt_vars=True, device=device)
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/federated/task/task_runner.py", line 381, in
                    set_tensor_dict
                        return self.framework_adapter.set_tensor_dict(*args, **kwargs)
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/plugins/frameworks_adapters/pytorch_adapter.py",
                    line 55, in set_tensor_dict
                        _set_optimizer_state(optimizer, device, tensor_dict)
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/plugins/frameworks_adapters/pytorch_adapter.py",
                    line 70, in _set_optimizer_state
                        temp_state_dict = expand_derived_opt_state_dict(
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/plugins/frameworks_adapters/pytorch_adapter.py",
                    line 236, in expand_derived_opt_state_dict
                        assert subkey == 'step'
                    AssertionError
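
For context, the assertion that fails here sits in expand_derived_opt_state_dict, which (per the frames above) rebuilds the optimizer state from the flattened tensor dict before it is loaded back into the optimizer. In plain PyTorch, Adam's per-parameter state contains a 'step' counter (plus 'exp_avg' and 'exp_avg_sq'), while plain SGD's state does not, which is consistent with Adam passing the check and SGD tripping it. A quick standalone check, outside OpenFL and with arbitrary model and data shapes, illustrates the difference:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randn(8, 2)

for make_opt in (lambda p: optim.Adam(p, lr=1e-4),
                 lambda p: optim.SGD(p, lr=0.01)):
    optimizer = make_opt(model.parameters())
    optimizer.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    optimizer.step()
    # Adam: per-parameter keys are 'step', 'exp_avg', 'exp_avg_sq'
    # SGD (momentum=0): only 'momentum_buffer' (None on torch>=1.8.0, per the logs below)
    state = optimizer.state_dict()['state']
    print(type(optimizer).__name__, {k: sorted(v) for k, v in state.items()})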
ishant162 commented 1 year ago

Here is my analysis of the issue:

Issue Reproduction:

The issue is observed with torch>=1.8.0 but not with torch<=1.7.1.

torch<=1.7.1

Logs:

       INFO     #### opt_state_dict {'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0,      pytorch_adapter.py:121
                'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': []}]} ####

       WARNING  tried to remove tensor: __opt_state_needed not present in the tensor dict                                                       utils.py:170

       INFO     #### opt_state_dict {'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0,      pytorch_adapter.py:121
                'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': []}]} ####  

torch>=1.8.0

Logs:

      INFO     #### opt_state_dict {'state': {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}}, 'param_groups': [{'lr':   pytorch_adapter.py:120
                0.0001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None,                            
                'differentiable': False, 'params': [0, 1]}]} ####           

       WARNING  tried to remove tensor: __opt_state_needed not present in the tensor dict                                                       utils.py:172

       INFO     #### opt_state_dict {'state': {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}}, 'param_groups': [{'lr':   pytorch_adapter.py:120
                0.0001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None,                            
                'differentiable': False, 'params': [0, 1]}]} ####  

Additional Information: The issue is not observed if momentum != 0.
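
This matches what plain PyTorch reports: with momentum=0 the SGD state holds only a momentum_buffer entry equal to None, whereas a non-zero momentum produces a real buffer tensor. The None buffer appears to be what trips up the adapter's derive/expand round trip, which would explain why a non-zero momentum avoids the issue. A small sketch (arbitrary model and data, not OpenFL code) showing the difference on torch>=1.8.0:

import torch
import torch.nn as nn
import torch.optim as optim

def sgd_state(momentum):
    model = nn.Linear(4, 2)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=momentum)
    nn.functional.mse_loss(model(torch.randn(8, 4)), torch.randn(8, 2)).backward()
    optimizer.step()
    return optimizer.state_dict()['state']

# momentum=0: {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}}, as in the logs above
print(sgd_state(momentum=0))
# momentum=0.9: momentum_buffer holds real tensors
print(sgd_state(momentum=0.9))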

Next Steps: I will raise a PR to fix this issue.

CasellaJr commented 1 year ago

Thank you for the answer. So, for now, the only fix is to change the momentum?

ishant162 commented 1 year ago

Yes, for now we can work around this by giving momentum a non-zero value.
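
For reference, applied to the snippet from the original report, the interim workaround would look like the line below (0.9 is only an example value and does change the training dynamics; the learning rate is the one from the original SGD line):

#optimizer = optim.Adam(params_to_update, lr=1e-4)
optimizer = optim.SGD(params_to_update, lr=0.01, momentum=0.9)  # non-zero momentum avoids the assertion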