securefederatedai / openfl

An open framework for Federated Learning.
https://openfl.readthedocs.io/en/latest/index.html
Apache License 2.0

UnboundLocalError: local variable 'request' referenced before assignment #492

Open CasellaJr opened 1 year ago

CasellaJr commented 1 year ago

For my experiments on CIFAR-10, I have so far used a ResNet-18 or an EfficientNet. In the Jupyter notebook, I initialise these models in this way:

resnet18 = torchvision.models.resnet18(pretrained=False)
efficientnet_b0 = torchvision.models.efficientnet_b0(pretrained=False)
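For reference, changing the number of classes for CIFAR-10 usually just means replacing the classification head; a minimal sketch (not taken from the original notebook; the attribute names assume a recent torchvision version):

import torch.nn as nn

# Illustrative only: swap the default 1000-class heads for 10 CIFAR-10 classes.
resnet18.fc = nn.Linear(resnet18.fc.in_features, 10)
efficientnet_b0.classifier[1] = nn.Linear(efficientnet_b0.classifier[1].in_features, 10)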

They work without problems, even if I change normalization layers, the number of classes, and so on. The problem arises when I use a different model, such as VGG16: vgg16 = torchvision.models.vgg16(pretrained=False). Then I get the following error:

                        if request.model_proto:
                    UnboundLocalError: local variable 'request' referenced before assignment

This is the complete log of the error:

[18:18:37] INFO     🧿 Starting the Director Service.                                                                                             director.py:50
           INFO     Sample shape: ['32', '32', '3'], target shape: ['32', '32', '3']                                                              director.py:74
           INFO     Starting server on 0.0.0.0:50051                                                                                      director_server.py:114
                    '32', '3']}, 'is_online': True, 'is_experiment_running': False, 'valid_duration': 120, 'last_updated': 1661537603.6085486,
                    'experiment_name': None}, 'env_two': {'shard_info': {'node_info': {'name': 'env_two'}, 'shard_description': 'Cifar10
                    dataset, shard number 2 out of 10', 'sample_shape': ['32', '32', '3'], 'target_shape': ['32', '32', '3']}, 'is_online':
                    True, 'is_experiment_running': False, 'valid_duration': 120, 'last_updated': 1661537604.1536303, 'experiment_name': None},
                    'env_seven': {'shard_info': {'node_info': {'name': 'env_seven'}, 'shard_description': 'Cifar10 dataset, shard number 7 out
                    of 10', 'sample_shape': ['32', '32', '3'], 'target_shape': ['32', '32', '3']}, 'is_online': True, 'is_experiment_running':
                    False, 'valid_duration': 120, 'last_updated': 1661537605.1729639, 'experiment_name': None}, 'env_three': {'shard_info':
                    {'node_info': {'name': 'env_three'}, 'shard_description': 'Cifar10 dataset, shard number 3 out of 10', 'sample_shape':
                    ['32', '32', '3'], 'target_shape': ['32', '32', '3']}, 'is_online': True, 'is_experiment_running': False, 'valid_duration':
                    120, 'last_updated': 1661537605.725366, 'experiment_name': None}, 'env_eight': {'shard_info': {'node_info': {'name':
                    'env_eight'}, 'shard_description': 'Cifar10 dataset, shard number 8 out of 10', 'sample_shape': ['32', '32', '3'],
                    'target_shape': ['32', '32', '3']}, 'is_online': True, 'is_experiment_running': False, 'valid_duration': 120,
                    'last_updated': 1661537608.8607242, 'experiment_name': None}, 'env_ten': {'shard_info': {'node_info': {'name': 'env_ten'},
                    'shard_description': 'Cifar10 dataset, shard number 10 out of 10', 'sample_shape': ['32', '32', '3'], 'target_shape': ['32',
                    '32', '3']}, 'is_online': True, 'is_experiment_running': False, 'valid_duration': 120, 'last_updated': 1661537608.9310956,
                    'experiment_name': None}, 'env_nine': {'shard_info': {'node_info': {'name': 'env_nine'}, 'shard_description': 'Cifar10
                    dataset, shard number 9 out of 10', 'sample_shape': ['32', '32', '3'], 'target_shape': ['32', '32', '3']}, 'is_online':
                    True, 'is_experiment_running': False, 'valid_duration': 120, 'last_updated': 1661537609.453306, 'experiment_name': None}}
[18:15:11] INFO     SetNewExperiment request has got <grpc._cython.cygrpc._MessageReceiver object at 0x7f2b44dfbac0>                      director_server.py:132
[18:15:12] ERROR    Unexpected [UnboundLocalError] raised by servicer method [/openfl.director.Director/SetNewExperiment]                           events.py:81
                    Traceback (most recent call last):
                      File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 646, in grpc._cython.cygrpc._handle_exceptions
                      File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 759, in _handle_rpc
                      File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 602, in _handle_stream_unary_rpc
                      File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 368, in _finish_handler_with_unary_response
                      File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/transport/grpc/director_server.py", line 143, in
                    SetNewExperiment
                        if request.model_proto:
                    UnboundLocalError: local variable 'request' referenced before assignment
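(As a note on the error itself: this kind of UnboundLocalError usually means the variable is only assigned inside a loop or try block that never ran. A minimal, generic illustration, not OpenFL's actual servicer code, assuming the request is read from an incoming gRPC stream:)

# Generic illustration only, not the real OpenFL servicer: if the incoming
# stream yields no messages (for example because the client aborts the
# transfer), the loop body never runs and `request` is never bound.
async def set_new_experiment_sketch(stream):
    async for request in stream:  # never iterates if the stream is empty or aborted
        ...                       # accumulate the streamed experiment payload
    if request.model_proto:       # UnboundLocalError when the loop never ran
        ...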

The error also happens if I define the VGG16 class myself, in this way:

import torch
import torch.nn as nn

class VGG16(nn.Module):

    def __init__(self, num_classes):
        super(VGG16, self).__init__()

        # calculate same padding:
        # (w - k + 2*p)/s + 1 = o
        # => p = (s(o-1) - w + k)/2

        self.block_1 = nn.Sequential(
            nn.Conv2d(in_channels=1,
                      out_channels=64,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      # (1(32-1)- 32 + 3)/2 = 1
                      padding=1),
            nn.BatchNorm2d(64, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=64,
                      out_channels=64,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(64, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2),
                         stride=(2, 2))
        )

        self.block_2 = nn.Sequential(
            nn.Conv2d(in_channels=64,
                      out_channels=128,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(128, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=128,
                      out_channels=128,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(128, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2),
                         stride=(2, 2))
        )

        self.block_3 = nn.Sequential(
            nn.Conv2d(in_channels=128,
                      out_channels=256,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(256, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=256,
                      out_channels=256,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(256, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=256,
                      out_channels=256,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(256, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2),
                         stride=(2, 2))
        )

        self.block_4 = nn.Sequential(
            nn.Conv2d(in_channels=256,
                      out_channels=512,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(512, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=512,
                      out_channels=512,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(512, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=512,
                      out_channels=512,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(512, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2),
                         stride=(2, 2))
        ) 

        self.block_5 = nn.Sequential(
            nn.Conv2d(in_channels=512,
                      out_channels=512,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(512, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=512,
                      out_channels=512,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(512, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.Conv2d(in_channels=512,
                      out_channels=512,
                      kernel_size=(3, 3),
                      stride=(1, 1),
                      padding=1),
            nn.BatchNorm2d(512, eps=1e-05, momentum=0.9, affine=True, track_running_stats=True),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 2),
                         stride=(2, 2))
        )        

        self.classifier = nn.Sequential(
            nn.Linear(512*7*7, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes) 
        )

        for m in self.modules():
            if isinstance(m, torch.nn.Conv2d) or isinstance(m, torch.nn.Linear):
                nn.init.kaiming_uniform_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
                if m.bias is not None:
                    m.bias.detach().zero_()

    def forward(self, x):

        x = self.block_1(x)
        x = self.block_2(x)
        x = self.block_3(x)
        x = self.block_4(x)
        x = self.block_5(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

vgg16 = VGG16(10)

I remember that a few weeks ago this error also appeared with other networks I tried (maybe AlexNet or Inception, but I do not remember). It was not important at the time, so I skipped the problem, but now I want to use a different network, so I would like to solve it.

About the environment: I am using a real federation with 11 different machines.

igor-davidyuk commented 1 year ago

Let me try it myself

igor-davidyuk commented 1 year ago

I can confirm this. @CasellaJr, could you please report whether the error is also shown on the front end (in your interactive environment)?

CasellaJr commented 1 year ago

The error is shown on the terminal (the one used to start the director and the envoys). In the Jupyter notebook, I remember that the cell stops at fl_experiment.start(...).

igor-davidyuk commented 1 year ago

I imagine so, but what was the error?

CasellaJr commented 1 year ago

There is no error in the notebook; the cell just stops running.

igor-davidyuk commented 1 year ago

Investigation showed that the model is simply too big for the client to transfer. This will be fixed; for now, try using models whose layers are under 512 MB.
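For reference, one way to gauge this before launching an experiment is to sum the parameter sizes per module; a minimal sketch (not part of OpenFL):

import torchvision

# Rough per-module size check (bytes per parameter taken from the tensor itself),
# to see whether any block, or the whole model, approaches the message limit.
model = torchvision.models.vgg16(pretrained=False)

total_bytes = 0
for name, module in model.named_children():
    module_bytes = sum(p.numel() * p.element_size() for p in module.parameters())
    total_bytes += module_bytes
    print(f"{name}: {module_bytes / 1024**2:.1f} MB")
print(f"total: {total_bytes / 1024**2:.1f} MB")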

CasellaJr commented 1 year ago

Yes, I have seen your PR. If it works, I can modify my Python scripts to match yours.

CasellaJr commented 1 year ago

News on this?

igor-davidyuk commented 1 year ago

News on this?

The PR was merged. The message length limit was increased to 1 GB; I am not sure whether this change is on PyPI already. Did you try installing OpenFL from source with pip install -e . and running the same experiment?

igor-davidyuk commented 1 year ago

Installing from source will also let you increase the message length even further, as this setting did not find its way into the user settings and has to be changed here: https://github.com/intel/openfl/blob/develop/openfl/transport/grpc/grpc_channel_options.py
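For context, gRPC message limits are configured through channel and server options; the following is only a sketch of the kind of options involved (the actual names and values live in the grpc_channel_options.py file linked above):

import grpc

# Sketch only: the real constants are defined in
# openfl/transport/grpc/grpc_channel_options.py.
max_message_length = 1024 ** 3  # 1 GB

channel_options = [
    ('grpc.max_send_message_length', max_message_length),
    ('grpc.max_receive_message_length', max_message_length),
]

channel = grpc.insecure_channel('localhost:50051', options=channel_options)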

CasellaJr commented 1 year ago

I have not installed anything new :D I am on OpenFL 1.3. When I finish my experiments, I will update to 1.5 to run a new set of experiments.

igor-davidyuk commented 1 year ago

I have not installed anything new :D I am on OpenFL 1.3. When I finish my experiments, I will update to 1.5 to run a new set of experiments.

OK, consider closing this issue once you have confirmed that everything is working.

CasellaJr commented 1 year ago

Hello everyone, I am trying to use a VGG16 again. Now I have OpenFL with the above fix. Indeed, training starts, and I no longer get the previous error (if request.model_proto: UnboundLocalError: local variable 'request' referenced before assignment). However, after a few training rounds (20-30, more or less), I get this error:

Traceback (most recent call last):
  File "/usr/local/bin/fx", line 8, in <module>
    sys.exit(entry())
  File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 243, in entry
    error_handler(e)
  File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 186, in error_handler
    raise error
  File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 241, in entry
    cli()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/openfl/interface/collaborator.py", line 61, in start_
    plan.get_collaborator(collaborator_name).run()
  File "/usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 146, in run
    self.do_task(task, round_number)
  File "/usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 278, in do_task
    self.send_task_results(global_output_tensor_dict, round_number, task_name)
  File "/usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 433, in send_task_results
    self.client.send_local_task_results(
  File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_client.py", line 84, in wrapper
    response = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_client.py", line 95, in wrapper
    response = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_client.py", line 323, in send_local_task_results
    stream += utils.proto_to_datastream(request, self.logger)
  File "/usr/local/lib/python3.8/site-packages/openfl/protocols/utils.py", line 255, in proto_to_datastream
    chunk = npbytes[i: i + buffer_size]
MemoryError

in the collaborators, and the following in the aggregator:

ERROR    Exception calling application:                                                                                                _server.py:453
                    Traceback (most recent call last):
                      File "/usr/local/lib/python3.8/site-packages/grpc/_server.py", line 443, in _call_behavior
                        response_or_iterator = behavior(argument, context)
                      File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_server.py", line 212, in SendLocalTaskResults
                        proto = utils.datastream_to_proto(proto, request)
                      File "/usr/local/lib/python3.8/site-packages/openfl/protocols/utils.py", line 228, in datastream_to_proto
                        npbytes += chunk.npbytes
                    MemoryError
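Judging from these tracebacks, the failure happens while chunking and re-assembling the serialized model: proto_to_datastream slices the whole byte buffer into chunks and datastream_to_proto concatenates them back, so both sides have to hold the complete serialized model in memory. A simplified sketch of that pattern (not the exact OpenFL implementation):

# Simplified sketch of the chunking pattern visible in openfl/protocols/utils.py
# (not the exact implementation): the full serialized proto stays in memory and
# is sliced into fixed-size chunks on the sender, and the chunks are concatenated
# back into one buffer on the receiver, so a very large model can exhaust RAM.

def proto_to_chunks(npbytes: bytes, buffer_size: int = 2 ** 20):
    """Yield fixed-size slices of the serialized model bytes (each slice is a copy)."""
    for i in range(0, len(npbytes), buffer_size):
        yield npbytes[i: i + buffer_size]

def chunks_to_proto(chunks) -> bytes:
    """Re-assemble the chunks; each += copies the growing buffer again."""
    npbytes = b''
    for chunk in chunks:
        npbytes += chunk
    return npbytes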