CasellaJr opened this issue 1 year ago
Let me try it myself
I can confirm this. @CasellaJr, could you please report whether the error is also shown on the front end (in your interactive environment)?
The error is shown in the terminal (used to start the director and envoys). In the Jupyter notebook, I recall the cell stops at fl_experiment.start(...)
I imagine, but what was the error?
No error in the notebook; the cell just stops running.
Investigation showed that the model is just too big for the client to transfer. It will be fixed; for now, try using models whose layers are under 512 MB.
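Since it is not obvious from a model definition which layers approach that limit, a quick way to check is to multiply each layer's parameter count by its element size. A minimal sketch in plain Python (the shapes below are illustrative, not taken from any model in this thread):

```python
# Rough per-layer size estimate: parameter count * bytes per element
# (float32 = 4 bytes). Shapes here are illustrative; with a real PyTorch
# model you would read them from model.named_parameters() instead.
LIMIT = 512 * 1024 * 1024  # the ~512 MB per-layer limit mentioned above

def layer_bytes(shape, dtype_size=4):
    n = 1
    for dim in shape:
        n *= dim
    return n * dtype_size

layers = {
    "conv1.weight": (64, 3, 3, 3),         # a small conv layer
    "classifier.0.weight": (4096, 25088),  # a VGG16-style first FC layer
}

for name, shape in layers.items():
    size = layer_bytes(shape)
    print(f"{name}: {size / 2**20:.1f} MiB (over limit: {size > LIMIT})")
```

With an actual model, `p.numel() * p.element_size()` for each `name, p` in `model.named_parameters()` gives the same numbers.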
Yes, I have seen your PR. If it works, I can modify my Python scripts to match yours.
News on this?
The PR was merged; the message length was increased to 1 GB. I am not sure whether this change is on PyPI already.
Did you try installing OpenFL from source with pip install -e . and running the same experiment?
Installing from source will also let you increase the message length even further, since the setting did not make it into user-facing configuration and has to be changed here: https://github.com/intel/openfl/blob/develop/openfl/transport/grpc/grpc_channel_options.py
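For reference, gRPC caps message sizes through channel/server options. The exact names used in grpc_channel_options.py may differ, but the underlying gRPC option keys are grpc.max_send_message_length and grpc.max_receive_message_length (values in bytes, -1 for unlimited). A hedged sketch, not OpenFL's actual code:

```python
# gRPC message-size options (values in bytes; -1 means unlimited).
# These are standard gRPC option keys; how OpenFL names and wires them in
# grpc_channel_options.py may differ from this sketch.
max_message_length = 1024 ** 3  # 1 GB, matching the merged PR

channel_options = [
    ("grpc.max_send_message_length", max_message_length),
    ("grpc.max_receive_message_length", max_message_length),
]

# These would be passed when creating the channel or server, e.g.:
#   channel = grpc.insecure_channel(address, options=channel_options)
#   server = grpc.server(executor, options=channel_options)
```

Both sides of the connection need the raised limits; increasing only the client's send limit still fails if the server's receive limit stays at the default.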
I have not installed anything new :D I am on OpenFL 1.3. When I finish my current experiments, I will update to 1.5 to run a new set of experiments.
Ok, consider closing this issue once you confirm everything is working.
Hello everyone
I am trying to use a VGG16 again, now with a version of OpenFL that includes the fix above. Training does start, and I no longer get the previous error (if request.model_proto: UnboundLocalError: local variable 'request' referenced before assignment).
However, after a few training rounds (roughly 20-30), I get this error:
Traceback (most recent call last):
File "/usr/local/bin/fx", line 8, in <module>
sys.exit(entry())
File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 243, in entry
error_handler(e)
File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 186, in error_handler
raise error
File "/usr/local/lib/python3.8/site-packages/openfl/interface/cli.py", line 241, in entry
cli()
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/openfl/interface/collaborator.py", line 61, in start_
plan.get_collaborator(collaborator_name).run()
File "/usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 146, in run
self.do_task(task, round_number)
File "/usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 278, in do_task
self.send_task_results(global_output_tensor_dict, round_number, task_name)
File "/usr/local/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 433, in send_task_results
self.client.send_local_task_results(
File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_client.py", line 84, in wrapper
response = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_client.py", line 95, in wrapper
response = func(self, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_client.py", line 323, in send_local_task_results
stream += utils.proto_to_datastream(request, self.logger)
File "/usr/local/lib/python3.8/site-packages/openfl/protocols/utils.py", line 255, in proto_to_datastream
chunk = npbytes[i: i + buffer_size]
MemoryError
in the collaborators, and the following in the aggregator:
ERROR Exception calling application: _server.py:453
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/grpc/_server.py", line 443, in _call_behavior
response_or_iterator = behavior(argument, context)
File "/usr/local/lib/python3.8/site-packages/openfl/transport/grpc/aggregator_server.py", line 212, in SendLocalTaskResults
proto = utils.datastream_to_proto(proto, request)
File "/usr/local/lib/python3.8/site-packages/openfl/protocols/utils.py", line 228, in datastream_to_proto
npbytes += chunk.npbytes
MemoryError
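Both MemoryErrors point at the chunking path: npbytes[i: i + buffer_size] copies each slice on the collaborator side, and npbytes += chunk.npbytes on the aggregator side repeatedly reallocates an immutable bytes object, so a large model multiplies its own memory footprint. A hedged sketch of a lower-overhead pattern, not OpenFL's actual implementation:

```python
# Sketch only: shows why slicing bytes and using += on bytes is costly
# for large payloads, and a cheaper alternative. Not OpenFL's actual code.

def iter_chunks(npbytes, buffer_size):
    # memoryview slices are zero-copy views, unlike bytes slices,
    # so chunking does not duplicate the whole payload.
    view = memoryview(npbytes)
    for i in range(0, len(view), buffer_size):
        yield view[i:i + buffer_size]

def reassemble(chunks):
    # bytearray grows with amortized O(n) appends; doing += on a bytes
    # object instead copies the accumulated buffer on every chunk.
    buf = bytearray()
    for chunk in chunks:
        buf += chunk
    return bytes(buf)

data = bytes(range(256)) * 1000  # 256 KB of sample payload
assert reassemble(iter_chunks(data, 4096)) == data
```

Even with such a pattern, the full serialized model still has to fit in memory at least once on each side, so a sufficiently large model can fail regardless.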
For my experiments on CIFAR-10, until now I have used ResNet-18 or an EfficientNet. In the jupyter notebook, I initialise these models in this way:
They work without problems, even if I change normalization layers, the number of classes, and so on. The problem arises when I use a different model, such as VGG16:
vgg16 = torchvision.models.vgg16(pretrained=False)
then I have the following error. This is the complete log of the error:
The error also happens if I create the VGG16 class in this way:
I remember that some weeks ago this error also appeared with other networks I tried (maybe AlexNet or Inception, I do not remember exactly). It was not important at the time, so I skipped the problem, but now I want to use a different network and would like to solve it.
For the environment: I am using a real federation with 11 different machines.
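For context on why VGG16 trips message-size limits while ResNet-18 does not: VGG16 has about 138M parameters (roughly 528 MiB in float32), dominated by its first fully connected layer, while ResNet-18 has about 11.7M (roughly 45 MiB). A back-of-the-envelope check in plain Python, using the published torchvision parameter counts:

```python
# Back-of-the-envelope model sizes in float32 (4 bytes per parameter).
BYTES = 4

# VGG16's first FC layer maps 512*7*7 = 25088 inputs to 4096 outputs (+ bias).
vgg_fc1 = 25088 * 4096 + 4096        # 102,764,544 params in one layer
vgg_total = 138_357_544              # published VGG16 parameter count
resnet18_total = 11_689_512          # published ResNet-18 parameter count

print(f"VGG16 fc1 alone: {vgg_fc1 * BYTES / 2**20:.0f} MiB")
print(f"VGG16 total:     {vgg_total * BYTES / 2**20:.0f} MiB")
print(f"ResNet-18 total: {resnet18_total * BYTES / 2**20:.0f} MiB")
```

So a single serialized VGG16 update already exceeds the old default message limit by a wide margin, which is consistent with ResNet-18 and EfficientNet working while VGG16 fails.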