globus-labs / FLoX-prototype

Python library for serverless Federated Learning experiments.
Apache License 2.0

Add support for PyTorch #15

Open nikita-kotsehub opened 2 years ago

nikita-kotsehub commented 2 years ago
nikita-kotsehub commented 1 year ago

Started working on this in #23.

nikita-kotsehub commented 1 year ago

Added in #25.

nikita-kotsehub commented 1 year ago

The current PyTorch example (flox/examples/quickstart_pytorch/pytorch_funcx.py on #23) supports CIFAR10 and should, in principle, support every other dataset in torchvision.datasets. However, it does not work with other datasets, because of either (1) the Net model class defined in the example or (2) the training and data processing methods defined in flox/model_trainers/PyTorchTrainer. I'd appreciate it if someone with experience in PyTorch could look into it and make the PyTorch trainer dynamic so that it works with any dataset.

@nathaniel-hudson
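One possible direction for cause (1), as a minimal sketch: if the example's Net hard-codes the number of input channels and the flattened feature size (an assumption, since the class is not shown in this thread), it will only accept CIFAR10-shaped inputs. The hypothetical `FlexibleNet` below takes the channel and class counts as arguments and uses adaptive pooling so the linear layer no longer depends on the image size. `infer_shape` is also hypothetical and assumes the dataset yields `(tensor, label)` pairs and exposes a `classes` attribute, as CIFAR10 and MNIST do; not every torchvision dataset does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FlexibleNet(nn.Module):
    """Small CNN that is not tied to one image size or channel count (illustrative only)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 6, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, padding=2)
        # Adaptive pooling always emits a 4x4 map, whatever the input resolution.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.fc = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        return self.fc(torch.flatten(x, 1))


def infer_shape(dataset):
    """Return (channels, number of classes) from the first sample.

    Assumes samples are (C x H x W tensor, label) pairs and that the
    dataset has a `classes` attribute.
    """
    image, _ = dataset[0]
    return image.shape[0], len(dataset.classes)
```

With a dataset loaded through `transforms.ToTensor()`, the model could then be built as `FlexibleNet(*infer_shape(dataset))`. Whether the trainer's data processing also needs changes (cause 2) is not addressed by this sketch.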

vinaBira commented 1 year ago

I tried running the quickstart_pytorch.py file from the tutorial. It does not work and fails with this error:

    File "quickstart_pytorch.py", line 134, in <module>
      main()
    File "quickstart_pytorch.py", line 131, in main
      flox_controller.run_federated_learning()
    File "/home/cloudlabgpu1/FLoX/flox/controllers/MainController.py", line 565, in run_federated_learning
      results = self.on_model_receive(
    File "/home/cloudlabgpu1/FLoX/flox/controllers/MainController.py", line 426, in on_model_receive
      res = task_data.future.result()
    File "/usr/lib/python3.8/concurrent/futures/_base.py", line 437, in result
      return self.__get_result()
    File "/usr/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
      raise self._exception
    File "/usr/lib/python3.8/concurrent/futures/thread.py", line 57, in run
      result = self.fn(*self.args, **self.kwargs)
    File "/home/cloudlabgpu1/FLoX/flox/clients/MainClient.py", line 119, in run_round
      fit_results = self.on_model_fit(model_trainer, config, processed_training_data)
    File "/home/cloudlabgpu1/FLoX/flox/clients/PyTorchClient.py", line 60, in on_model_fit
      model_weights = model_trainer.fit(training_data, config)
    File "/home/cloudlabgpu1/FLoX/flox/model_trainers/PyTorchTrainer.py", line 41, in fit
      outputs = self.model(images)
    File "/home/cloudlabgpu1/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "quickstart_pytorch.py", line 79, in forward
      x = self.pool(F.relu(self.conv1(x)))
    File "/home/cloudlabgpu1/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
      return forward_call(*input, **kwargs)
    File "/home/cloudlabgpu1/.venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 457, in forward
      return self._conv_forward(input, self.weight, self.bias)
    File "/home/cloudlabgpu1/.venv/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
      return F.conv2d(input, weight, bias, self.stride,
    RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

@nikita-kotsehub Is this the same problem that this issue is about?
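
For context on the RuntimeError in this comment: it says the input batch is a CUDA tensor while the convolution weights are still CPU tensors, which usually means the images were moved to the GPU but the model was not. The sketch below shows the common remedy of moving the model and the batch to the same device before the forward pass. It is not the PyTorchTrainer code; `fit_one_batch` is a hypothetical helper, and the optimizer and loss are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn


def fit_one_batch(model, images, labels, lr=0.01):
    """Run one training step with model and data on the same device (illustrative only)."""
    # Pick the GPU when one is available, otherwise stay on the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the weights first, then the batch, so both live on `device`.
    model = model.to(device)
    images, labels = images.to(device), labels.to(device)

    # Create the optimizer after the move so it tracks the moved parameters.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

If weights are later sent back for aggregation, they may need to be moved to the CPU again before serialization; whether FLoX requires that is not stated in this thread.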