securefederatedai / openfl

An Open Framework for Federated Learning.
https://openfl.readthedocs.io/en/latest/index.html
Apache License 2.0

Saving and loading the trained model after the end of a federated experiment #1139

Open enrico310786 opened 1 week ago

enrico310786 commented 1 week ago

Hi,

I made some experiments of federated learning using the tutorial PyTorch_TinyImageNet at this link: https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/interactive_api/PyTorch_TinyImageNet

Everything works fine. I have one director and two envoys. The director is on one server and the two envoys are on two different servers. During the training I see the accuracy growing.

My questions are: 1) Where are the best model and its weights saved? 2) In which format? 3) How can I load the model and use it in inference mode?

I noted that in the workspace folder, once the experiment has ended, there is a file called "model_obj.pkl". I load the file as follows:

import pickle

path_model_pkl = "model_obj.pkl"
with open(path_model_pkl, 'rb') as f:
    model_interface = pickle.load(f)
model = model_interface.model

but, if I apply this model to the images of the test set of one of the two envoys, I do not obtain results compatible with a trained model. So, I think this is not the best trained model. Where is it stored at the end of the experiment?
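For reference, this is roughly how I evaluate the loaded model on the test images of one envoy (a sketch; test_loader is a placeholder for my actual test DataLoader):

import torch

model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:  # placeholder DataLoader over the envoy's test set
        outputs = model(images)
        predictions = outputs.argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print("test accuracy:", correct / total)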

Thanks

kta-intel commented 1 week ago

Hi @enrico310786 !

Short answer: You can access the best model with:

best_model = fl_experiment.get_best_model()
best_model.state_dict()

Then save it in its native torch format (e.g. .pt or .pth) and use it for inference as you normally would.
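For example, something along these lines should work (a minimal sketch; model_net stands for whatever architecture you passed to the ModelInterface, and test_inputs is a placeholder batch):

import torch

# Pull the aggregated best model from the finished experiment
best_model = fl_experiment.get_best_model()

# Save the weights in native PyTorch format
torch.save(best_model.state_dict(), 'best_model.pth')

# Later: rebuild the same architecture and load the weights for inference
model_net.load_state_dict(torch.load('best_model.pth'))
model_net.eval()
with torch.no_grad():
    predictions = model_net(test_inputs)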

Long answer: OpenFL's Interactive API is actually being deprecated. There are active efforts to consolidate our APIs, and while the director/envoy concept will likely still exist in some capacity, for now it is recommended that you use either the Task Runner API (quickstart), where the model is saved as a .pbuf in your workspace and can be converted to its native format with fx model save, or the Workflow API (quickstart), which gives you the flexibility to decide how you want to handle your model at each stage.

enrico310786 commented 1 week ago

Hi @kta-intel,

thank you for your answer. About the long answer, I will try to move my workflow to the Task Runner API or the Workflow API. About the short answer, I noted the following facts:

1) Having defined the model architecture for a regression task as

import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(13, 150)
        self.fc2 = nn.Linear(150, 50)
        self.fc3 = nn.Linear(50, 1) 

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model_net = SimpleNN()

and the ModelInterface with

framework_adapter = 'openfl.plugins.frameworks_adapters.pytorch_adapter.FrameworkAdapterPlugin'
model_interface = ModelInterface(model=model_net, optimizer=optimizer, framework_plugin=framework_adapter)
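For completeness, the optimizer passed to the ModelInterface is just a standard PyTorch optimizer over the model parameters, e.g. something like this (illustrative values, not necessarily the exact ones I used):

import torch.optim as optim

# Example optimizer over the model parameters; the learning rate is only illustrative
optimizer = optim.Adam(model_net.parameters(), lr=1e-3)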

I also defined an initial_model object as initial_model = deepcopy(model_net); printing the weights of initial_model, I obtain

tensor([[-0.0021, 0.1488, -0.2283, ..., -0.0838, -0.0545, -0.2650], [-0.1837, -0.1143, 0.0103, ..., 0.2075, -0.0447, 0.0293], [ 0.2511, -0.2573, -0.1746, ..., -0.1619, 0.2384, 0.1238], ..., [-0.2398, 0.2194, -0.1492, ..., -0.1561, -0.0217, 0.2169], [ 0.0238, 0.1927, -0.0021, ..., 0.1863, 0.0120, 0.1169], [ 0.1160, -0.2394, -0.2438, ..., 0.2573, 0.2502, -0.1769]]).....

2) Once the experiment is finished, printing the weights of fl_experiment.get_best_model(), I obtain the same weights as the initial_model. Shouldn't the weights be different, since the model is now trained? Furthermore, if I use fl_experiment.get_best_model() on the test set of one envoy, I obtain bad results (high MSE and low R^2). All these facts indicate to me that fl_experiment.get_best_model() is not the best trained model.
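This is roughly the check I run to compare the two sets of weights (sketched here, comparing the state dicts parameter by parameter):

import torch

best_state = fl_experiment.get_best_model().state_dict()
init_state = initial_model.state_dict()

# Compare each parameter tensor of the "best" model against the initial copy
for name, init_tensor in init_state.items():
    identical = torch.equal(init_tensor, best_state[name])
    print(name, "unchanged" if identical else "changed")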

3) If instead I use fl_experiment.get_last_model(), the weights are now different with respect to the initial_model:

tensor([[ 2.7470, 1.7603, 0.4984, ..., 0.4626, -2.6358, -1.7808], [ 2.5191, -0.3715, 2.2026, ..., -0.9344, -0.8067, -0.1721], [ 0.7177, -0.7920, 0.3306, ..., -1.0026, -1.1008, -0.2933], ..., [ 1.1725, -0.4773, 1.3435, ..., -1.2107, -0.7849, -0.0271], [ 2.8732, 0.3654, 1.6125, ..., -0.7965, -0.7755, -0.0415], [ 4.6579, 0.2192, -0.2842, ..., -1.1465, -1.3399, -0.7404]])....

and applying fl_experiment.get_last_model() on the test set of one envoy, I obtain good results (low MSE and high R^2). But I think that fl_experiment.get_last_model() is simply the model from the final round, not the best one.

Why does fl_experiment.get_best_model() give me the initial_model weights and not those of the best one?

Thanks again, Enrico

kta-intel commented 1 week ago

Thanks for the investigation.

Hmm, this might be a bug (or at the very least, insufficient checkpointing logic). My suspicion is that the Interactive API backend is using a loss criterion on the validation step to select the best model, but since the validate function measures accuracy, it marks the higher value as worse and does not save it. On the other hand, as you said, .get_last_model() is just the running set of weights from the latest round, so the training is reflected, albeit not necessarily the best state.
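To illustrate the suspicion with a toy example (purely hypothetical logic, not the actual OpenFL source): if the aggregator applies a "lower is better" comparison to an accuracy-style metric, the initial model never gets replaced:

# Hypothetical sketch of the suspected selection logic (NOT the OpenFL source):
# treating an accuracy metric as if it were a loss means "higher is worse".
validation_accuracy_per_round = [0.10, 0.35, 0.52, 0.61]  # made-up example values

best_score = validation_accuracy_per_round[0]
best_round = 0
for round_idx, acc in enumerate(validation_accuracy_per_round):
    if acc < best_score:           # lower-is-better check applied to accuracy
        best_score = acc
        best_round = round_idx     # never triggers, so round 0 stays "best"

print("round selected as best:", best_round)  # -> 0, i.e. the initial weights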

This is actually more of an educated guess based on similar issues in the past. I have to dive a bit deeper to confirm, though.

teoparvanov commented 1 week ago

Hi @enrico310786, as @kta-intel mentioned earlier, we are in the process of deprecating Interactive API. Is the issue also reproducible with Task Runner API, or Workflow API?

enrico310786 commented 1 week ago

Hi @teoparvanov, I don't know. So far I have only used the Interactive API.

teoparvanov commented 4 days ago

@enrico310786, here are a couple of resources to get you started with the main OpenFL APIs:

Please keep us posted on how this is going. The Slack #onboarding channel is a good place to get additional support.