Closed · jeandut closed this 2 years ago
Hello,
We agree that FedAvg, FedProx and Scaffold seem to be a good set of simple baselines of interest. To our understanding, since our goal is not yet to achieve optimized performance on the datasets but rather to draw high-level take-aways, these original methods seem best suited.
Regarding other potential baselines, maybe the question can be phrased in the following way:
a. What are the solutions that have been proposed to tackle heterogeneity in FL?
b. Which ones deserve to be considered as baselines in our framework, and why?
Regarding a., beyond the ones you suggested, we think the following methods are intended to tackle heterogeneity: FedNova, MIME, FedCD, and FedAdam/FedAdagrad/FedYogi (Adaptive Federated Optimization). This is a preliminary list that should probably be updated.
Regarding b., we see the following potential arguments: the method's popularity (citation count), whether it was designed for non-i.i.d. data, and its major features. These are the criteria used in the table below.
Cheers, Constantin and Aymeric
To initiate the discussion, we propose below a table (that anyone can participate in editing) with some algorithms meant to tackle heterogeneity in FL.
| Method | # citations | Designed for non-i.i.d. data | Major features |
|---|---|---|---|
| FedAvg https://arxiv.org/pdf/1602.05629.pdf | 4504 | NO | 1) local updates 2) weighted averaging |
| MOCHA https://proceedings.neurips.cc/paper/7029-federated-multi-task-learning | 855 | YES | alternating optimization of model weights and task relationship matrix |
| FedProx https://arxiv.org/pdf/1812.06127.pdf | 807 | YES | 1) generalization of FedAvg 2) adds a proximal term 3) restricts the local update to stay close to the initial (global) model |
| Scaffold https://arxiv.org/pdf/1910.06378.pdf | 356 | YES | 1) corrects client drift 2) control variates |
| FedAdam/FedYogi/FedAdagrad https://arxiv.org/abs/2003.00295 | 248 | YES | federated versions of adaptive optimizers |
| Cyclical Weight Transfer https://pubmed.ncbi.nlm.nih.gov/29617797/ | 178 | KIND OF | ensures that each client is sufficiently visited |
| Clustered Federated Learning https://ieeexplore.ieee.org/abstract/document/9174890 | 178 | YES | clusters silos after FL has converged |
| FedNova https://arxiv.org/pdf/2007.07481.pdf | 140 | YES | 1) focuses on heterogeneous numbers of local updates 2) includes FedProx and FedAvg as particular cases 3) flexibility to choose any local solver |
| Ditto https://proceedings.mlr.press/v139/li21h.html | 62 | YES | each node trains a local model while everyone jointly trains the global model; at each round, the distance to the current state of the evolving global model is used as a regularization term in the local training |
| MIME https://arxiv.org/pdf/2008.03606.pdf | 47 | YES | 1) corrects client drift 2) control variates 3) server-level optimizer state (momentum, adaptive step size) 4) targets cross-device settings |
| FedCD https://arxiv.org/pdf/2006.09637.pdf | 10 | YES | clones and deletes models to dynamically group devices with similar data |
FEEL FREE TO UPDATE THE ABOVE TABLE.
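To make the mechanics of the three retained baselines concrete, here is a minimal PyTorch sketch of their update rules. This is not FLamby code; the helper names and the value of `mu` are illustrative assumptions, not values from the papers.

```python
# Minimal sketches of the FedAvg / FedProx / SCAFFOLD update rules
# discussed in the table above. Helper names are illustrative.
import copy
import torch


def fedavg_aggregate(client_models, client_num_samples):
    """FedAvg server step: average client weights, weighted by dataset size.

    Assumes all state-dict entries are floating-point tensors.
    """
    total = sum(client_num_samples)
    new_state = copy.deepcopy(client_models[0].state_dict())
    for key in new_state:
        new_state[key] = sum(
            m.state_dict()[key].float() * (n / total)
            for m, n in zip(client_models, client_num_samples)
        )
    return new_state


def fedprox_loss(task_loss, local_model, global_model, mu=0.1):
    """FedProx client objective: task loss + (mu / 2) * ||w - w_global||^2.

    The proximal term is what restricts the local update to stay close
    to the initial (global) model.
    """
    prox = sum(
        torch.sum((w - w_g.detach()) ** 2)
        for w, w_g in zip(local_model.parameters(), global_model.parameters())
    )
    return task_loss + 0.5 * mu * prox


def scaffold_sgd_step(local_model, lr, c_global, c_local):
    """SCAFFOLD client step: SGD corrected by the control variates
    (c_global - c_local), which is what 'corrects client drift' above.

    `c_global` / `c_local` are lists of tensors shaped like the parameters;
    call this after `loss.backward()` so that `.grad` is populated.
    """
    with torch.no_grad():
        for w, cg, cl in zip(local_model.parameters(), c_global, c_local):
            w -= lr * (w.grad + cg - cl)
```

With these pieces, one round amounts to broadcasting the global weights, running local steps (with `fedprox_loss` or `scaffold_sgd_step` instead of plain SGD where appropriate), and calling `fedavg_aggregate` on the server.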
Again, since the focus of the paper is the datasets, it may be sufficient to consider only FedAvg, Scaffold and FedProx.
Constantin and Aymeric
The provided list sounds good and natural.
In the future, one could also consider personalized FL approaches, for instance based on fine-tuning/MAML (e.g. FedAvg+ https://arxiv.org/pdf/1909.12488.pdf) or on regularization to the mean (https://arxiv.org/pdf/2010.02372.pdf), which are both closely related to plain FedAvg; a short sketch of the fine-tuning variant follows below. There are also popular federated MTL approaches based on pairwise regularization or cluster/mixture assumptions.
I thought about adding Ditto https://proceedings.mlr.press/v139/li21h.html (58 citations) to philipco's table above, but their experiments are mainly cross-device (100+ devices). Nevertheless, they also report good results on the Vehicle dataset (23 devices). Is that too many devices to be considered cross-silo?
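As a concrete illustration of the fine-tuning-style personalization mentioned above (in the spirit of FedAvg+), here is a minimal PyTorch sketch; the function name and hyperparameters are hypothetical, not taken from the paper.

```python
# Sketch of fine-tuning-based personalization: after federated training
# converges, each client adapts a copy of the global model with a few
# local gradient steps. Names and hyperparameters are illustrative.
import copy
import itertools

import torch


def personalize_by_finetuning(global_model, local_loader, loss_fn,
                              num_steps=10, lr=1e-3):
    """Return a per-client model obtained by briefly fine-tuning the
    converged global model on that client's own data."""
    local_model = copy.deepcopy(global_model)
    local_model.train()
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    for X, y in itertools.islice(local_loader, num_steps):
        optimizer.zero_grad()
        loss_fn(local_model(X), y).backward()
        optimizer.step()
    return local_model
```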
> The provided list sounds good and natural.
> In the future, one could also consider personalized FL approaches, for instance based on fine-tuning/MAML (e.g. FedAvg+ https://arxiv.org/pdf/1909.12488.pdf) or on regularization to the mean (https://arxiv.org/pdf/2010.02372.pdf), which are both closely related to plain FedAvg. There are also popular federated MTL approaches based on pairwise regularization or cluster/mixture assumptions.
We will keep that in mind; it shouldn't be very hard to modify the code in that sense. FLamby was designed to be extensible.
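For what it's worth, the kind of extension point this requires can be sketched generically. The class and attribute names below are purely hypothetical and are not FLamby's actual API; the point is only that a per-client hook after the global rounds suffices to plug in personalization.

```python
class FedAvgStrategy:
    """Hypothetical base strategy; NOT FLamby's actual API."""

    def run_round(self, global_model, clients):
        # Local updates followed by weighted averaging, as in the table above.
        raise NotImplementedError

    def finalize(self, global_model, clients):
        # Default: every client keeps the shared global model.
        return {client.id: global_model for client in clients}


class FineTunedFedAvg(FedAvgStrategy):
    """Personalized variant: each client fine-tunes its own copy at the end."""

    def finalize(self, global_model, clients):
        # Reuses the personalize_by_finetuning sketch from the previous
        # comment; `client.loader` and `client.loss_fn` are assumed fields.
        return {
            client.id: personalize_by_finetuning(
                global_model, client.loader, client.loss_fn
            )
            for client in clients
        }
```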
> I thought about adding Ditto https://proceedings.mlr.press/v139/li21h.html (58 citations) to philipco's table above, but their experiments are mainly cross-device (100+ devices). Nevertheless, they also report good results on the Vehicle dataset (23 devices). Is that too many devices to be considered cross-silo?
Personally I put the threshold at 50, but it was never formalized. Maybe this Vehicle dataset is worth looking at? Do they use natural splits?
For information on Vehicle and some other datasets with natural splits that are more 'cross-device' than 'cross-silo' in spirit, see page 20 of http://researchers.lille.inria.fr/abellet/papers/aistats20_graph_supp.pdf. The School dataset may be of interest but has 140 centers (schools).
> Personally I put the threshold at 50, but it was never formalized. Maybe this Vehicle dataset is worth looking at? Do they use natural splits?
@Grim-bot the Vehicle dataset was already mentioned in the related works in the Overleaf. We should already have enough datasets, with @pmangold adding this one, @sssilvar and @AyedSamy working on IXI, and @regloeb working on TCGA-survival.
ProxSkip was accepted at ICML and might be worth implementing to get some hype, but I am closing this issue in the meantime.
For now we need, at a minimum: FedAvg, FedProx and Scaffold.
Given time we could also add: FedNova, MIME, FedAdam/FedAdagrad/FedYogi, FedCD and ProxSkip.