Benchmarking Track's WP4: What strategies do we benchmark ?

jeandut commented 2 years ago

For now we need a minima:

FedAvg
FedProx
Scaffold
Local training/Pooled training
Cyclical

Given time we could add as well:

Second order methods (FedNewton, etc.)
FedDyn
FedDANE
ProxSkip

philipco commented 2 years ago

Hello,

We agree that FedAvg, FedProx and Scaffold seem to be a good set of simple baselines of interest. To our understanding, as our goal is not yet to achieve an optimized performance on the datasets, but possibly to have high level take-aways, these original methods seem best suited.

Regarding potential other baselines, maybe the question can be phrased in the following way:

a. What are the solutions that have been proposed to tackle heterogeneity in FL? b. Which one deserves to be considered as baselines in our framework and why?

Regarding a., we think beyond the ones you suggested, the following methods are intended to tackle heterogeneity: Fednova, MIME, FedCD, FedAdams/FedAdagrad/FedYogi (Adaptive Federated Optimization). This is a preliminary list that should probably be updated.

Regarding b., we see the following potential arguments:

is widely recognized as a reference for heterogeneous FL. [Could be based on #citations, or your expertise]
was used as a reference in a similar paper. [e.g. Federated Learning on Non-IID Data Silos: An Experimental Study (https://arxiv.org/pdf/2102.02079.pdf) uses FedNova]
substantially differs in terms of approach from FedProx (prox term) or Scaffold (control variates), or exhibits a feature that allows to somehow ``describe’’ heterogeneity [as control variates for Scaffold]
Other ideas?

Cheers, Constantin and Aymeric

philipco commented 2 years ago

To initiate the discussion, we propose below a table (that anyone can participate in editing) with some algorithms meant to tackle heterogeneous FL-algorithms.

	# citations	Designed for non-i.i.d. data	Major features
FedAvg https://arxiv.org/pdf/1602.05629.pdf	4504	NO	1) local update 2) weighted average
MOCHA https://proceedings.neurips.cc/paper/7029-federated-multi-task-learning	855	YES	Alternating optimization of model weights and task relationship matrix
FedProx https://arxiv.org/pdf/1812.06127.pdf	807	YES	1) generalization of FedAvg 2) add a proximal term 3) restrict local update to be close to the initial (global) model
Scaffold https://arxiv.org/pdf/1910.06378.pdf	356	YES	1) correct client drift 2) control-variates
FedAdams/FedYogi/FedAdagrad https://arxiv.org/abs/2003.00295	248	YES	1) federated versions of adaptive optimizer
Cyclical Weight Transfer https://pubmed.ncbi.nlm.nih.gov/29617797/	178	KINDOF	ensures that each client is sufficiently visited
Clustered Federated Learning https://ieeexplore.ieee.org/abstract/document/9174890	178	YES	Clusters silos after FL has converged
FedNova https://arxiv.org/pdf/2007.07481.pdf	140	YES	1) focus on heterogeneous number of local updates 2) FedProx and FedAvg as part. cases 3) flexibility to choose any local solver
Ditto https://proceedings.mlr.press/v139/li21h.html	62	YES	Each node trains a local model at the same time as everyone jointly trains the global model. At each round, the distance from the current state of the evolving global model is used as a regularization term in the local training.
MIME https://arxiv.org/pdf/2008.03606.pdf	47	YES	1) correct client drift 2) control-variates 3) server-level optimizer state (momentum, adaptive step size) 4) for cross-device settings
FedCD https://arxiv.org/pdf/2006.09637.pdf	10	YES	1) clones and deletes models to dynamically group devices with similar data

FEEL FREE TO UPDATE THE ABOVE TABLE.

Again, the focus of the paper being the datasets, it may just be sufficient to consider FedAvg, Scaffold and FedProx.

Constantin and Aymeric

bellet commented 2 years ago

The provided list sounds good and natural.

In the future one could also consider personalized FL approaches? For instance based on fine-tuning/MAML (eg FedAvg+ https://arxiv.org/pdf/1909.12488.pdf) or regularization to the mean (https://arxiv.org/pdf/2010.02372.pdf) which are both closely related to simple FedAvg. There are also popular Federated MTL approaches based on pairwise regularization or cluster/mixture assumptions.

Grim-bot commented 2 years ago

I thought about adding Ditto https://proceedings.mlr.press/v139/li21h.html (58 citations) to philipco's table above, but their experiments are mainly cross-device (100+ devices). Nevertheless, they also report good results on the Vehicle dataset (23 devices) - is that too many to be considered cross-silo ?

jeandut commented 2 years ago

The provided list sounds good and natural.

In the future one could also consider personalized FL approaches? For instance based on fine-tuning/MAML (eg FedAvg+ https://arxiv.org/pdf/1909.12488.pdf) or regularization to the mean (https://arxiv.org/pdf/2010.02372.pdf) which are both closely related to simple FedAvg. There are also popular Federated MTL approaches based on pairwise regularization or cluster/mixture assumptions.

We will keep that in mind it shouldn't be very hard to modify the code in that sense. FLamby was thought of to be extensible.

I thought about adding Ditto https://proceedings.mlr.press/v139/li21h.html (58 citations) to philipco's table above, but their experiments are mainly cross-device (100+ devices). Nevertheless, they also report good results on the Vehicle dataset (23 devices) - is that too many to be considered cross-silo ?

Personally I put the threshold at 50 but it was never formalized maybe this vehicle dataset is worth looking at ? Do they use natural splits ?

bellet commented 2 years ago

For information on vehicle and some other datasets with natural splits but that are more 'cross-device' than 'cross-silo' in spirit, see page 20 of http://researchers.lille.inria.fr/abellet/papers/aistats20_graph_supp.pdf The school dataset may be of interest but has 140 centers (schools)

Grim-bot commented 2 years ago

Personally I put the threshold at 50 but it was never formalized maybe this vehicle dataset is worth looking at ? Do they use natural splits ?

If you put the limit at 50, then maybe we could consider Ditto as a strategy to include in the benchmark. It seems to work pretty well and they tested it on a data set that we can interpret as cross-silo. I added it to the table of strategies above and re-ordered that table by descending citation count.
After reading the description in the vehicle dataset paper, I believe each "node" in the data set corresponds to one physical sensor. To create this data set, they put a bunch of seismic and acoustic sensors on the ground next to a road and then drove by with various types of military vehicles. The data set is from 2004. It might not be the most interesting, well-suited, and up-to-date data set we can find. One positive about this dataset is that it's time series data from two types of sensors, so that would make FLamby more diverse...

jeandut commented 2 years ago

@Grim-bot the vehicle dataset was already mentioned in the related works in the overleaf. We should have already enough datasets with @pmangold adding this one + @sssilvar and @AyedSamy working on IXI plus @regloeb working on TCGA-survival.

jeandut commented 2 years ago

ProxSkip was accepted at ICML might be worth implementing to get some hype, but closing in the mean time.

owkin / FLamby

Benchmarking Track's WP4: What strategies do we benchmark ? #11