2. Test ELM performance with different numbers of data points, and make a plot of "test performance vs. amount of training data"

akusok commented 10 months ago

The goal of federated learning is to build a better model by using more data (from other organizations). If the model does not improve with more data, there is no point in building a federated learning system.

Here we will test how the model performance improves with more data. We test 2 scenarios:

Randomly distributed data. Not very realistic but a typical setup in experiments.
Non-randomly distributed data. Very realistic: for example, hospitals in different countries will have different types of patients and different equipment, so their patient data is not "randomly distributed". We will simulate a non-random distribution by splitting geographical data by its coordinates.
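
A minimal sketch of the two scenarios above, assuming the data set has a coordinate array. The names `random_split`, `geographic_split` and `coords` are made up for illustration, the coordinates are synthetic, and cutting along the first coordinate is only one possible way to split by geography.

```python
import numpy as np


def random_split(n_samples, n_clients, rng):
    """Assign samples to clients uniformly at random."""
    idx = rng.permutation(n_samples)
    return np.array_split(idx, n_clients)


def geographic_split(coords, n_clients):
    """Assign samples to clients as contiguous bands of the first coordinate."""
    order = np.argsort(coords[:, 0], kind="stable")
    return np.array_split(order, n_clients)


rng = np.random.default_rng(0)
# Synthetic (longitude, latitude) pairs, only for the example.
coords = rng.uniform(low=[-120.0, 30.0], high=[-110.0, 40.0], size=(1000, 2))

random_parts = random_split(len(coords), 4, rng)
geo_parts = geographic_split(coords, 4)

for i, part in enumerate(geo_parts):
    lon = coords[part, 0]
    print(f"client {i}: {len(part)} samples, longitude {lon.min():.2f} to {lon.max():.2f}")
```

Each entry of `random_parts` and `geo_parts` is an index array, so the same training loop can be run on both splits.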

The idea is to start from a small number of training data points, then add more training data and check how it affects the model performance. We can take 50 data samples and add 50 more samples at a time. Two plots: randomly distributed data and non-randomly geographically distributed data.

Steps:

[ ] (@akusok) Get an example graph and setup
[ ] (@tamiratGit) Repeat the experiments many times to get stable average scores and their confidence intervals.
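
One possible way to get the averages and confidence intervals from repeated runs, sketched with NumPy only. The helper `mean_and_ci` is hypothetical, the scores are synthetic, and the interval uses a normal approximation (z = 1.96 for roughly 95%), which is an assumption; with few repeats a t-based or bootstrap interval may be more appropriate.

```python
import numpy as np


def mean_and_ci(scores, z=1.96):
    """Mean and normal-approximation confidence interval over repeats.

    scores has shape (n_repeats, n_training_sizes).
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation.
    sem = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])
    return mean, mean - z * sem, mean + z * sem


rng = np.random.default_rng(0)
# Synthetic accuracies: 30 repeats for 5 training-set sizes.
scores = rng.normal(loc=np.linspace(0.6, 0.9, 5), scale=0.02, size=(30, 5))

mean, lower, upper = mean_and_ci(scores)
for m, lo, hi in zip(mean, lower, upper):
    print(f"{m:.3f} [{lo:.3f}, {hi:.3f}]")
```

The three arrays can go straight into a plot as a line with a shaded band.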

akusok commented 10 months ago

Idea: find out how good or bad a model each client can create if they don’t share data. This will be the baseline before we look into the Federated ELM. It should help us explain why federated ELM is so good for clients with little data, and actually measure how good it is in terms of accuracy improvement.

Separately for each “client”

[x] take only training and test sets from that client
[x] find the best ELM parameters for EACH number of training points
[x] start training ELM with 10 training samples, add 10 more at a time until we use all training samples
[x] compute test score on this client’s test set
[ ] repeat several times with random order of training points
[x] make a plot of performance vs number of training data
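
The checklist above could look roughly like this for a single client. This is a sketch under assumptions, not the notebook code: `elm_accuracy` is a hypothetical minimal ELM (random tanh hidden layer, ridge-regularized output weights), the data is two synthetic Gaussian blobs, and the parameters are fixed at 20 neurons and L2 = 1e-3 instead of being tuned per training size.

```python
import numpy as np


def elm_accuracy(X_train, y_train, X_test, y_test, n_neurons, alpha, rng):
    """Train a minimal ELM classifier and return its test accuracy."""
    n_classes = int(max(y_train.max(), y_test.max())) + 1
    W = rng.normal(size=(X_train.shape[1], n_neurons))  # random input weights
    b = rng.normal(size=n_neurons)                       # random biases
    H = np.tanh(X_train @ W + b)                         # hidden-layer outputs
    T = np.eye(n_classes)[y_train]                       # one-hot targets
    # Ridge solution for the output weights.
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(n_neurons), H.T @ T)
    pred = np.argmax(np.tanh(X_test @ W + b) @ beta, axis=1)
    return float(np.mean(pred == y_test))


def learning_curve(X_train, y_train, X_test, y_test, n_neurons, alpha, rng, step=10):
    """Test accuracy for 10, 20, ... training samples taken in random order."""
    order = rng.permutation(len(X_train))
    sizes = list(range(step, len(X_train), step)) + [len(X_train)]
    scores = [
        elm_accuracy(X_train[order[:n]], y_train[order[:n]],
                     X_test, y_test, n_neurons, alpha, rng)
        for n in sizes
    ]
    return sizes, scores


def make_blobs(n, rng):
    """Synthetic two-class data, standing in for one client's data set."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_blobs(200, rng)
X_test, y_test = make_blobs(200, rng)

# Repeat with a different random order of training points each time.
curves = []
for _ in range(5):
    sizes, scores = learning_curve(X_train, y_train, X_test, y_test, 20, 1e-3, rng)
    curves.append(scores)
mean_curve = np.mean(curves, axis=0)
print(list(zip(sizes, np.round(mean_curve, 3))))
```

Plotting `sizes` against `mean_curve` gives the "performance vs. number of training data" plot for this client.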

akusok commented 10 months ago

test:

number of neurons: 10, 20, 30, 45, 70, 100
L2 alpha: 1e-2, 1e-3, 1e-4

For every number of training data points (10, 20, 30, 40, …), build 15 ELMs like this for every combination of (L2, neurons), average their accuracy per combination, then take the parameters with the best average accuracy.

    accuracy = {}
    for l in (10, 20, 30, 45, 70, 100):
        for L2 in (1e-2, 1e-3, 1e-4):
            accuracy[l …
            for run in range(15):
                accuracy … (rest shown in code notebook)
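
For reference, a self-contained sketch of this search. It is not the notebook code: `elm_accuracy` and `best_params` are hypothetical helpers, the data is synthetic, and scoring on a separate validation set (rather than the test set) is an assumption made here to avoid tuning on test data.

```python
import numpy as np


def elm_accuracy(X_train, y_train, X_test, y_test, n_neurons, alpha, rng):
    """Train a minimal ELM classifier and return its accuracy on X_test."""
    n_classes = int(max(y_train.max(), y_test.max())) + 1
    W = rng.normal(size=(X_train.shape[1], n_neurons))
    b = rng.normal(size=n_neurons)
    H = np.tanh(X_train @ W + b)
    T = np.eye(n_classes)[y_train]
    beta = np.linalg.solve(H.T @ H + alpha * np.eye(n_neurons), H.T @ T)
    pred = np.argmax(np.tanh(X_test @ W + b) @ beta, axis=1)
    return float(np.mean(pred == y_test))


def best_params(X_train, y_train, X_val, y_val, n_points, rng,
                neurons=(10, 20, 30, 45, 70, 100),
                alphas=(1e-2, 1e-3, 1e-4), n_runs=15):
    """Average accuracy over n_runs ELMs per (neurons, L2) pair; return the best pair."""
    accuracy = {}
    for n_neurons in neurons:
        for alpha in alphas:
            runs = []
            for _ in range(n_runs):
                # Fresh random subset of n_points training samples for each run.
                idx = rng.choice(len(X_train), size=n_points, replace=False)
                runs.append(elm_accuracy(X_train[idx], y_train[idx],
                                         X_val, y_val, n_neurons, alpha, rng))
            accuracy[(n_neurons, alpha)] = float(np.mean(runs))
    best = max(accuracy, key=accuracy.get)
    return best, accuracy


def make_blobs(n, rng):
    """Synthetic two-class data, only for the example."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 2.0, -2.0)
    return X, y


rng = np.random.default_rng(0)
X_train, y_train = make_blobs(200, rng)
X_val, y_val = make_blobs(200, rng)

best_by_size = {}
for n_points in (10, 20, 30, 40):
    best, accuracy = best_params(X_train, y_train, X_val, y_val, n_points, rng)
    best_by_size[n_points] = best
    print(n_points, best, round(accuracy[best], 3))
```

`best_by_size` maps each number of training points to its selected (neurons, L2) pair, which can then be used when computing the learning curve.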

akusok commented 10 months ago

Find the best ELM parameters for EACH number of training points: 10, 20, 30, 40, … Then use these parameters to produce the plots of performance vs. amount of training data.
